How accurate is 98 percent on Turnitin’s AI detection?


Source: Student Edge

On April 4, Turnitin released its AI detection software to 10,700 educational institutions in an effort to counter AI-assisted cheating. The new software, however, has falsely flagged human writing as AI-generated. Lucy Goetz, a high school student, had her essay on socialism flagged as AI-generated even though the writing was entirely her own. Her case sheds light on the potentially catastrophic consequences for students who are falsely accused. Teachers face a hard choice: ignoring AI use would let cheating surge, but relying on faulty detection technology puts innocent students in jeopardy.

Turnitin claims that false positives, such as the flag on Goetz's essay, occur only about one percent of the time. Verifying that claim is difficult. Plagiarism detection scours a database of existing content, comparing submissions against it with methods such as text fingerprinting and string matching; AI detection has no such source material to provide reference points, because each AI-generated text is produced fresh.
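For intuition, the plagiarism-matching side can be sketched with simple k-gram fingerprinting: hash every short substring of a normalized submission and compare the resulting hash sets against a reference text. This is a minimal illustration of the general technique, not Turnitin's actual pipeline, and the parameters here (MD5 hashes, k = 5) are arbitrary choices for the sketch.

```python
import hashlib


def fingerprints(text: str, k: int = 5) -> set[int]:
    """Hash every k-gram of the normalized text into a set of fingerprints."""
    # Normalize: lowercase and strip everything but letters and digits,
    # so trivial edits (punctuation, casing) don't defeat the match.
    normalized = "".join(ch for ch in text.lower() if ch.isalnum())
    return {
        int(hashlib.md5(normalized[i:i + k].encode()).hexdigest(), 16)
        for i in range(len(normalized) - k + 1)
    }


def overlap(a: str, b: str) -> float:
    """Jaccard similarity of the two texts' fingerprint sets (0.0 to 1.0)."""
    fa, fb = fingerprints(a), fingerprints(b)
    return len(fa & fb) / len(fa | fb) if fa | fb else 0.0
```

A high overlap score flags a submission for human review. The key point for the article: this only works because there is a concrete reference text to compare against, which AI-generated writing lacks.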

OpenAI, the company behind ChatGPT, developed its own AI detection tool but acknowledged that it correctly identified AI-generated text only 26 percent of the time. Computer scientists at the University of Maryland also studied methods of bypassing detection tools and found the detectors to be severely unreliable.

AI detection is difficult because it rests on statistical properties of the text itself rather than on comparison with a source. Eric Wang, Turnitin’s vice president of AI, explains that AI writing is “extremely consistently average.” A problem arises when human writers are consistently average as well. In particular fields, such as economics or STEM, writers follow standard styles and templates, and such writing is the most likely to be mislabeled as AI-generated.

As AI software develops at a breakneck pace, experts believe detectors are not a reliable long-term solution. One relevant knob is sampling temperature, which controls the randomness of a model’s predictions: low temperature yields safer, less varied text, while high temperature yields more diverse text at the cost of more errors. As models improve at higher temperatures, their output edges closer to human writing, making it even harder for detection algorithms to draw the line between human and AI. Meanwhile, Turnitin’s claim of 98 percent accuracy has not been backed by any published studies or experiments, and the company has not been transparent about how its detection works.
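Temperature sampling itself is simple to show: divide the model’s logits by the temperature before applying the softmax, then sample from the resulting distribution. The toy tokens and logit values below are made-up numbers purely for illustration.

```python
import math
import random


def sample_with_temperature(logits: dict[str, float], temperature: float,
                            rng: random.Random) -> str:
    """Sample one token from logits softmaxed at the given temperature.
    Low temperature sharpens the distribution toward the top token;
    high temperature flattens it, producing more varied picks."""
    scaled = {tok: l / temperature for tok, l in logits.items()}
    m = max(scaled.values())  # subtract max for numerical stability
    exps = {tok: math.exp(s - m) for tok, s in scaled.items()}
    z = sum(exps.values())
    # Inverse-CDF sampling over the normalized probabilities.
    r = rng.random()
    cum = 0.0
    for tok, e in exps.items():
        cum += e / z
        if r < cum:
            return tok
    return tok  # fallback for floating-point edge cases
```

At a temperature near zero the model almost always picks its single most likely token, producing the uniform, “average” text detectors key on; at high temperature its choices spread across the vocabulary and start to resemble the burstiness of human prose.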

There remains no clear-cut solution for accurately flagging AI-generated writing. Inevitably, some students will rely on AI for their work, and some innocent students will be punished due to the inaccuracies of software. Predictions and computer-generated confidence are not enough to keep pace with the development of AI.