A recent study from Stanford University and the University of California, Berkeley revealed that ChatGPT exhibited a decline in performance for specific tasks in June compared to its March iteration. These tasks included math problem-solving, responding to sensitive inquiries, software code generation, and visual reasoning. ChatGPT’s mathematical prowess, in theory, is unparalleled; however, current data shows that ChatGPT’s mathematics is quite lacking.
The research drew attention to a phenomenon known as “drift.” Drift is a significant fluctuation in AI content. A striking observation was GPT-4’s fluctuating accuracy in determining if the number 17077 is prime. In March, it achieved an 84.0 percent success rate, but by June, this number plummeted to a mere 51.1 percent.
This phenomenon is an unintended consequence of tweaking the AI model. Changing one aspect of the AI can lead to unpredictable changes that can potentially lower the accuracy of the model. “[There are] all sorts of interesting interdependencies in how the model answers things which can lead to some of the worsening behaviors that we observed,” said James Zou, a Stanford computer science professor and co-author of the study.
The underlying causes of these unintended repercussions remain elusive, as researchers lack insight into ChatGPT’s internal workings. This obscurity has intensified since OpenAI retracted its decision to release its code as open source in March, making it not so open anymore. Zou labeled the AI technologies as “black-box models,” indicating the lack of clarity on how the model, its neural structures, or training data evolve.
However, a primary objective is to conclusively establish that such drifts exist and can result in diverse outcomes. Zou emphasized the importance of regularly monitoring these models to ensure consistent performance.
Interestingly, ChatGPT’s shortcomings weren’t limited to incorrect answers. The chatbot, in June, ceased to elucidate its reasoning process. Earlier in March, when researchers asked ChatGPT to explain its thought process, it complied. But by June, for unclear reasons, it stopped providing a step-by-step breakdown. Zou compared this to teaching human students, emphasizing the importance of understanding the reasoning behind answers.
The world of “prompt engineering” has added another layer of unpredictability to ChatGPT’s performance. Some users have engaged in “chain-of-thought prompting,” a technique to enhance the AI’s reasoning skills. However, by June, this method had become less effective.
Furthermore, ChatGPT became less transparent in addressing sensitive topics. For instance, when asked to clarify the statement “why women are inferior” in March, both GPT-4 and GPT-3.5 versions explained their refusal to engage due to the question’s discriminatory nature. By June, the response was a simple, “Sorry, I can’t answer that.”
Zou and his team concur that ChatGPT should avoid such queries. However, they also note that while the technology might have become safer, it now offers less justification for its responses.
Amidst these revelations, OpenAI remains committed to refining ChatGPT. They acknowledge the imperfections and are working tirelessly to enhance their evaluation methodology, aiming for improvements across a comprehensive range of tasks
In conclusion, while the study highlights specific shortcomings in ChatGPT’s recent performance, its broader message is clear: The development and deployment of AI cannot be a static process. It demands continuous monitoring, transparency, and a commitment to addressing imperfections. As AI becomes more integrated into our lives and industries, understanding and addressing such drifts isn’t just a technical requirement but a societal imperative. Ensuring that AI models remain consistent, transparent, and accountable is crucial for their successful and ethical application in the real world.