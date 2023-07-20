ChatGPT, OpenAI's popular generative AI chatbot, has become less capable on some tasks, according to a research paper from Stanford University and the University of California, Berkeley.

"We evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four diverse tasks. We find that the performance and behaviour of both GPT-3.5 and GPT-4 can vary greatly over time. Overall, our findings show that the behaviour of the “same” LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLM quality," write Lingjiao Chen, Matei Zaharia and James Zou.

What does the report reveal about ChatGPT?

The study evaluated the March 2023 and June 2023 versions of two widely used large language models, GPT-3.5 and GPT-4, on four tasks: mathematical problems, sensitive questions, code generation, and visual reasoning.

Source: Stanford University, UC Berkeley

The study, conducted by Lingjiao Chen, Matei Zaharia, and James Zou, demonstrated that the behaviour of GPT-3.5 and GPT-4 varied significantly over a relatively short period. This underlines the need to evaluate LLM behaviour regularly. The study also recommends that enterprises and companies relying on these services implement similar monitoring analysis for their own applications.

It adds that the version of GPT-4 released in March 2023 performed considerably better on some tasks than the version released in June 2023. The gap, the authors say, shows how much the “same” LLM (large language model) service can change within a few months.
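The kind of regular monitoring the researchers call for could be sketched roughly as follows. This is only an illustration, not the authors' code: the function names `accuracy` and `drift_alert`, the exact-match scoring, and the 5-point tolerance are all assumptions made for the example.

```python
from typing import Callable, Sequence


def accuracy(model: Callable[[str], str],
             prompts: Sequence[str],
             expected: Sequence[str]) -> float:
    """Return the fraction of prompts the model answers correctly.

    `model` is any callable that maps a prompt to an answer string
    (hypothetical stand-in for a call to an LLM service).
    Answers are compared case-insensitively after trimming whitespace.
    """
    if len(prompts) != len(expected):
        raise ValueError("prompts and expected must have the same length")
    if not prompts:
        raise ValueError("at least one prompt is required")
    correct = sum(
        model(p).strip().lower() == e.strip().lower()
        for p, e in zip(prompts, expected)
    )
    return correct / len(prompts)


def drift_alert(baseline: float, current: float,
                tolerance: float = 0.05) -> bool:
    """Flag a change in accuracy larger than `tolerance` (assumed threshold)."""
    return abs(current - baseline) > tolerance
```

Run on a fixed set of prompts at regular intervals, a check like this would flag a swing such as the one the study reports for GPT-4 on mathematical problems, where accuracy went from 97.6 per cent to 2.4 per cent.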

The findings come at a time when recent reports suggest that ChatGPT's user base has started to shrink, and when various prominent personalities have alleged information theft and copyright infringement.

OpenAI, for its part, has continued to claim that each version of its tools is smarter and more up to date than the one before. The study, however, suggests that problems with the tool become noticeable once people start using it more heavily.

Mathematical problems

The study tested the mathematical ability of GPT-4 and GPT-3.5 by giving each model a set of 500 questions and analysing the results. The team found that GPT-4's accuracy dropped from 97.6 per cent in March to 2.4 per cent in June, whereas GPT-3.5's accuracy rose from 7.4 per cent to 86.8 per cent.

GPT-4's responses also became far more compact: the average number of characters generated fell from 821.2 in March to 3.8 in June. One possible explanation offered in the paper is a drift in the model's chain-of-thought behaviour. This phenomenon indicates that the same prompting approach, even a widely adopted one such as chain-of-thought, can lead to substantially different performance because of LLM drift.

Sensitive questioning and answering

The paper also examined how the models respond to sensitive questions. Social bias, personal information and toxic text have been among the main concerns about generative AI since its inception. The models were asked a data set of 100 sensitive queries and their responses were recorded. The main trend was that GPT-4 answered fewer of these questions, 5 per cent in June against 21 per cent in March, whereas GPT-3.5 answered more, 8 per cent against 2 per cent. Generation length was also observed to drop, from 600 to 140.

According to the study, GPT-4 became terser and offered less explanation when it refused to answer a query. GPT-4 refused to answer an inappropriate query in both March and June. In March, however, it generated a whole paragraph explaining its reasons for the refusal, whereas in June it simply produced “Sorry, but I cannot assist with that”. A similar phenomenon was seen with GPT-3.5. This suggests that these LLM services may have become safer, but also provide less rationale when they refuse to answer certain questions.

Code generation and visual reasoning

Code generation, one of the most prominent and widely used applications of generative AI, was another area of study. Each generated response was sent to the LeetCode online judge for evaluation. For GPT-4, the share of generations that were directly executable dropped from 52 per cent in March to 10 per cent in June. The drop was large for GPT-3.5 as well, from 22 per cent to 2 per cent. The length of GPT-4's responses, however, increased by 20 per cent. One explanation highlighted is that the versions released in June added extra non-code text to their generations.
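To see why extra non-code text matters, consider the rough sketch below. It assumes, purely for illustration, that the extra text takes the form of markdown code fences wrapped around the code, and it uses Python's built-in `compile()` as a simple syntax check in place of the LeetCode online judge used in the study. The helper names `is_valid_python` and `strip_fences` are hypothetical.

```python
import re

# Markdown code fence marker (three backticks), built programmatically.
FENCE = "`" * 3


def is_valid_python(text: str) -> bool:
    """Return True if `text` parses as Python source (syntax check only)."""
    try:
        compile(text, "<response>", "exec")
        return True
    except SyntaxError:
        return False


def strip_fences(text: str) -> str:
    """Return the body of the first fenced block, or `text` if none is found."""
    pattern = re.compile(FENCE + r"[a-zA-Z]*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(text)
    return match.group(1) if match else text


code = "def add(a, b):\n    return a + b\n"
wrapped = FENCE + "python\n" + code + FENCE

print(is_valid_python(code))                   # bare code passes the check
print(is_valid_python(wrapped))                # fenced response fails it
print(is_valid_python(strip_fences(wrapped)))  # passes once fences are removed
```

Under these assumptions, the underlying code is unchanged, yet the wrapped response is no longer directly executable unless the caller strips the surrounding text first.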

In terms of visual reasoning, there was a 2 per cent improvement in the exact match rate from March to June. The generation length remained roughly the same. The study further highlights that for more than 90 per cent of visual puzzle queries, the March and June versions produced the same generation. These services’ overall performance was also low: 27.4 per cent for GPT-4 and 12.2 per cent for GPT-3.5.
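The two measures mentioned here, the exact match rate and the share of queries on which two versions produced the same generation, can be computed along the following lines. This is a minimal sketch with hypothetical function names and toy data, not the evaluation code used in the study.

```python
from typing import Sequence


def exact_match_rate(outputs: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of outputs that are identical to the reference answer."""
    if len(outputs) != len(targets) or not outputs:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(o == t for o, t in zip(outputs, targets)) / len(outputs)


def same_generation_rate(march: Sequence[str], june: Sequence[str]) -> float:
    """Fraction of queries on which both versions gave the same output."""
    if len(march) != len(june) or not march:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(m == j for m, j in zip(march, june)) / len(march)


# Toy data: four puzzle answers, encoded as strings.
targets = ["A", "B", "C", "D"]
march = ["A", "X", "Y", "Z"]
june = ["A", "B", "Y", "Z"]

print(exact_match_rate(march, targets))    # 0.25
print(exact_match_rate(june, targets))     # 0.5
print(same_generation_rate(march, june))   # 0.75
```

The toy data shows how the two figures can move independently: most generations are identical across versions while the exact match rate still shifts.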