Here are the main findings from the researchers after evaluating the performance of the March 2023 and June 2023 versions of GPT-4 and GPT-3.5 on the four types of tasks highlighted above:

"In a nutshell, there are many interesting performance shifts over time. For example, GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%) but GPT-4 (June 2023) was very poor on these same questions (accuracy 2.4%). Interestingly, GPT-3.5 (June 2023) was much better than GPT-3.5 (March 2023) on this task. We hope releasing the datasets and generations can help the community to understand how LLM services drift better."

Stanford Researchers' Performance Analysis

First up, both models were tasked with solving a math problem, with the researchers closely monitoring the accuracy and answer overlap of GPT-4 and GPT-3.5 between the March and June versions of the models (a sketch of how such a comparison can be scripted appears at the end of this piece). The performance drift was apparent: in March, GPT-4 followed the chain-of-thought prompt and ultimately gave the correct answer. However, the same result could not be replicated in June, as the model skipped the chain-of-thought instruction and gave the wrong response outright.

As for GPT-3.5, it stuck to the chain-of-thought format but initially gave the wrong answer. The issue was patched by June, with the model showing improved performance.

"GPT-4's accuracy dropped from 97.6% in March to 2.4% in June, and there was a large improvement of GPT-3.5's accuracy, from 7.4% to 86.8%. In addition, GPT-4's response became much more compact: its average verbosity (number of generated characters) decreased from 821.2 in March to 3.8 in June. On the other hand, there was about 40% growth in GPT-3.5's response length. The answer overlap between their March and June versions was also small for both services," the Stanford researchers stated. They further attributed the disparities to the effects of chain-of-thought drift.

Both LLMs gave detailed responses in March when asked sensitive questions, citing their inability to respond to prompts with traces of discrimination. In June, however, both models flatly refused to respond to the same query.

Users in the r/ChatGPT community on Reddit expressed a cocktail of feelings and theories about the report's key findings, as highlighted below:

"OpenAI is trying to lessen the costs of running ChatGPT, since they are losing a lot of money. So they are tweaking GPT to provide the same quality answers with less resources and test them a lot. If they see regressions, they roll back and try something different. So in their view, it didn't get any dumber, but it did get a lot cheaper. Problem is, no test is completely comprehensive, and it surely would help if they expanded a bit on the testing suite. So while it's the same on their tests, it may be much worse on other tests, like those in the paper. That's why we also see the variation in feedback based on use case - some can swear it's the same; for others, it got terrible." - Tucpek, Reddit

More benchmarks need to be conducted to study these trends. It is still too early to determine how accurate this study is.
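For readers who want to probe this kind of drift themselves, here is a minimal Python sketch of the comparison described above: it asks two dated GPT-4 snapshots ("gpt-4-0314" and "gpt-4-0613", corresponding to the March and June versions) the same yes/no questions with a chain-of-thought-style prompt, then computes accuracy, March/June answer overlap, and average verbosity over the collected responses. The prompt wording, the [Yes]/[No] extraction rule, the helper names, and the sample data are illustrative assumptions, not the Stanford paper's exact protocol; the API call uses the legacy openai Python SDK (pre-1.0).

```python
# Sketch only: prompt wording, extraction rule, and sample data are
# illustrative assumptions, not the Stanford paper's exact protocol.
import re

MARCH, JUNE = "gpt-4-0314", "gpt-4-0613"  # dated 2023 API snapshots

def ask(model: str, question: str) -> str:
    """Query one snapshot with a chain-of-thought-style prompt (temperature 0)."""
    import openai  # legacy SDK (openai<1.0); imported here so the offline demo runs without it
    prompt = f"{question} Think step by step and then answer [Yes] or [No]."
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp["choices"][0]["message"]["content"]

def extract_answer(response: str) -> str:
    """Pull the final [Yes]/[No] verdict out of a free-form response."""
    hits = re.findall(r"\[(Yes|No)\]", response, flags=re.IGNORECASE)
    return hits[-1].capitalize() if hits else ""

def compare(march_responses, june_responses, labels):
    """Per-version accuracy, March/June answer overlap, and mean verbosity
    (characters per response), mirroring the metrics quoted in the article."""
    march = [extract_answer(r) for r in march_responses]
    june = [extract_answer(r) for r in june_responses]
    n = len(labels)
    return {
        "march_accuracy": sum(a == y for a, y in zip(march, labels)) / n,
        "june_accuracy": sum(a == y for a, y in zip(june, labels)) / n,
        "answer_overlap": sum(a == b for a, b in zip(march, june)) / n,
        "march_verbosity": sum(len(r) for r in march_responses) / n,
        "june_verbosity": sum(len(r) for r in june_responses) / n,
    }

if __name__ == "__main__":
    # Toy stand-ins for real API transcripts so the metrics run offline;
    # in practice each list would come from ask(MARCH, q) / ask(JUNE, q).
    march_responses = ["17077 is not divisible by any prime up to 130, so [Yes]"]
    june_responses = ["[No]"]
    print(compare(march_responses, june_responses, labels=["Yes"]))
```

Note that answer overlap ignores correctness: it only measures whether the two snapshots agree with each other, which is how a service can look stable on a vendor's internal tests while still drifting noticeably for specific users.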