Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams
AI-generated Key Points
- Language Models (LMs) are being studied for their capabilities in tackling high-stakes multiple-choice tests
- The study analyzed responses generated by GPT-3.5 and GPT-4 models for questions presented in the Exame Nacional do Ensino Médio (ENEM), a multidisciplinary entrance examination widely adopted by Brazilian universities
- Different prompt strategies were tested, including the use of Chain-of-Thought (CoT) prompts to generate explanations for answers
- The best-performing model on the 2022 edition was GPT-4 with CoT, which achieved an accuracy of 87%, surpassing GPT-3.5 by 11 percentage points
- Comparisons were made with other studies that applied LMs to different domains, such as the United States Bar Examination and the Certified Public Accountants (CPA) Examination. Those studies found that GPT-3.5 falls significantly short of human performance on analytical quantitative reasoning questions but performs comparably to humans on questions that demand remembering, understanding, and applying knowledge
- Other researchers fine-tuned PaLM on medical question-answering examples, resulting in Med-PaLM, which was evaluated on questions from the United States Medical Licensing Examination (USMLE) and provided answers in agreement with scientific consensus for 92.6% of the questions
- The study used two evaluation datasets: the ENEM Challenge, comprising 1754 ENEM questions from 2009 to 2017, and ENEM 2022. Questions requiring image comprehension (IC) or mathematical reasoning (MR), or containing chemical elements (CE), were eliminated
- Code and data used in the experiments are available at https://github.com/piresramon/gpt-4-enem
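The question filtering mentioned in the key points can be pictured with a short Python sketch. This is only an illustration under stated assumptions: the `Question` class, its `tags` field, and the `keep_text_only` helper are hypothetical names and do not come from the authors' repository. Only the tag abbreviations IC, MR, and CE are taken from the summary.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Tags named in the summary: image comprehension, mathematical reasoning,
# chemical elements. Questions carrying any of them are dropped.
EXCLUDED_TAGS = frozenset({"IC", "MR", "CE"})


@dataclass
class Question:
    """Hypothetical record for one exam question (not the authors' schema)."""
    year: int
    text: str
    tags: set[str] = field(default_factory=set)


def keep_text_only(questions: list[Question]) -> list[Question]:
    """Return the questions that carry none of the excluded tags."""
    return [q for q in questions if not (q.tags & EXCLUDED_TAGS)]
```

A question tagged with several labels is removed as soon as one of them is in `EXCLUDED_TAGS`; untagged questions are always kept.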
Authors: Desnes Nunes, Ricardo Primi, Ramon Pires, Roberto Lotufo, Rodrigo Nogueira
Abstract: The present study aims to explore the capabilities of Language Models (LMs) in tackling high-stakes multiple-choice tests, represented here by the Exame Nacional do Ensino Médio (ENEM), a multidisciplinary entrance examination widely adopted by Brazilian universities. This exam poses challenging tasks for LMs, since its questions may span into multiple fields of knowledge, requiring understanding of information from diverse domains. For instance, a question may require comprehension of both statistics and biology to be solved. This work analyzed responses generated by GPT-3.5 and GPT-4 models for questions presented in the 2009-2017 exams, as well as for questions of the 2022 exam, which were made public after the training of the models was completed. Furthermore, different prompt strategies were tested, including the use of Chain-of-Thought (CoT) prompts to generate explanations for answers. On the 2022 edition, the best-performing model, GPT-4 with CoT, achieved an accuracy of 87%, largely surpassing GPT-3.5 by 11 points. The code and data used on experiments are available at https://github.com/piresramon/gpt-4-enem.