Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams

AI-generated keywords: Language Models, ENEM, GPT-3.5, GPT-4, Chain-of-Thought

AI-generated Key Points

  • Language Models (LMs) are being studied for their capabilities in tackling high-stakes multiple-choice tests
  • The study analyzed responses generated by GPT-3.5 and GPT-4 models for questions presented in the Exame Nacional do Ensino Médio (ENEM), a multidisciplinary entrance examination widely adopted by Brazilian universities
  • Different prompt strategies were tested, including the use of Chain-of-Thought (CoT) prompts to generate explanations for answers
  • The best-performing model on the 2022 edition was GPT-4 with CoT, which achieved an accuracy of 87%, surpassing GPT-3.5 by 11 points
  • Comparisons were made with other studies that applied LMs to different domains, such as the United States Bar Examination and the Certified Public Accountants (CPA) Examination, where GPT-3.5 was found to fall significantly short of human performance on analytical quantitative reasoning questions but to perform comparably to humans on questions that demand remembering, understanding and applying knowledge
  • Researchers also fine-tuned PaLM on medical question-answering examples, resulting in Med-PaLM, which was evaluated on questions from the United States Medical Licensing Examination (USMLE) and provided answers in agreement with scientific consensus for 92.6% of the questions
  • The study used two evaluation datasets: the ENEM Challenge, comprising 1754 ENEM questions from 2009 to 2017, and ENEM 2022. Questions requiring image comprehension (IC) or mathematical reasoning (MR), or containing chemical elements (CE), were eliminated
  • Code and data used in the experiments are available at https://github.com/piresramon/gpt-4-enem
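As a rough illustration of the dataset filtering and accuracy metric mentioned in the key points, the sketch below drops questions tagged IC, MR or CE and then scores predictions on what remains. The `Question` class, its field names and the sample data are hypothetical and are not taken from the paper's repository, whose actual data format may differ.

```python
from dataclasses import dataclass, field

# Tags named in the summary: image comprehension, mathematical
# reasoning, chemical elements.
EXCLUDED_TAGS = {"IC", "MR", "CE"}


@dataclass
class Question:
    # Hypothetical schema, for illustration only.
    qid: str
    answer: str
    tags: set = field(default_factory=set)


def filter_questions(questions):
    """Keep only questions that carry none of the excluded tags."""
    return [q for q in questions if not (q.tags & EXCLUDED_TAGS)]


def accuracy(questions, predictions):
    """Fraction of questions whose predicted letter matches the gold answer."""
    if not questions:
        return 0.0
    correct = sum(1 for q in questions if predictions.get(q.qid) == q.answer)
    return correct / len(questions)


# Made-up sample data, not real ENEM questions or model outputs.
questions = [
    Question("q1", "A"),
    Question("q2", "C", {"IC"}),
    Question("q3", "E", {"MR", "CE"}),
    Question("q4", "B"),
]
predictions = {"q1": "A", "q2": "C", "q3": "D", "q4": "D"}

kept = filter_questions(questions)
print([q.qid for q in kept])
print(accuracy(kept, predictions))
```

With this toy data, only `q1` and `q4` survive the filter, and one of the two predictions is correct.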

Authors: Desnes Nunes, Ricardo Primi, Ramon Pires, Roberto Lotufo, Rodrigo Nogueira

License: CC BY 4.0

Abstract: The present study aims to explore the capabilities of Language Models (LMs) in tackling high-stakes multiple-choice tests, represented here by the Exame Nacional do Ensino Médio (ENEM), a multidisciplinary entrance examination widely adopted by Brazilian universities. This exam poses challenging tasks for LMs, since its questions may span into multiple fields of knowledge, requiring understanding of information from diverse domains. For instance, a question may require comprehension of both statistics and biology to be solved. This work analyzed responses generated by GPT-3.5 and GPT-4 models for questions presented in the 2009-2017 exams, as well as for questions of the 2022 exam, which were made public after the training of the models was completed. Furthermore, different prompt strategies were tested, including the use of Chain-of-Thought (CoT) prompts to generate explanations for answers. On the 2022 edition, the best-performing model, GPT-4 with CoT, achieved an accuracy of 87%, largely surpassing GPT-3.5 by 11 points. The code and data used on experiments are available at https://github.com/piresramon/gpt-4-enem.

Submitted to arXiv on 29 Mar. 2023

Results of the summarizing process for the arXiv paper: 2303.17003v1

The present study explores the capabilities of Language Models (LMs) in tackling high-stakes multiple-choice tests. The Exame Nacional do Ensino Médio (ENEM), a multidisciplinary entrance examination widely adopted by Brazilian universities, poses challenging tasks for LMs because its questions may span multiple fields of knowledge, requiring understanding of information from diverse domains. The study analyzed responses generated by GPT-3.5 and GPT-4 models for questions presented in the 2009-2017 exams, as well as for questions of the 2022 exam, which were made public after the training of the models was completed.

Different prompt strategies were tested, including the use of Chain-of-Thought (CoT) prompts to generate explanations for answers. For experiments on the ENEM 2022 dataset, few-shot prompts were used with three examples from different knowledge areas to induce the model to generate responses in an expected format. Few-shot prompts with CoT were also investigated as a way to improve performance. The best-performing model on the 2022 edition was GPT-4 with CoT, which achieved an accuracy of 87%, surpassing GPT-3.5 by 11 points. The code and data used in the experiments are available at https://github.com/piresramon/gpt-4-enem.

Comparisons were made with other studies that applied LMs to different domains, such as the United States Bar Examination and the Certified Public Accountants (CPA) Examination, where GPT-3.5 was found to fall significantly short of human performance on analytical quantitative reasoning questions but to perform comparably to humans on questions that demand remembering, understanding and applying knowledge. Researchers also fine-tuned PaLM on medical question-answering examples, resulting in Med-PaLM, which was evaluated on questions from the United States Medical Licensing Examination (USMLE) and provided answers in agreement with scientific consensus for 92.6% of the questions. GPT-4 demonstrated performance comparable to humans across multiple professional and academic benchmarks, such as scoring within the top 10% of participants on a simulated bar exam. It also largely surpasses Med-PaLM on a version of the USMLE benchmark.

The study used two evaluation datasets: the ENEM Challenge, comprising 1754 ENEM questions from 2009 to 2017, and ENEM 2022. Questions requiring image comprehension (IC) or mathematical reasoning (MR), or containing chemical elements (CE), were eliminated.
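The few-shot setup described above (three solved examples from different knowledge areas, optionally with a Chain-of-Thought explanation before each answer) can be sketched roughly as follows. The field labels, the function names `format_question` and `build_prompt`, and the placeholder questions are all assumptions made for illustration; the exact prompt wording used in the paper is not reproduced here.

```python
def format_question(stem, options, explanation=None, answer=None, use_cot=True):
    """Render one question; leave the last field open if no answer is given."""
    lines = [f"Question: {stem}"]
    for letter, text in options.items():
        lines.append(f"{letter}. {text}")
    if answer is None:
        # Unsolved target: end on the field the model should complete.
        lines.append("Explanation:" if use_cot else "Answer:")
    else:
        if use_cot:
            lines.append(f"Explanation: {explanation}")
        lines.append(f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(examples, target, use_cot=True):
    """Concatenate the solved examples followed by the unsolved target."""
    blocks = [format_question(**ex, use_cot=use_cot) for ex in examples]
    blocks.append(format_question(target["stem"], target["options"], use_cot=use_cot))
    return "\n\n".join(blocks)


# Placeholder examples, one per knowledge area; not real ENEM questions.
EXAMPLES = [
    {"stem": "Which word is a noun?", "options": {"A": "run", "B": "house"},
     "explanation": "A house is a thing, so it is a noun.", "answer": "B"},
    {"stem": "What is 2 + 3?", "options": {"A": "5", "B": "6"},
     "explanation": "Adding 2 and 3 gives 5.", "answer": "A"},
    {"stem": "Which gas do plants absorb?", "options": {"A": "CO2", "B": "O2"},
     "explanation": "Photosynthesis consumes carbon dioxide.", "answer": "A"},
]
TARGET = {"stem": "Which planet is closest to the Sun?",
          "options": {"A": "Mercury", "B": "Venus"}}

print(build_prompt(EXAMPLES, TARGET, use_cot=True))
```

Passing `use_cot=False` omits the explanations and ends the prompt on `Answer:`, which gives the plain few-shot variant for comparison.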
Created on 12 Apr. 2023

