When do you need Chain-of-Thought Prompting for ChatGPT?

AI-generated keywords: Chain-of-Thought, ChatGPT, Large Language Model, Instruction Finetuning, Pretraining

AI-generated Key Points

  • Study explores effectiveness of Chain-of-Thought (CoT) prompting on ChatGPT
  • CoT has been shown to improve accuracy on reasoning tasks for LLMs like GPT-3
  • Unclear if CoT is still effective on ChatGPT
  • CoT is no longer effective for certain tasks like arithmetic reasoning on ChatGPT
  • On the tasks where CoT no longer helps, ChatGPT often achieves the best performance and can generate CoT without explicit instructions
  • Suggests ChatGPT may have already been trained with CoT and implicitly memorized the instruction
  • Analysis highlights risk of overfitting or bias toward instructions introduced by instruction finetuning (IFT) of LLMs
  • Indicates possible leakage of pretraining recipe allowing verification of dataset and instruction used in training ChatGPT
  • Experiments provide new baseline results for ChatGPT on various reasoning tasks
  • Offers insights into LLM profiling, instruction memorization, and pretraining dataset leakage

Authors: Jiuhai Chen, Lichang Chen, Heng Huang, Tianyi Zhou

License: CC BY-NC-SA 4.0

Abstract: Chain-of-Thought (CoT) prompting can effectively elicit complex multi-step reasoning from Large Language Models (LLMs). For example, by simply adding the CoT instruction "Let's think step-by-step" to each input query of the MultiArith dataset, GPT-3's accuracy can be improved from 17.7% to 78.7%. However, it is not clear whether CoT is still effective on more recent instruction finetuned (IFT) LLMs such as ChatGPT. Surprisingly, on ChatGPT, CoT is no longer effective for certain tasks such as arithmetic reasoning, while remaining effective on other reasoning tasks. Moreover, on the former tasks, ChatGPT usually achieves the best performance and can generate CoT even without being instructed to do so. Hence, it is plausible that ChatGPT has already been trained on these tasks with CoT and has thus memorized the instruction, so that it implicitly follows the instruction when applied to the same queries, even without CoT. Our analysis reflects a potential risk of overfitting or bias toward instructions introduced in IFT, which is becoming more common in training LLMs. In addition, it indicates possible leakage of the pretraining recipe, e.g., one can verify whether a dataset and instruction were used in training ChatGPT. Our experiments report new baseline results of ChatGPT on a variety of reasoning tasks and shed novel insights into LLM profiling, instruction memorization, and pretraining dataset leakage.
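
The two prompting conditions compared in the abstract (a plain query versus the same query with the CoT instruction appended) can be sketched as follows. This is a minimal illustration, not the authors' code: the "Q:/A:" layout and the names `build_prompt` and `accuracy_gain` are assumptions introduced here; only the trigger phrase and the two GPT-3 accuracy figures come from the abstract.

```python
# Trigger phrase quoted in the abstract.
COT_TRIGGER = "Let's think step-by-step"


def build_prompt(question: str, use_cot: bool) -> str:
    """Build a zero-shot prompt, optionally ending with the CoT instruction.

    The "Q: ... A:" layout is an assumption made for illustration; the
    abstract only says the instruction is added to each input query.
    """
    prompt = f"Q: {question.strip()}\nA:"
    if use_cot:
        prompt += f" {COT_TRIGGER}."
    return prompt


def accuracy_gain(acc_without: float, acc_with: float) -> float:
    """Absolute accuracy gain in percentage points, rounded to one decimal."""
    return round(acc_with - acc_without, 1)


if __name__ == "__main__":
    question = "What is 2 + 3?"  # toy query, not taken from MultiArith
    print(build_prompt(question, use_cot=False))
    print(build_prompt(question, use_cot=True))
    # GPT-3 on MultiArith, figures as reported in the abstract.
    print(accuracy_gain(17.7, 78.7), "percentage points")
```

Running the same set of queries through both prompt variants and comparing accuracy is the kind of check that would show whether the instruction still helps on a given task.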

Submitted to arXiv on 06 Apr. 2023

Results of the summarizing process for the arXiv paper: 2304.03262v2

The study explores the effectiveness of Chain-of-Thought (CoT) prompting on ChatGPT, a more recent instruction finetuned (IFT) Large Language Model (LLM). CoT prompting has been shown to improve accuracy on reasoning tasks for LLMs like GPT-3. However, it is unclear whether CoT is still effective on ChatGPT. Surprisingly, CoT is no longer effective on ChatGPT for certain tasks such as arithmetic reasoning, while remaining effective for other reasoning tasks. On the tasks where CoT no longer helps, ChatGPT often achieves the best performance and can generate CoT even without explicit instructions. This suggests that ChatGPT may have already been trained on these tasks with CoT and has implicitly memorized the instruction. The analysis highlights the potential risk of overfitting or bias toward instructions introduced during IFT, which is becoming more common in training LLMs. It also indicates possible leakage of the pretraining recipe, allowing verification of whether a dataset and instruction were used in training ChatGPT. The experiments conducted in this study provide new baseline results for ChatGPT on various reasoning tasks and offer novel insights into LLM profiling, instruction memorization, and pretraining dataset leakage.
Created on 02 Aug. 2023
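
The summary above notes that ChatGPT can produce CoT without being instructed to. One rough way to quantify that behavior is to count how often responses to plain (no-CoT) prompts contain step-by-step markers. The sketch below is a hypothetical heuristic, not the paper's method: the marker list, the threshold, and the function names `looks_like_cot` and `spontaneous_cot_rate` are all assumptions chosen for illustration.

```python
import re

# Illustrative reasoning markers (an assumption, not taken from the paper).
STEP_MARKERS = re.compile(
    r"\b(?:first|then|next|therefore|step \d+)\b", re.IGNORECASE
)


def looks_like_cot(response: str, min_markers: int = 2) -> bool:
    """Heuristically flag a response as step-by-step reasoning."""
    return len(STEP_MARKERS.findall(response)) >= min_markers


def spontaneous_cot_rate(responses: list[str]) -> float:
    """Fraction of responses to no-CoT prompts that still look like CoT."""
    if not responses:
        return 0.0
    return sum(looks_like_cot(r) for r in responses) / len(responses)


if __name__ == "__main__":
    # Toy responses written for this example, not real model output.
    toy_responses = [
        "The answer is 5.",
        "First, add 2 and 3. Then the total is 5. Therefore the answer is 5.",
    ]
    print(spontaneous_cot_rate(toy_responses))
```

A high rate on one dataset and a low rate on another would be consistent with the memorization hypothesis, though a keyword heuristic like this can misclassify responses and would need manual spot checks.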
