LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

AI-generated keywords: Large Language Model

AI-generated Key Points

  • Recent advancements in LLM-driven chat assistant systems have integrated memory components for more accurate and personalized responses.
  • LongMemEval is a benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.
  • Commercial chat assistants and long-context LLMs showed a 30% accuracy drop in memorizing information across sustained interactions when tested on LongMemEval.
  • A unified framework breaks long-term memory design into four design choices across the indexing, retrieval, and reading stages.
  • Memory design optimizations such as session decomposition, fact-augmented key expansion, and time-aware query expansion significantly improved memory recall and downstream question answering on LongMemEval.
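
To make the indexing ideas in the key points concrete, here is a minimal sketch of session decomposition (storing one memory item per user-assistant round rather than per session) and fact-augmented key expansion (appending extracted facts to the text that retrieval matches against). It is an illustration under stated assumptions, not the paper's implementation: the names `MemoryItem`, `build_index`, `toy_extract_facts`, and `retrieve` are hypothetical, the fact extractor is a hard-coded stand-in, and plain token overlap replaces a real retriever.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Session = List[Tuple[str, str]]  # (user turn, assistant turn) pairs


@dataclass
class MemoryItem:
    session_id: int
    key: str    # text that retrieval matches against
    value: str  # text handed to the reader model


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_extract_facts(user_turn: str) -> List[str]:
    # Hypothetical stand-in for a fact extractor (assumption: a real
    # system would likely use a model here, not a keyword check).
    return ["The user owns a dog."] if "beagle" in user_turn.lower() else []


def build_index(
    sessions: List[Session],
    extract_facts: Optional[Callable[[str], List[str]]] = None,
) -> List[MemoryItem]:
    items = []
    for sid, session in enumerate(sessions):
        # Session decomposition: one memory item per round.
        for user_turn, assistant_turn in session:
            value = f"user: {user_turn}\nassistant: {assistant_turn}"
            facts = extract_facts(user_turn) if extract_facts else []
            # Key expansion: the key is the round text plus extracted facts.
            key = " ".join([value, *facts])
            items.append(MemoryItem(sid, key, value))
    return items


def retrieve(query: str, items: List[MemoryItem], top_k: int = 1) -> List[MemoryItem]:
    """Rank items by token overlap between the query and each key."""
    q = tokenize(query)
    ranked = sorted(items, key=lambda it: len(q & tokenize(it.key)), reverse=True)
    return ranked[:top_k]


sessions = [
    [("Can you suggest a pasta recipe?", "Try cacio e pepe.")],
    [("I just adopted a beagle named Pixel.", "Congratulations! Beagles are energetic."),
     ("Any tips for crate training?", "Start with short sessions.")],
]
query = "What kind of dog does the user own?"

plain_index = build_index(sessions)
expanded_index = build_index(sessions, toy_extract_facts)
best = retrieve(query, expanded_index)[0]
print(best.value)
```

With this toy scorer, the query shares no content words with any raw round, so the plain index cannot separate the rounds, while the expanded key lets the beagle round rank first. The reader still receives the original round text (`value`), since only the key is augmented.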

Authors: Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, Dong Yu

License: CC BY 4.0

Abstract: Recent large language model (LLM)-driven chat assistant systems have integrated memory components to track user-assistant chat histories, enabling more accurate and personalized responses. However, their long-term memory capabilities in sustained interactions remain underexplored. This paper introduces LongMemEval, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LongMemEval presents a significant challenge to existing long-term memory systems, with commercial chat assistants and long-context LLMs showing a 30% accuracy drop on memorizing information across sustained interactions. We then present a unified framework that breaks down the long-term memory design into four design choices across the indexing, retrieval, and reading stages. Built upon key experimental insights, we propose several memory designs including session decomposition for optimizing value granularity, fact-augmented key expansion for enhancing the index structure, and time-aware query expansion for refining the search scope. Experimental results show that these optimizations greatly improve both memory recall and downstream question answering on LongMemEval. Overall, our study provides valuable resources and guidance for advancing the long-term memory capabilities of LLM-based chat assistants, paving the way toward more personalized and reliable conversational AI.
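
The abstract describes time-aware query expansion as refining the search scope. One plausible reading, sketched below, is to infer a time range from the question and restrict retrieval to memory items whose timestamps fall inside it. This is an assumption-laden illustration rather than the authors' method: `TimedItem`, `infer_time_range`, and `retrieve` are hypothetical names, the range inference only handles phrases of the form "in <Month> <Year>", and token overlap stands in for a real retriever.

```python
import calendar
import re
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]


@dataclass
class TimedItem:
    when: date  # timestamp of the session the text came from
    text: str


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def infer_time_range(query: str) -> Optional[Tuple[date, date]]:
    # Hypothetical stand-in: only recognizes "in <Month> <Year>".
    # A real system would need a far more general time parser.
    m = re.search(r"\bin (" + "|".join(MONTHS) + r") (\d{4})\b", query.lower())
    if m is None:
        return None
    month, year = MONTHS.index(m.group(1)) + 1, int(m.group(2))
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


def retrieve(query: str, items: List[TimedItem], top_k: int = 1,
             time_aware: bool = True) -> List[TimedItem]:
    candidates = items
    time_range = infer_time_range(query) if time_aware else None
    if time_range is not None:
        start, end = time_range
        # Narrow the search scope to items inside the inferred range.
        candidates = [it for it in items if start <= it.when <= end]
    q = tokenize(query)
    ranked = sorted(candidates, key=lambda it: len(q & tokenize(it.text)), reverse=True)
    return ranked[:top_k]


items = [
    TimedItem(date(2024, 1, 12), "I went hiking in the Alps with my sister."),
    TimedItem(date(2024, 3, 5), "I visited Kyoto and saw the temples."),
    TimedItem(date(2024, 5, 20), "I went hiking near Lake Tahoe."),
]
query = "Where did I go hiking in May 2024?"

baseline = retrieve(query, items, time_aware=False)[0]
scoped = retrieve(query, items)[0]
print(baseline.text)
print(scoped.text)
```

In this toy setup the unscoped search prefers the January entry because it shares more surface tokens with the question, while the time-scoped search returns the May entry. When no time expression is recognized, the sketch falls back to searching all items.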

Submitted to arXiv on 14 Oct. 2024

Results of the summarizing process for the arXiv paper: 2410.10813v1

Recent advancements in large language model (LLM)-driven chat assistant systems have led to the integration of memory components, resulting in more accurate and personalized responses. However, the long-term memory capabilities of these systems during sustained interactions have not been fully explored. To address this gap, LongMemEval was introduced as a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. LongMemEval presents a significant challenge to existing long-term memory systems with 500 meticulously curated questions embedded within freely scalable user-assistant chat histories. Commercial chat assistants and long-context LLMs showed a 30% accuracy drop in memorizing information across sustained interactions when tested on LongMemEval. To optimize long-term memory design, a unified framework was proposed by breaking it down into four key design choices across indexing, retrieval, and reading stages. Experimental insights led to the proposal of several memory designs including session decomposition for optimizing value granularity, fact-augmented key expansion for enhancing the index structure, and time-aware query expansion for refining the search scope. These optimizations significantly improved both memory recall and downstream question answering on LongMemEval. In conclusion, this study highlights the need for more sophisticated memory mechanisms in LLM-based chat assistants to achieve personalized and reliable conversational AI. LongMemEval serves as a valuable benchmark to drive future advancements in long-term memory capabilities for chat assistants. The reproducibility statement emphasizes transparency in sharing resources and code to enable other researchers to replicate results and build upon findings for further advancements in the field.
Created on 13 Jul. 2025
