Titans: Learning to Memorize at Test Time

AI-generated keywords: Neural Memory Module, Titans Architecture, Long-Term Memory, Decaying Mechanism, Experimental Evaluations

AI-generated Key Points

  • Novel approach to designing a long-term neural memory module
  • Prioritization of events that violate expectations for better memorization
  • Introduction of a decaying mechanism to manage limited memory capacity
  • Decaying mechanism shown to be equivalent to optimizing a meta neural network with mini-batch gradient descent, momentum, and weight decay
  • Titans architecture with three hyper-heads: Core, Long-term Memory, and Persistent Memory
  • Three variants of Titans incorporating memory as context, layer, or gated branch
  • Experimental evaluations demonstrating superiority over modern recurrent models and hybrid variants
  • Titans outperform Transformers with the same context window size and remain competitive with Transformers that use the entire context
  • Training setup using the Llama 2 tokenizer, AdamW optimizer, cosine annealing schedule, 0.5M-token batches, and weight decay of 0.1
  • Superior performance of the neural memory module in Titans compared to other models including Transformer++
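
The surprise-driven update with momentum and weight decay named in the key points can be illustrated with a small sketch. The key points do not spell out the update rule, so everything concrete here is an assumption made for illustration: the function names, the linear (matrix) memory, the squared-error recall loss, and the fixed coefficients `lr`, `momentum`, and `decay`. It is not the authors' implementation.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def surprise_grad(M, key, value):
    """Gradient of 0.5 * ||M @ key - value||^2 with respect to M.

    A large gradient means the memory recalled the wrong value for this key,
    i.e. the input violated its expectation (assumed notion of "surprise").
    """
    err = [p - v for p, v in zip(matvec(M, key), value)]
    return [[e * k for k in key] for e in err]

def memory_step(M, S, key, value, lr=0.5, momentum=0.5, decay=0.01):
    """One test-time update of memory M with surprise state S."""
    G = surprise_grad(M, key, value)
    # Momentum: carry past surprise forward and add the new surprise.
    S = [[momentum * s - lr * g for s, g in zip(s_row, g_row)]
         for s_row, g_row in zip(S, G)]
    # Weight decay acts as forgetting: shrink old memory, then write the update.
    M = [[(1.0 - decay) * m + s for m, s in zip(m_row, s_row)]
         for m_row, s_row in zip(M, S)]
    return M, S

def recall_error(M, key, value):
    """Squared error between the recalled value and the target value."""
    return sum((p - v) ** 2 for p, v in zip(matvec(M, key), value))

# Toy usage: repeatedly present one (key, value) pair to an empty memory.
dim = 2
M = [[0.0] * dim for _ in range(dim)]
S = [[0.0] * dim for _ in range(dim)]
key, value = [1.0, 0.0], [0.5, -1.0]
errors = []
for _ in range(20):
    errors.append(recall_error(M, key, value))
    M, S = memory_step(M, S, key, value)
```

In this toy run the recall error shrinks as the pair is seen repeatedly, but it does not reach exactly zero: the decay term keeps pulling the memory slightly toward zero, which is the price paid for being able to forget.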

Authors: Ali Behrouz, Peilin Zhong, Vahab Mirrokni

License: CC BY 4.0

Abstract: Over more than a decade there has been an extensive research effort on how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory (called hidden state), attention allows attending to the entire context window, capturing the direct dependencies of all tokens. This more accurate modeling of dependencies, however, comes with a quadratic cost, limiting the model to a fixed-length context. We present a new neural long-term memory module that learns to memorize historical context and helps attention to attend to the current context while utilizing long past information. We show that this neural memory has the advantage of fast parallelizable training while maintaining a fast inference. From a memory perspective, we argue that attention due to its limited context but accurate dependency modeling performs as a short-term memory, while neural memory due to its ability to memorize the data, acts as a long-term, more persistent, memory. Based on these two modules, we introduce a new family of architectures, called Titans, and present three variants to address how one can effectively incorporate memory into this architecture. Our experimental results on language modeling, common-sense reasoning, genomics, and time series tasks show that Titans are more effective than Transformers and recent modern linear recurrent models. They further can effectively scale to larger than 2M context window size with higher accuracy in needle-in-haystack tasks compared to baselines.

Submitted to arXiv on 31 Dec. 2024

Results of the summarizing process for the arXiv paper: 2501.00663v1

This paper presents a novel long-term neural memory module that learns to memorize at test time, and shows how to incorporate it into an architecture. Inspired by human long-term memory, the module prioritizes events that violate expectations, making them more memorable. A decaying mechanism manages the limited memory capacity by considering the proportion of memory size and the amount of data surprise. This mechanism is shown to be equivalent to optimizing a meta neural network with mini-batch gradient descent, momentum, and weight decay.

The Titans architecture consists of three hyper-heads: Core, which handles short-term memory processing using attention with a limited window size; Long-term Memory, which stores information from the distant past; and Persistent Memory, which encodes task-independent knowledge. Three variants of Titans incorporate memory as a context, a layer, or a gated branch.

Experimental evaluations on various tasks show that Titans outperform modern recurrent models and their hybrid variants. They also outperform Transformers with the same context window size and are competitive with Transformers that use the entire context, while scaling effectively to context windows larger than 2M tokens.

Training uses the Llama 2 tokenizer with a vocabulary size of 32K and a training length of 4K tokens. The AdamW optimizer is used with a learning rate of 4e-4 and a cosine annealing schedule, a batch size of 0.5M tokens, and a weight decay of 0.1. On language modeling tasks, measured by perplexity and accuracy, the neural memory module in Titans outperforms other models, including Transformer++, which highlights the importance of weight decay and momentum.
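
The training hyperparameters quoted above can be turned into a small worked example of a cosine annealing schedule. This is a sketch under stated assumptions: the summary gives only the peak learning rate, batch size, training length, and weight decay, so the total step count, the absence of warmup, the minimum learning rate of zero, and the reading of "0.5M" and "4K" as round decimal numbers are all assumptions. The name `cosine_lr` is hypothetical.

```python
import math

PEAK_LR = 4e-4          # learning rate quoted in the summary
WEIGHT_DECAY = 0.1      # weight decay quoted in the summary
BATCH_TOKENS = 500_000  # "0.5M tokens", read as 500,000 (assumption)
SEQ_LEN = 4_000         # "4K tokens", read as 4,000 (assumption; 4,096 is also plausible)

def cosine_lr(step, total_steps, peak_lr=PEAK_LR, min_lr=0.0):
    """Cosine annealing from peak_lr at step 0 down to min_lr at total_steps.

    No warmup phase is modeled; that is an assumption of this sketch.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Under the round-number reading, each batch holds this many sequences.
sequences_per_batch = BATCH_TOKENS // SEQ_LEN

# Hypothetical run length, chosen only to show the shape of the schedule.
TOTAL_STEPS = 1_000
schedule = [cosine_lr(s, TOTAL_STEPS) for s in (0, 250, 500, 750, 1_000)]
```

The schedule starts at the peak rate, passes through half of it at the midpoint, and decays smoothly to the minimum at the final step.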
Created on 16 Jan. 2025
