It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization

AI-generated keywords: Test-time memorization

AI-generated Key Points

  • Authors explore efficient architectural backbones inspired by human cognitive attentional bias
  • Introduce alternative attentional bias configurations to stabilize training procedures
  • Reinterpret forgetting mechanisms as retention regularization, introducing novel forget gates
  • Present Miras framework for designing deep learning architectures based on associative memory architecture, attentional bias objective, retention gate, and memory learning algorithm
  • Introduce three novel sequence models (Moneta, Yaad, and Memora) that outperform existing linear RNNs
  • Experimental results show varying strengths of models within Miras based on design choices
  • Models excel in tasks such as language modeling, commonsense reasoning, and recall-intensive activities
  • Evaluation of scaling pattern shows all variants of Miras outperform baselines with increased context length due to expressive memory architecture

Authors: Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, Vahab Mirrokni

License: CC BY 4.0

Abstract: Designing efficient and effective architectural backbones has been at the core of research efforts to enhance the capability of foundation models. Inspired by the human cognitive phenomenon of attentional bias (the natural tendency to prioritize certain events or stimuli), we reconceptualize neural architectures, including Transformers, Titans, and modern linear recurrent neural networks, as associative memory modules that learn a mapping of keys and values using an internal objective, referred to as attentional bias. Surprisingly, we observed that most existing sequence models leverage either (1) dot-product similarity or (2) L2 regression objectives as their attentional bias. Going beyond these objectives, we present a set of alternative attentional bias configurations along with their effective approximations to stabilize their training procedure. We then reinterpret forgetting mechanisms in modern deep learning architectures as a form of retention regularization, providing a novel set of forget gates for sequence models. Building upon these insights, we present Miras, a general framework to design deep learning architectures based on four choices: (i) associative memory architecture, (ii) attentional bias objective, (iii) retention gate, and (iv) memory learning algorithm. We present three novel sequence models (Moneta, Yaad, and Memora) that go beyond the power of existing linear RNNs while maintaining a fast parallelizable training process. Our experiments show that different design choices in Miras yield models with varying strengths. For example, certain instances of Miras achieve exceptional performance in special tasks such as language modeling, commonsense reasoning, and recall-intensive tasks, even outperforming Transformers and other modern linear recurrent models.

Submitted to arXiv on 17 Apr. 2025


Results of the summarizing process for the arXiv paper: 2504.13173v1

In their paper "It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization," authors Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni explore efficient architectural backbones to enhance foundation models. They draw inspiration from human cognitive attentional bias to reconceptualize neural architectures like Transformers and Titans as associative memory modules that learn key-value mappings through an internal objective called attentional bias. The authors observe that existing sequence models primarily use dot-product similarity or L2 regression objectives for their attentional bias. To go beyond these two objectives, they introduce alternative attentional bias configurations, along with approximations that stabilize training. Additionally, they reinterpret forgetting mechanisms in deep learning architectures as retention regularization, introducing novel forget gates for sequence models. Building on these insights, the authors present Miras, a framework for designing deep learning architectures based on four choices: associative memory architecture, attentional bias objective, retention gate, and memory learning algorithm. They introduce three novel sequence models (Moneta, Yaad, and Memora) that surpass existing linear RNNs while maintaining fast, parallelizable training. Experimental results demonstrate that different design choices within Miras yield models with varying strengths. Certain instances of Miras excel in tasks such as language modeling, commonsense reasoning, and recall-intensive tasks, outperforming even Transformers and other modern linear recurrent models. Furthermore, the authors evaluate the scaling pattern of their models by varying model size and context window length. Results show that all variants of Miras outperform state-of-the-art baselines as context length increases, owing to their expressive memory architecture.
Overall, this paper provides a comprehensive exploration of test-time memorization mechanisms in deep learning architectures and offers a promising framework for designing efficient neural network structures with improved performance across various tasks.
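
To make the associative-memory view concrete, here is a minimal sketch of one online memory update. It is an illustration under stated assumptions, not the authors' implementation of Moneta, Yaad, or Memora: the memory is assumed to be a single linear map M, the learning algorithm is assumed to be plain gradient descent, and retention is assumed to be a scalar decay on the previous memory. The names memory_step, run, lr, retain, and bias are hypothetical. The two bias options correspond to the two objectives the summary says most existing models use: L2 regression and dot-product similarity.

```python
import numpy as np


def memory_step(M, k, v, lr=1.0, retain=1.0, bias="l2"):
    """One online update of a linear associative memory M (key -> value).

    Assumed setup (illustrative only): gradient descent on an
    attentional-bias objective, plus scalar decay as the retention gate.
    """
    if bias == "l2":
        # Gradient of 0.5 * ||M k - v||^2 with respect to M.
        grad = np.outer(M @ k - v, k)
    elif bias == "dot":
        # Gradient of -<M k, v> with respect to M (a Hebbian-style update).
        grad = -np.outer(v, k)
    else:
        raise ValueError(f"unknown attentional bias: {bias}")
    # retain < 1 shrinks the old memory before writing the new pair.
    return retain * M - lr * grad


def run(keys, values, **kwargs):
    """Feed (key, value) pairs to the memory one at a time."""
    M = np.zeros((values.shape[1], keys.shape[1]))
    for k, v in zip(keys, values):
        M = memory_step(M, k, v, **kwargs)
    return M


if __name__ == "__main__":
    keys = np.eye(3)[:2]                        # two orthonormal keys
    values = np.array([[1.0, 2.0], [3.0, 4.0]])
    M_full = run(keys, values, retain=1.0)      # no forgetting
    M_decay = run(keys, values, retain=0.5)     # older pair is attenuated
    print(M_full @ keys[0], M_decay @ keys[0])
```

With orthonormal keys and no decay, querying the memory with the first key recovers the first value exactly; with retain=0.5, the older pair is recalled at half strength while the most recent pair is intact. In the framework's terms, swapping the objective, the decay rule, the shape of M, or the optimizer would each give a different model.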
Created on 08 Dec. 2025

