Long-range Language Modeling with Self-retrieval

AI-generated keywords: Retrieval-Pretrained Transformer, Long-Range Language Modeling, GPT-NeoX Tokenizer, Pythia Scoring Model, Books3 Dataset

AI-generated Key Points

- Retrieval-augmented language models (LMs) are gaining attention.
- The Retrieval-Pretrained Transformer (RPT) is proposed to jointly train a retrieval-augmented LM from scratch for long-range language modeling.
- RPT computes query representations for recently generated text chunks in a long document and uses them to retrieve earlier chunks, located potentially tens of thousands of tokens before.
- Information from retrieved chunks is fused into the LM representations to predict the next target chunk.
- The retriever component is trained with a semantic objective: retrieve chunks that increase the probability of the next chunk according to a reference LM.
- RPT is evaluated on four long-range language modeling tasks spanning books, code, and mathematical writing, on datasets whose documents are generally long.
- The RPT model has 12 layers with hidden dimension d=1024 and eight attention heads with head dimension 128; chunked cross-attention (CCA) is applied every two layers, and two neighbors are used unless mentioned otherwise.
- RPT improves retrieval quality and perplexity compared to strong baselines.

Authors: Ohad Rubin, Jonathan Berant

arXiv: 2306.13421v1 (cs.CL)

License: CC BY 4.0

Abstract: Retrieval-augmented language models (LMs) have received much attention recently. However, typically the retriever is not trained jointly as a native component of the LM, but added to an already-pretrained LM, which limits the ability of the LM and the retriever to adapt to one another. In this work, we propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch for the task of modeling long texts. Given a recently generated text chunk in a long document, the LM computes query representations, which are then used to retrieve earlier chunks in the document, located potentially tens of thousands of tokens before. Information from retrieved chunks is fused into the LM representations to predict the next target chunk. We train the retriever component with a semantic objective, where the goal is to retrieve chunks that increase the probability of the next chunk, according to a reference LM. We evaluate RPT on four long-range language modeling tasks, spanning books, code, and mathematical writing, and demonstrate that RPT improves retrieval quality and subsequently perplexity across the board compared to strong baselines.

Submitted to arXiv on 23 Jun. 2023

Results of the summarizing process for the arXiv paper: 2306.13421v1

Comprehensive Summary

Retrieval-augmented language models (LMs) have received significant attention recently. Typically, however, the retriever is added to an already-pretrained LM rather than trained jointly with it, which limits the ability of the two components to adapt to one another. This work proposes the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch for long-range language modeling. The RPT model computes query representations for recently generated text chunks in a long document and uses them to retrieve earlier chunks, located potentially tens of thousands of tokens before. Information from retrieved chunks is fused into the LM representations to predict the next target chunk. The retriever component is trained with a semantic objective: retrieve chunks that increase the probability of the next chunk according to a reference LM. RPT is evaluated on four long-range language modeling tasks spanning books, code, and mathematical writing, on datasets whose documents are generally long.

At training time, sequences of length L=16384 tokens are split across four devices, each consuming 4096 tokens. The decoder stack takes 2048 tokens as input in a sliding-window approach, that is, ℓ=32 chunks of length m=64. Rotary positional embeddings are used, and all models are trained for 500K steps on a TPUv4-64 with an effective batch size of 2^17 tokens. All models use the GPT-NeoX tokenizer. Pythia serves as the scoring LM: its deduplicated 1.4B-parameter version scores the top-20 BM25 candidates. The RPT model has 12 layers with hidden dimension d=1024 and eight attention heads with head dimension 128; chunked cross-attention (CCA) is applied every two layers, and two neighbors are used unless mentioned otherwise.

The Books3 dataset is a corpus of books released as part of the Pile, covering literary works from different domains; the authors use it as a long-range language modeling benchmark for the first time. Overall, RPT improves retrieval quality and perplexity compared to strong baselines.
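
The semantic objective described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_log_prob` is a hypothetical stand-in for the scoring LM (an add-one smoothed unigram model, not Pythia), and ranking candidates by their gain in log-probability over a no-candidate baseline is one plausible reading of "chunks that increase the probability of the next chunk", not necessarily the exact formula used in the paper.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def toy_log_prob(context: Sequence[str], target: Sequence[str],
                 vocab_size: int = 1000) -> float:
    """Hypothetical stand-in for a reference LM.

    Scores the target with an add-one smoothed unigram model estimated
    from the context. A real setup would query an actual LM instead.
    """
    counts = Counter(context)
    total = len(context) + vocab_size
    return sum(math.log((counts[tok] + 1) / total) for tok in target)


def rank_candidates(candidates: Sequence[Sequence[str]],
                    query_chunk: Sequence[str],
                    target_chunk: Sequence[str],
                    log_prob=toy_log_prob) -> List[Tuple[float, int]]:
    """Rank candidate chunks by how much they raise log p(target).

    Returns (gain, candidate_index) pairs, best candidate first.
    """
    baseline = log_prob(list(query_chunk), target_chunk)
    gains = []
    for i, cand in enumerate(candidates):
        with_cand = log_prob(list(cand) + list(query_chunk), target_chunk)
        gains.append((with_cand - baseline, i))
    return sorted(gains, reverse=True)


query = ["the", "detective", "opened"]
target = ["the", "locked", "drawer"]
candidates = [
    ["a", "locked", "drawer", "upstairs"],   # mentions the target's words
    ["rain", "fell", "all", "night"],        # unrelated to the target
]
ranked = rank_candidates(candidates, query, target)
print([index for _, index in ranked])  # [0, 1]
```

In this toy run the first candidate gets a positive gain because it contains words that reappear in the target chunk, while the unrelated candidate gets a slightly negative one. Rankings of this kind are what a retriever could be trained to reproduce.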

There is a new way to teach computers how to understand and write long pieces of text. It is called the Retrieval-Pretrained Transformer (RPT). RPT helps a computer find information from earlier parts of a text so that it can better predict what comes next. The computer learns to do this by practicing on different types of writing, such as books, computer code, and mathematical writing. RPT has 12 layers and learns to search and to write at the same time, rather than having the search step added afterwards. Overall, RPT is a promising new tool for helping computers handle long, complex texts.

Definitions

- Retrieval-augmented language models: Language models that look up relevant passages of text and use them when predicting or generating language.
- Transformer: A type of neural network architecture used in natural language processing tasks.
- Long-range language modeling tasks: Tasks that require understanding and generating long pieces of text.
- Perplexity: A measure of how well a language model predicts the next word in a sequence; lower is better.
- Semantic objective: Here, a training goal that rewards the retriever for picking earlier chunks that make the next chunk more probable according to a reference LM.
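
To make the perplexity definition concrete, here is a minimal Python sketch using the standard formula: the exponential of the negative mean log-probability per token. The function name and the example values are illustrative and are not taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from natural-log probabilities of each token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_log_prob = sum(token_log_probs) / len(token_log_probs)
    return math.exp(-mean_log_prob)


# A model that assigns probability 1/4 to every token is as uncertain
# as a uniform choice among 4 options, so its perplexity is about 4.
uniform_quarter = [math.log(0.25)] * 10
print(round(perplexity(uniform_quarter), 6))  # 4.0
```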

Retrieval-Pretrained Transformer (RPT): A New Architecture for Long-Range Language Modeling

In recent years, retrieval-augmented language models (LMs) have received significant attention. Typically, however, the retriever is added to an already-pretrained LM rather than trained jointly with it, which limits how well the two components can adapt to one another. To address this, Ohad Rubin and Jonathan Berant propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch.

What is RPT?

The RPT model computes query representations for recently generated text chunks in a long document and uses them to retrieve earlier chunks located potentially tens of thousands of tokens before. Information from retrieved chunks is fused into the LM representations to predict the next target chunk. The retriever component is trained with a semantic objective that aims to retrieve chunks that increase the probability of the next chunk according to a reference LM.
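
As a rough illustration of the retrieval step described above, the following Python sketch splits a token sequence into chunks and picks, for a given query chunk, the best-matching chunks that appear earlier in the same document. The function names, the dot-product scoring, and the toy two-dimensional vectors are assumptions made for illustration; the paper's learned representations and retriever are not reproduced here.

```python
from typing import List, Sequence


def split_into_chunks(tokens: Sequence[int], chunk_len: int) -> List[List[int]]:
    """Split a token sequence into consecutive chunks (the last may be shorter)."""
    return [list(tokens[i:i + chunk_len]) for i in range(0, len(tokens), chunk_len)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve_earlier_chunks(query_vec: Sequence[float],
                            chunk_vecs: Sequence[Sequence[float]],
                            query_index: int,
                            top_k: int = 2) -> List[int]:
    """Return indices of the top_k highest-scoring chunks before query_index.

    Only earlier chunks are candidates, so the model never retrieves
    text it has not generated yet.
    """
    earlier = range(query_index)
    ranked = sorted(earlier, key=lambda j: dot(query_vec, chunk_vecs[j]), reverse=True)
    return ranked[:top_k]


chunks = split_into_chunks(list(range(10)), chunk_len=4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# One toy vector per chunk; chunk 3 plays the role of the query chunk.
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
print(retrieve_earlier_chunks([1.0, 0.0], chunk_vecs, query_index=3))  # [0, 2]
```

With `top_k=2` the sketch mirrors the default of two neighbors mentioned in the text. In a full model the retrieved chunks would then be fused into the LM representations.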

How Does RPT Work?

At training time, sequences of length L=16384 tokens are split across four devices, each consuming 4096 tokens. The decoder stack takes 2048 tokens as input in a sliding-window approach, that is, ℓ=32 chunks of length m=64. Rotary positional embeddings are used, and all models are trained for 500K steps on a TPUv4-64 with an effective batch size of 2^17 tokens. All models use the GPT-NeoX tokenizer. Pythia serves as the scoring LM: its deduplicated 1.4B-parameter version scores the top-20 BM25 candidates. The RPT model has 12 layers with hidden dimension d=1024 and eight attention heads with head dimension 128; chunked cross-attention (CCA) is applied every two layers, and two neighbors are used unless mentioned otherwise.
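
The numbers in this setup fit together arithmetically, which the small Python sketch below checks. The class and field names are hypothetical, and reading the effective batch size as 2^17 tokens is an assumption; the remaining values are the ones quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RPTConfig:
    # Values quoted in the text; the field names are illustrative only.
    train_seq_len: int = 16_384    # L, tokens per training sequence
    num_devices: int = 4           # devices each sequence is split across
    window_len: int = 2_048        # tokens seen by the decoder stack
    chunk_len: int = 64            # m, tokens per chunk
    num_layers: int = 12
    hidden_dim: int = 1_024        # d
    num_heads: int = 8
    head_dim: int = 128
    num_neighbors: int = 2
    batch_tokens: int = 2 ** 17    # assumed reading of the batch size

    @property
    def tokens_per_device(self) -> int:
        return self.train_seq_len // self.num_devices

    @property
    def chunks_per_window(self) -> int:
        return self.window_len // self.chunk_len

    @property
    def chunks_per_sequence(self) -> int:
        return self.train_seq_len // self.chunk_len

    @property
    def sequences_per_batch(self) -> int:
        return self.batch_tokens // self.train_seq_len


cfg = RPTConfig()
print(cfg.tokens_per_device)                           # 4096
print(cfg.chunks_per_window)                           # 32
print(cfg.chunks_per_sequence)                         # 256
print(cfg.num_heads * cfg.head_dim == cfg.hidden_dim)  # True
print(cfg.sequences_per_batch)                         # 8
```

Under these assumptions a training sequence holds 256 chunks, of which the decoder sees 32 at a time, so most of a document is reachable only through retrieval.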

Evaluation Results

The RPT model was evaluated on four long-range language modeling tasks spanning books, code, and mathematical writing, with generally long documents across all datasets: PG19 and Books3 (books), CodeParrot (code), and ArXiv (mathematical writing). Overall, RPT improved retrieval quality and perplexity compared to strong baselines.

Conclusion

In conclusion, the paper presents the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from scratch for long-range language modeling. On all four evaluated tasks it outperforms strong baselines, improving both retrieval quality and perplexity. This research could pave the way toward more efficient and accurate natural language processing applications.

Created on 27 Jun. 2023
