Pushdown Layers: Encoding Recursive Structure in Transformer Language Models
AI-generated Key Points
- Pushdown Layers are introduced as a new self-attention layer for Transformer language models
- They address the difficulty self-attention has in capturing the recursive structure of human language
- A stack tape tracks the estimated depth of every token in an incremental parse of the observed prefix
- The stack tape lets the model softly modulate attention over tokens, for instance learning to "skip" over closed constituents
- Trained on strings annotated with silver constituency parses, Pushdown Transformers achieve dramatically better syntactic generalization than standard Transformer models at similar perplexities
- Their syntactic generalization is 3-5x more sample-efficient than that of standard Transformer models
- The WIKITREES dataset was created, consisting of over 100 million tokens from Wikipedia articles
- On WIKITREES, Pushdown Transformers show drastically more sample-efficient syntactic generalization than base Transformers
- Staged finetuning of GPT2-medium with Pushdown Layers improves performance on language understanding tasks, not only on syntactic generalization
- Replacing the final 12 self-attention blocks with Pushdown Layers yields better performance on several GLUE text classification tasks
Authors: Shikhar Murty, Pratyusha Sharma, Jacob Andreas, Christopher D. Manning
Abstract: Recursion is a prominent feature of human language, and fundamentally challenging for self-attention due to the lack of an explicit recursive-state tracking mechanism. Consequently, Transformer language models poorly capture long-tail recursive structure and exhibit sample-inefficient syntactic generalization. This work introduces Pushdown Layers, a new self-attention layer that models recursive state via a stack tape that tracks estimated depths of every token in an incremental parse of the observed prefix. Transformer LMs with Pushdown Layers are syntactic language models that autoregressively and synchronously update this stack tape as they predict new tokens, in turn using the stack tape to softly modulate attention over tokens -- for instance, learning to "skip" over closed constituents. When trained on a corpus of strings annotated with silver constituency parses, Transformers equipped with Pushdown Layers achieve dramatically better and 3-5x more sample-efficient syntactic generalization, while maintaining similar perplexities. Pushdown Layers are a drop-in replacement for standard self-attention. We illustrate this by finetuning GPT2-medium with Pushdown Layers on an automatically parsed WikiText-103, leading to improvements on several GLUE text classification tasks.
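The stack-tape idea in the abstract can be made concrete with a small toy sketch. This is not the authors' implementation: the function names (`shift`, `reduce_last`, `depth_modulated_attention`), the convention that a new token enters at depth 0 and sinks one level each time a constituent closes over it, and the fixed linear `penalty` are all assumptions made for illustration. The abstract describes the modulation as soft and learned; the fixed penalty below is only a stand-in for that.

```python
import math


def shift(depths):
    """Toy tape update (assumed convention): a new token enters at depth 0."""
    return depths + [0]


def reduce_last(depths, k):
    """Toy tape update (assumed convention): close a constituent over the
    last k tokens, pushing each of them one level deeper."""
    if not 0 < k <= len(depths):
        raise ValueError("k must be between 1 and the number of tokens")
    return depths[:-k] + [d + 1 for d in depths[-k:]]


def depth_modulated_attention(scores, depths, penalty=1.0):
    """Softmax over raw attention scores after subtracting penalty * depth.

    The linear penalty is a hypothetical stand-in for a learned,
    depth-dependent modulation of attention.
    """
    if len(scores) != len(depths):
        raise ValueError("scores and depths must have the same length")
    biased = [s - penalty * d for s, d in zip(scores, depths)]
    top = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(b - top) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]


# Three tokens arrive, then the last two are closed into one constituent.
tape = []
for _ in range(3):
    tape = shift(tape)
tape = reduce_last(tape, 2)

# With equal raw scores, the tokens inside the closed constituent
# receive less attention than the token still at depth 0.
weights = depth_modulated_attention([0.0, 0.0, 0.0], tape)
```

With equal raw scores, the two tokens inside the closed constituent end up with smaller attention weights than the open token, which is the "skip over closed constituents" behaviour in miniature.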