Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

AI-generated keywords: Vid2Seq, Pretraining, Dense Video Captioning, YT-Temporal-1B, CVPR 2023

AI-generated Key Points

The license of the paper does not allow us to build upon its content, so the key points are generated from the paper metadata rather than the full article.

  • Vid2Seq: a multi-modal single-stage dense event captioning model
  • Pretrained on narrated videos at scale
  • Incorporates special time tokens for event boundaries and textual descriptions in the same output sequence
  • Uses unlabeled narrated videos for training by reformulating sentence boundaries as pseudo event boundaries and transcribed speech sentences as pseudo captions
  • Improves the state of the art on a variety of dense video captioning benchmarks, including YouCook2, ViTT, and ActivityNet Captions
  • Generalizes well to video paragraph captioning and video clip captioning
  • Performs well in few-shot settings
  • Code available at https://antoyang.github.io/vid2seq.html
  • Paper accepted at CVPR 2023 (camera-ready version: 18 pages, 6 figures)
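
The idea of placing time tokens and captions in one output sequence can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the `<time=K>` token spelling, the default of 100 time bins, and the helper names `time_to_token`, `build_target` and `parse_target` are all hypothetical. The sketch quantizes event timestamps relative to the video duration, interleaves the resulting tokens with the captions, and decodes such a sequence back into timed events.

```python
import re

def time_to_token(t, duration, num_bins=100):
    """Map a timestamp in seconds to a discrete time token (hypothetical format)."""
    idx = int(t * num_bins / duration)
    idx = min(max(idx, 0), num_bins - 1)  # clamp so t == duration falls in the last bin
    return f"<time={idx}>"

def build_target(events, duration, num_bins=100):
    """Serialize (start, end, caption) events into one string, ordered by start time."""
    parts = []
    for start, end, caption in sorted(events, key=lambda e: e[0]):
        parts.append(time_to_token(start, duration, num_bins))
        parts.append(time_to_token(end, duration, num_bins))
        parts.append(caption)
    return " ".join(parts)

# Two time tokens followed by caption text, up to the next time token or the end.
# Assumes captions never contain the literal text "<time=".
_EVENT = re.compile(r"<time=(\d+)> <time=(\d+)> (.*?)(?= <time=|$)")

def parse_target(sequence, duration, num_bins=100):
    """Decode a serialized sequence back into (start, end, caption) events."""
    events = []
    for m in _EVENT.finditer(sequence):
        start = int(m.group(1)) * duration / num_bins
        end = int(m.group(2)) * duration / num_bins
        events.append((start, end, m.group(3)))
    return events

if __name__ == "__main__":
    demo = [(12.0, 30.0, "crack the eggs"), (0.0, 12.0, "heat the pan")]
    seq = build_target(demo, duration=120.0)
    print(seq)
    print(parse_target(seq, duration=120.0))
```

Decoding recovers timestamps only up to the bin width (duration / num_bins), so more bins give finer boundaries at the cost of a larger vocabulary.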

Authors: Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, Cordelia Schmid

CVPR 2023 Camera-Ready; Project Webpage: https://antoyang.github.io/vid2seq.html ; 18 pages; 6 figures

Abstract: In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. Such a unified model requires large-scale training data, which is not available in current annotated datasets. We show that it is possible to leverage unlabeled narrated videos for dense video captioning, by reformulating sentence boundaries of transcribed speech as pseudo event boundaries, and using the transcribed speech sentences as pseudo event captions. The resulting Vid2Seq model pretrained on the YT-Temporal-1B dataset improves the state of the art on a variety of dense video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the tasks of video paragraph captioning and video clip captioning, and to few-shot settings. Our code is publicly available at https://antoyang.github.io/vid2seq.html.
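
The pseudo-labeling step described in the abstract, where sentence boundaries of transcribed speech become pseudo event boundaries and the sentences become pseudo captions, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Word` record, the word-level timestamps, the reliance on sentence-final punctuation in the transcript, and the function name `sentences_to_pseudo_events` are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Word:
    """One transcribed word with its start and end time in seconds (assumed input format)."""
    text: str
    start: float
    end: float

SENTENCE_END = (".", "?", "!")

def sentences_to_pseudo_events(words):
    """Group timestamped words into sentences and emit (start, end, caption) triples.

    A sentence starts at its first word's start time and ends at its last
    word's end time; the sentence text itself is used as the pseudo caption.
    """
    events, current = [], []
    for w in words:
        current.append(w)
        if w.text.endswith(SENTENCE_END):
            events.append((current[0].start, current[-1].end,
                           " ".join(x.text for x in current)))
            current = []
    if current:  # trailing words without closing punctuation form a final sentence
        events.append((current[0].start, current[-1].end,
                       " ".join(x.text for x in current)))
    return events

if __name__ == "__main__":
    transcript = [
        Word("Heat", 0.0, 0.4), Word("the", 0.4, 0.6), Word("pan.", 0.6, 1.0),
        Word("Now", 3.0, 3.3), Word("add", 3.3, 3.6), Word("oil", 3.6, 4.0),
    ]
    for event in sentences_to_pseudo_events(transcript):
        print(event)
```

The resulting triples have the same shape as annotated dense-captioning events, which is what would let unlabeled narrated videos stand in for annotated training data.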

Submitted to arXiv on 27 Feb. 2023


Results of the summarizing process for the arXiv paper: 2302.14115v2


In "Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning," Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic and Cordelia Schmid introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale. The Vid2Seq architecture augments a language model with special time tokens, which let it predict event boundaries and textual descriptions in the same output sequence. Because such a unified model requires large-scale training data that current annotated datasets do not provide, the authors leverage unlabeled narrated videos: sentence boundaries of the transcribed speech serve as pseudo event boundaries, and the transcribed sentences serve as pseudo event captions. Pretrained on the YT-Temporal-1B dataset, Vid2Seq improves the state of the art on a variety of dense video captioning benchmarks, including YouCook2, ViTT and ActivityNet Captions. It also generalizes well to video paragraph captioning and video clip captioning, and to few-shot settings. The code is publicly available at https://antoyang.github.io/vid2seq.html. The paper was accepted at CVPR 2023; the camera-ready version comprises 18 pages and 6 figures.
Created on 20 Dec. 2023

