CogVLM: Visual Expert for Pretrained Language Models

AI-generated keywords: CogVLM, Visual Language Model, Deep Fusion, Cross-Modal Benchmarks, NLP Tasks

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang introduce CogVLM, a powerful open-source visual language foundation model.
  • CogVLM adds a trainable visual expert module to the attention and FFN layers, bridging the gap between the frozen pretrained language model and the image encoder.
  • CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks: NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC.
  • It ranks second on VQAv2, OKVQA, TextVQA, COCO captioning, and others, surpassing or matching PaLI-X 55B.
  • Code and checkpoints for CogVLM are available at https://github.com/THUDM/CogVLM.

Authors: Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, Jie Tang

Abstract: We introduce CogVLM, a powerful open-source visual language foundation model. Different from the popular shallow alignment method which maps image features into the input space of language model, CogVLM bridges the gap between the frozen pretrained language model and image encoder by a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA and TDIUC, and ranks 2nd on VQAv2, OKVQA, TextVQA, COCO captioning, etc., surpassing or matching PaLI-X 55B. Codes and checkpoints are available at https://github.com/THUDM/CogVLM.
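
The abstract does not spell out how the visual expert is wired into each layer, so the following is only a minimal sketch of one plausible reading: image tokens are projected with a separate, trainable set of weights, while text tokens keep the frozen language-model weights. All names here (`matvec`, `routed_projection`, `w_text`, `w_image`) are hypothetical and chosen for illustration; they are not taken from the CogVLM codebase, and the toy weights are not real model parameters.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]  # row-major, shape (d_out, d_in)


def matvec(weight: Matrix, x: Sequence[float]) -> Vector:
    """Multiply a (d_out, d_in) matrix by a length-d_in vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def routed_projection(
    hidden: Sequence[Sequence[float]],
    is_image: Sequence[bool],
    w_text: Matrix,
    w_image: Matrix,
) -> List[Vector]:
    """Project each token with the weights that match its modality.

    Assumption for illustration: w_text stands in for a frozen
    language-model projection, w_image for a trainable visual-expert
    projection of the same shape. Sequence order is preserved.
    """
    if len(hidden) != len(is_image):
        raise ValueError("hidden and is_image must have the same length")
    return [
        matvec(w_image if img else w_text, h)
        for h, img in zip(hidden, is_image)
    ]


# Toy example: two image tokens followed by one text token, width 2.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_image = [True, True, False]
w_text = [[1.0, 0.0], [0.0, 1.0]]   # identity: text tokens pass through
w_image = [[2.0, 0.0], [0.0, 2.0]]  # toy expert: doubles image tokens

projected = routed_projection(hidden, is_image, w_text, w_image)
print(projected)
```

In this sketch a text-only sequence never touches `w_image`, so its output is identical to what the original projection alone would produce. That is one way to picture how adding a visual expert could leave behavior on pure NLP inputs unchanged, as the abstract claims.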

Submitted to arXiv on 06 Nov. 2023


Results of the summarizing process for the arXiv paper: 2311.03079v1


Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang introduce CogVLM, a powerful open-source visual language foundation model. Unlike the popular shallow alignment method, which maps image features into the input space of the language model, CogVLM bridges the gap between the frozen pretrained language model and the image encoder with a trainable visual expert module in the attention and FFN layers. This enables deep fusion of vision and language features without compromising performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks: NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC. It also ranks second on VQAv2, OKVQA, TextVQA, COCO captioning, and others, surpassing or matching PaLI-X 55B. Code and checkpoints for CogVLM are available at https://github.com/THUDM/CogVLM.
Created on 10 Nov. 2023


