Sigmoid Loss for Language Image Pre-Training

AI-generated keywords: Sigmoid Loss

AI-generated Key Points

  • Proposal of a simpler alternative to standard contrastive learning with softmax normalization in the form of a sigmoid loss
  • Decoupling batch size from task definition for scaling up batch sizes and better performance at smaller batch sizes
  • Comparison of sigmoid loss with standard softmax loss in CLIP and LiT setups, showing superiority of sigmoid loss for batch sizes smaller than 16k
  • Simplicity and efficiency of distributed implementation of sigmoid loss without needing operations across the full batch
  • Successful training of SigLiT model at a large batch size of one million due to symmetric nature of sigmoid loss
  • Importance of disentangling batch size from loss function for training efficiency and quality in language-image pre-training tasks

Authors: Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer

Xiaohua and Lucas contributed equally
License: CC BY 4.0

Abstract: We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes. With only four TPUv4 chips, we can train a Base CLIP model at 4k batch size and a Large LiT model at 20k batch size, the latter achieves 84.5% ImageNet zero-shot accuracy in two days. This disentanglement of the batch size from the loss further allows us to study the impact of examples vs pairs and negative to positive ratio. Finally, we push the batch size to the extreme, up to one million, and find that the benefits of growing batch size quickly diminish, with a more reasonable batch size of 32k being sufficient. We hope our research motivates further explorations in improving the quality and efficiency of language-image pre-training.
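
To make the contrast in the abstract concrete, the two objectives can be written side by side. This is a sketch in assumed notation that does not appear on this page: x_i and y_j are L2-normalized image and text embeddings, t is a learnable temperature, b is a learnable bias, |B| is the batch size, and z_ij is +1 for a matching pair and -1 otherwise.

```latex
% Softmax contrastive loss: each term is normalized over the whole batch
% (image-to-text over j in the first term, text-to-image in the second).
\mathcal{L}_{\text{softmax}} = -\frac{1}{2|B|} \sum_{i=1}^{|B|} \left(
    \log \frac{e^{t\, x_i \cdot y_i}}{\sum_{j=1}^{|B|} e^{t\, x_i \cdot y_j}}
  + \log \frac{e^{t\, x_i \cdot y_i}}{\sum_{j=1}^{|B|} e^{t\, x_j \cdot y_i}}
\right)

% Pairwise sigmoid loss: every (i, j) pair is an independent binary
% classification, so no batch-wide normalization term appears.
\mathcal{L}_{\text{sigmoid}} = -\frac{1}{|B|} \sum_{i=1}^{|B|} \sum_{j=1}^{|B|}
    \log \frac{1}{1 + e^{\, z_{ij} \left( -t\, x_i \cdot y_j - b \right)}}
```

The denominators in the softmax form are the "global view of the pairwise similarities" that the abstract refers to; the sigmoid form has no such term.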

Submitted to arXiv on 27 Mar. 2023


Results of the summarizing process for the arXiv paper: 2303.15343v1

In their paper titled "Sigmoid Loss for Language Image Pre-Training," Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer propose a simpler alternative to standard contrastive learning with softmax normalization in the form of a sigmoid loss. This loss operates solely on image-text pairs and eliminates the need for a global view of pairwise similarities for normalization. By decoupling the batch size from the task definition, the sigmoid loss allows for further scaling up of batch sizes while also performing better at smaller batch sizes.

The authors compare the proposed sigmoid loss with the standard softmax loss across various setups, particularly focusing on two prominent approaches for image-text learning: CLIP and LiT. They introduce sigmoid language image pre-training (SigLIP) and sigmoid LiT (SigLiT), finding that the sigmoid loss outperforms the softmax loss significantly when the batch size is smaller than 16k. As the train batch size increases, however, the performance gap between the two losses diminishes.

One key advantage of the sigmoid loss is its simplicity and efficiency in distributed loss implementation. It requires no operation across the full batch, simplifying implementation and boosting efficiency. The symmetric nature of the sigmoid loss allows for successful training of a SigLiT model at a large batch size of one million.

Overall, this research sheds light on how disentangling batch size from loss function can impact training efficiency and quality in language-image pre-training tasks. The findings suggest that while growing batch sizes can offer benefits up to a certain point, more reasonable batch sizes may be sufficient to achieve similar quality results without sacrificing efficiency. This study paves the way for further exploration into improving both quality and efficiency in language-image pre-training methodologies.
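
The claim that the sigmoid loss "requires no operation across the full batch" can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy embeddings, and the default values t=10 and b=-10 are assumptions chosen for illustration. The sketch computes the pairwise loss once over the whole batch and once by summing over chunks of texts, which gives the same value because every image-text pair contributes an independent term.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(1 / (1 + exp(-x))).
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pair_terms(img, txt_chunk, col_offset, t, b):
    # Sum of -log sigmoid(z_ij * (t * x_i . y_j + b)) for all images against
    # one chunk of texts. col_offset is the chunk's starting index in the
    # full batch, so that z_ij = +1 only for the true matching pair.
    total = 0.0
    for i, x in enumerate(img):
        for j, y in enumerate(txt_chunk):
            z = 1.0 if i == col_offset + j else -1.0
            total -= log_sigmoid(z * (t * dot(x, y) + b))
    return total


def sigmoid_loss(img, txt, t=10.0, b=-10.0, chunk=None):
    # img[i] and txt[i] are assumed to be L2-normalized and to form a
    # matching pair. t and b are illustrative defaults (assumption).
    n = len(img)
    chunk = chunk or n
    total = sum(
        pair_terms(img, txt[s:s + chunk], s, t, b) for s in range(0, n, chunk)
    )
    return total / n


# Toy, already-normalized 2-D embeddings (hypothetical data).
images = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
texts_matched = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
texts_shuffled = texts_matched[1:] + texts_matched[:1]

full = sigmoid_loss(images, texts_matched)
chunked = sigmoid_loss(images, texts_matched, chunk=2)
mismatched = sigmoid_loss(images, texts_shuffled)
```

Here `full` and `chunked` agree up to floating-point rounding, and `mismatched` is larger than `full`. A softmax loss could not be split this way without first gathering the normalization sums over all pairs in the batch.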
Created on 03 Sep. 2024


