We explore how much text-only language models naturally align with the visual world and find that off-the-shelf LLMs effectively encode visually relevant semantics. With ShareLock, we leverage these insights in an ultra-lightweight vision-language model that achieves 52% zero-shot accuracy on ImageNet while being trained on just 563k image-caption pairs in less than 1 GPU hour.
We measure the degree to which language representations facilitate generalization in the vision domain in a CLIP-like setup. By freezing the language model and strictly controlling the concepts seen during training and inference, we can isolate the visual information encoded by the language model. Intriguingly, we find that the general model performance (measured by MMLU-Pro) correlates strongly with the model's visual generalization performance. This suggests that better language models naturally align with the visual world.
LLMs can effectively absorb and interpolate substantial amounts of factual knowledge about the visual world.
Decoder-based language models consistently outperform encoder architectures on visual tasks. Gemma-2 (9B), an off-the-shelf LLM, represented visual information best.
We discovered a strong correlation between a language model's general capabilities and its visual understanding, with a Pearson correlation of 0.768.
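As a concrete illustration of how such a correlation is computed, the small Python sketch below pairs per-model MMLU-Pro scores with visual generalization accuracies and computes Pearson's r. The numbers are placeholders for illustration only, not values from the paper.

```python
from scipy.stats import pearsonr

# Placeholder scores for a handful of hypothetical language models:
# one MMLU-Pro score and one visual generalization accuracy per model.
mmlu_pro_scores = [0.28, 0.41, 0.47, 0.56, 0.63]
visual_accuracies = [0.31, 0.38, 0.45, 0.52, 0.58]

# Pearson's r measures the strength of the linear relationship between
# general language capability and visual alignment across models.
r, p_value = pearsonr(mmlu_pro_scores, visual_accuracies)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```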
ShareLock adopts a modular design that combines frozen vision and language models to extract high-quality unimodal features. These features are then aligned in a shared embedding space through a lightweight, learnable projection head. The projection network on top of the frozen language representations is optimized using a contrastive loss, ensuring that image-text pairs are effectively matched in the latent space. This architecture allows efficient training on limited data while maintaining competitive performance.
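To make this concrete, the following PyTorch sketch shows what such an alignment step could look like when operating on cached features from frozen backbones. The class names, layer sizes, temperature, and optimizer settings are illustrative assumptions, not the released ShareLock implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Lightweight MLP mapping frozen LLM text features into the image feature space."""
    def __init__(self, in_dim: int, out_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def clip_style_loss(img_emb, txt_emb, temperature: float = 0.07):
    """Symmetric InfoNCE loss: matching image-text pairs attract, all others repel."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy training step on cached features (feature dimensions are placeholders).
image_feats = torch.randn(256, 768)     # precomputed frozen vision-model features
text_feats = torch.randn(256, 3584)     # precomputed frozen LLM caption features
head = ProjectionHead(in_dim=3584, out_dim=768)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

optimizer.zero_grad()
loss = clip_style_loss(image_feats, head(text_feats))
loss.backward()
optimizer.step()
```

Because both backbones stay frozen, only the projection head receives gradients, which is what keeps training cheap enough to fit into a single GPU hour on small datasets.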
In ShareLock, we leverage the strong visual alignment of LLM representations by combining them with a frozen vision model. This allows us to achieve performance on vision tasks that is competitive with OpenAI's CLIP model, despite our models being trained with a fraction of the data and compute. We compare ShareLock trained on the CC12M dataset to CLIP and LiT models trained on the same data (for more datasets, see our paper). The following sections highlight ShareLock's performance on various vision tasks.
Thanks to the strong LLM representations, ShareLock excels in vision tasks like image classification. It achieves 62.0% accuracy on ImageNet-1k, surpassing CLIP (41.6%) and LiT (59.9%) trained on the same data. It also improves robustness on ImageNet-R and ImageNet-A, handling out-of-distribution images better with far less training data, even when compared to OpenAI's CLIP. While fine-grained tasks like Aircraft remain challenging in low-data regimes, ShareLock consistently outperforms models trained on the same data.
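For reference, zero-shot classification with such a model reduces to nearest-class retrieval in the shared embedding space. The minimal sketch below reuses the hypothetical projection head from the previous snippet; prompt templates and feature extraction are omitted, and the function name is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_feats, class_text_feats, head):
    """Assign each image to the nearest class-name embedding in the shared space.

    image_feats: (B, D_img) frozen vision-model features.
    class_text_feats: (C, D_txt) frozen LLM features of class-name prompts.
    head: trained projection mapping text features to the image space.
    """
    class_emb = F.normalize(head(class_text_feats), dim=-1)   # (C, D_img)
    img_emb = F.normalize(image_feats, dim=-1)                # (B, D_img)
    return (img_emb @ class_emb.t()).argmax(dim=-1)           # predicted class ids
```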
Most VLMs struggle with non-English languages because most training data is in English. ShareLock overcomes this limitation by leveraging the multilingual strengths of LLMs. Even with fewer training samples, ShareLock significantly outperforms traditional models like CLIP in languages like Chinese (38.7% vs. 1.4% accuracy) and Japanese (19.8% vs. 4.1%). This makes ShareLock especially powerful for low-resource languages, where high-quality multimodal data is scarce.
Understanding fine-grained linguistic differences remains a challenge for VLMs. While ShareLock improves image selection accuracy over OpenAI's CLIP (12.5 vs. 10.8), it still struggles with nuanced compositional reasoning, as seen in benchmarks like Winoground. This suggests that the conventional contrastive vision-language alignment on web-based captions may not be sufficient for more complex tasks.
Compared to prior CLIP-like models, ShareLock clearly exhibits more favorable properties in highly data-constrained training regimes, as depicted in the Figure below. When training from scratch, vanilla CLIP models require orders of magnitude more data to achieve similar performance.
ShareLock achieves comparable or better performance than CLIP while using:
- a fraction of the training data (as few as 563k image-caption pairs),
- less than 1 GPU hour of training for the alignment step,
- frozen, off-the-shelf vision and language backbones, with only a lightweight projection head being trained.
ShareLock demonstrates strong performance across:
- zero-shot image classification (e.g., ImageNet-1k),
- robustness benchmarks such as ImageNet-R and ImageNet-A,
- multilingual classification in languages like Chinese and Japanese,
- compositional reasoning benchmarks like Winoground, where room for improvement remains.
@article{ruthardt2024sharelock,
title={Better Language Models Exhibit Higher Visual Alignment},
author={Jona Ruthardt and Gertjan J. Burghouts and Serge Belongie and Yuki M. Asano},
journal={arXiv preprint arXiv:2410.07173},
year={2024}
}