Better Language Models Exhibit Higher Visual Alignment

1FunAI Lab, University of Technology Nuremberg
2Intelligent Imaging, TNO
3Department of Computer Science, University of Copenhagen

TL;DR

We explore how much text-only language models naturally align with the visual world and find that off-the-shelf LLMs effectively encode visually relevant semantics. With ShareLock, we leverage these insights in an ultra-lightweight vision-language model that achieves 52% zero-shot accuracy on ImageNet while being trained on just 563k image-caption pairs in under 1 GPU hour.

Visual Generalization of Language Embeddings

We measure the degree to which language representations facilitate generalization in the vision domain in a CLIP-like setup. By freezing the language model and strictly controlling the concepts seen during training and inference, we can isolate the visual information encoded by the language model. Intriguingly, we find that the general model performance (measured by MMLU-Pro) correlates strongly with the model's visual generalization performance. This suggests that better language models naturally align with the visual world.
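As a concrete illustration, the snippet below sketches the zero-shot evaluation step under these constraints: class names are embedded by the frozen LLM, passed through a small learned projection, and matched to frozen image features by cosine similarity. `embed_text` and the precomputed `image_feats` are placeholders standing in for the frozen encoders; this is a minimal sketch, not the paper's exact evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_accuracy(proj, image_feats, labels, class_names, embed_text):
    """Classify held-out images by cosine similarity between projected
    frozen-LLM embeddings of class names and frozen image features."""
    text_feats = embed_text(class_names)              # (C, d_text) from the frozen LLM
    text_emb = F.normalize(proj(text_feats), dim=-1)  # (C, d_img) after the learned head
    img_emb = F.normalize(image_feats, dim=-1)        # (N, d_img) from the frozen vision model
    preds = (img_emb @ text_emb.T).argmax(dim=-1)     # nearest class name per image
    return (preds == labels).float().mean().item()
```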

Figure 1: Visual Generalization of Language Models vs. MMLU-Pro Scores

Key Insights

LLMs Encode Visual Knowledge

LLMs can effectively absorb and interpolate substantial amounts of factual knowledge about the visual world.

Decoder-Based Models Excel

Decoder-based language models consistently outperform encoder architectures in visual tasks. Among the models we evaluated, Gemma-2 (9B), an off-the-shelf LLM, represented visual information best.

LLM Capability Predicts Visual Performance

We discovered a strong correlation between a language model's general capabilities and its visual understanding, with a Pearson correlation of 0.768.
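For reference, such a correlation can be computed from per-model scores in a few lines; the numbers below are illustrative placeholders, not the values reported in the paper.

```python
# Illustrative only: placeholder scores, not the paper's measurements.
from scipy.stats import pearsonr

mmlu_pro   = [31.2, 38.5, 45.1, 52.4, 61.0]   # general LLM capability (MMLU-Pro)
visual_acc = [42.3, 47.8, 55.0, 58.6, 66.1]   # visual generalization accuracy
r, p = pearsonr(mmlu_pro, visual_acc)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```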

Model Architecture

Figure 2: Diagram of the ShareLock model architecture

ShareLock adopts a modular design that combines frozen vision and language models to extract high-quality unimodal features. These features are then aligned in a shared embedding space through a lightweight, learnable projection head. The projection network on top of the frozen language representations is optimized using a contrastive loss, ensuring that image-text pairs are effectively matched in the latent space. This architecture allows efficient training on limited data while maintaining competitive performance.
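The sketch below illustrates this training setup, assuming image and text features have already been extracted by the frozen encoders; the projection head layout, hidden size, and temperature are illustrative choices rather than the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextProjection(nn.Module):
    """Lightweight learnable head mapping frozen LLM features into the
    image feature space."""
    def __init__(self, d_text, d_img, d_hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_text, d_hidden), nn.GELU(),
            nn.Linear(d_hidden, d_img),
        )

    def forward(self, x):
        return self.net(x)

def contrastive_loss(img_feats, txt_feats, proj, temperature=0.07):
    """Symmetric CLIP-style InfoNCE loss on a batch of paired features."""
    img = F.normalize(img_feats, dim=-1)        # frozen vision features
    txt = F.normalize(proj(txt_feats), dim=-1)  # projected frozen LLM features
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(len(img), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

Because the backbones stay frozen, only the projection parameters receive gradients, which is what keeps training lightweight enough to fit in under a GPU hour.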

Leveraging LLMs in CLIP-like VLMs

In ShareLock, we leverage the strong visual alignment of LLM representations by combining them with a frozen vision model. This allows us to achieve performance on vision tasks that is competitive with OpenAI's CLIP model, despite our models being trained with a fraction of the data and compute. We compare ShareLock trained on the CC12M dataset to CLIP and LiT models trained on the same data (for more datasets, see our paper). The following sections highlight ShareLock's performance on various vision tasks.

Classification Abilities

Thanks to the strong LLM representations, ShareLock excels at vision tasks like image classification. It achieves 62.0% accuracy on ImageNet-1k, surpassing CLIP (41.6%) and LiT (59.9%) trained on the same data. It also improves robustness on ImageNet-R and ImageNet-A, handling out-of-distribution images better despite using less training data, even when compared to OpenAI's CLIP. While fine-grained tasks like Aircraft remain challenging in low-data regimes, ShareLock consistently outperforms models trained on the same data.

Figure 3: Image classification abilities of VLMs trained on CC12M.

Multi-Lingual Understanding

Most VLMs struggle with non-English languages because most training data is in English. ShareLock overcomes this limitation by leveraging the multilingual strengths of LLMs. Even with fewer training samples, ShareLock significantly outperforms traditional models such as CLIP in languages like Chinese (38.7% vs. 1.4% accuracy) and Japanese (19.8% vs. 4.1%). This makes ShareLock especially powerful for low-resource languages, where high-quality multimodal data is scarce.

Figure 4: Multi-lingual generalization abilities of VLMs trained on CC12M.

Compositional Reasoning

Understanding fine-grained linguistic differences remains a challenge for VLMs. While ShareLock improves image selection accuracy over OpenAI's CLIP (12.5 vs. 10.8), it still struggles with nuanced compositional reasoning, as seen in benchmarks like Winoground. This suggests that the conventional contrastive vision-language alignment on web-based captions may not be sufficient for more complex tasks.
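For context, the sketch below shows how a Winoground-style image score can be computed; `score(image, caption)` is a placeholder for any image-text similarity function, such as the cosine similarity of the VLM's embeddings, and the loop is a simplified illustration rather than the benchmark's official evaluation code.

```python
def winoground_image_score(examples, score):
    """Image score: an example counts only if the matching image is preferred
    for *both* of its captions. `score(image, caption)` is a placeholder for
    an image-text similarity function."""
    correct = 0
    for ex in examples:  # each ex pairs image_0/image_1 with caption_0/caption_1
        ok_0 = score(ex["image_0"], ex["caption_0"]) > score(ex["image_1"], ex["caption_0"])
        ok_1 = score(ex["image_1"], ex["caption_1"]) > score(ex["image_0"], ex["caption_1"])
        correct += int(ok_0 and ok_1)
    return 100.0 * correct / len(examples)
```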

Figure 5: Compositional reasoning of VLMs trained on CC12M on Winoground.

Data Scaling

Compared to prior CLIP-like models, ShareLock exhibits markedly more favorable scaling behavior in highly data-constrained training regimes, as depicted in the figure below. When trained from scratch, vanilla CLIP models require orders of magnitude more data to achieve similar performance.

Figure 6: Scaling laws of various CLIP-like models.

Primary Benefits

Efficient Performance

ShareLock achieves comparable or better performance than CLIP while using:

  • Only 8.5M training samples (vs 400M)
  • Significantly fewer compute resources
  • Efficient feature precomputation (see the sketch below)
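A minimal sketch of such feature precomputation, assuming generic frozen `vision_model` and `language_model` callables and a standard data loader; these names are placeholders rather than the released codebase's API.

```python
import torch

@torch.no_grad()
def precompute_features(loader, vision_model, language_model, path="features.pt"):
    """Run the frozen backbones once over the dataset and cache their outputs,
    so projection-head training never has to touch the large models again."""
    img_feats, txt_feats = [], []
    for images, captions in loader:
        img_feats.append(vision_model(images).cpu())
        txt_feats.append(language_model(captions).cpu())
    torch.save({"image": torch.cat(img_feats), "text": torch.cat(txt_feats)}, path)
```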

Robust Generalization

ShareLock demonstrates strong performance across:

  • Multiple languages
  • Out-of-distribution scenarios
  • Fine-grained classification tasks

BibTeX

@article{ruthardt2024sharelock,
  title={Better Language Models Exhibit Higher Visual Alignment},
  author={Jona Ruthardt and Gertjan J. Burghouts and Serge Belongie and Yuki M. Asano},
  journal={arXiv preprint arXiv:2410.07173},
  year={2024}
}