Better Language Models Exhibit Higher Visual Alignment

1FunAI Lab, University of Technology Nuremberg
2Intelligent Imaging, TNO
3Department of Computer Science, University of Copenhagen
TMLR 2026

TL;DR

We explore how much text-only language models naturally align with the visual world and find that off-the-shelf generative LLMs effectively encode visually relevant semantics. With ShareLock, we leverage these insights in an ultra-lightweight vision-language model that reaches 52% zero-shot accuracy on ImageNet despite being trained on just 563k image-caption pairs in under 1 GPU hour.

Visual Generalization of Language Embeddings

First, we measure the degree to which language representations facilitate generalization in the vision domain in a CLIP-like setup. To prevent concept leakage from weakening generalization claims (Fig. 1, left), our protocol enforces strict separation between the concepts that are explicitly aligned during training and the unaligned concepts used for evaluation (Fig. 1, middle). Freezing the vision and language backbones and training only a lightweight adapter network allows us to isolate the visual information encoded by the language model.

Figure 1: VLM (e.g., CLIP) vs. alignment probing (ours) training and evaluation.
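To make the probing protocol concrete, below is a minimal PyTorch sketch of the idea, not our exact code: a small adapter is trained on frozen features for one split of concepts and then evaluated zero-shot on a disjoint, held-out split. Feature dimensions, the temperature, and the training objective shown here are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Lightweight projection trained on top of frozen language embeddings."""
    def __init__(self, dim_text: int, dim_img: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_text, hidden), nn.ReLU(), nn.Linear(hidden, dim_img)
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

@torch.no_grad()
def heldout_accuracy(adapter, img_feats, labels, class_text_feats):
    """Zero-shot accuracy on concepts that were never aligned during training."""
    protos = adapter(class_text_feats)                    # (C, d_img) class prototypes
    sims = F.normalize(img_feats, dim=-1) @ protos.T      # (N, C) cosine similarities
    return (sims.argmax(dim=-1) == labels).float().mean().item()

# Training touches only the *seen* concept split; evaluation uses the disjoint
# held-out split, so above-chance accuracy reflects visual knowledge that
# generalizes from the frozen language representations.
adapter = Adapter(dim_text=4096, dim_img=768)
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-3)

img_feats = torch.randn(512, 768)            # frozen vision features (placeholder values)
labels = torch.randint(0, 100, (512,))       # labels over 100 *seen* classes
seen_text_feats = torch.randn(100, 4096)     # frozen LLM embeddings of seen class names

logits = F.normalize(img_feats, dim=-1) @ adapter(seen_text_feats).T / 0.07
loss = F.cross_entropy(logits, labels)
loss.backward()
opt.step()

heldout_text_feats = torch.randn(50, 4096)   # embeddings of *unseen* class names
heldout_imgs = torch.randn(200, 768)
heldout_labels = torch.randint(0, 50, (200,))
print(heldout_accuracy(adapter, heldout_imgs, heldout_labels, heldout_text_feats))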

Intriguingly, we find that a model's general capabilities (measured by MMLU-Pro) correlate strongly with its visual generalization performance (Pearson r = 0.768). This suggests that better language models naturally align with the visual world.

Figure 2: Visual generalization vs. language comprehension.
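For readers who want to reproduce this kind of analysis, a minimal sketch of the correlation computation is shown below; the numbers are illustrative placeholders, not values from the paper.

# Per-model pairs of (MMLU-Pro score, visual generalization accuracy);
# the values here are placeholders, not our measured results.
from scipy.stats import pearsonr

mmlu_pro = [0.25, 0.38, 0.45, 0.52, 0.61]      # general LLM capability per model
visual_gen = [0.31, 0.40, 0.44, 0.55, 0.63]    # zero-shot accuracy on held-out concepts

r, p = pearsonr(mmlu_pro, visual_gen)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")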

Moreover, for models trained on identical data at matched sizes, we find that decoder-based architectures generalize better visually than encoder-based models. This indicates that the autoregressive training objective may better capture visually relevant semantics.

Figure 3: Encoder- vs. decoder-based language models.

Key Insights

👁️ LLMs Encode Visual Knowledge

LLMs can effectively absorb and interpolate substantial amounts of factual knowledge about the visual world.

🚀 Decoder-Based Models Excel

Decoder-based language models consistently outperform encoder architectures on visual tasks. Gemma-2 (9B), an off-the-shelf LLM, represented visual information best in our experiments.

📊 LLM Capability Predicts Visual Performance

We find a strong correlation between a language model's general capabilities and its visual understanding (Pearson r = 0.768).

🕵🏼‍♂️ ShareLock VLM

ShareLock adopts a modular design that combines frozen vision and language models to extract high-quality unimodal features. These features are then aligned in a shared embedding space through a lightweight, learnable projection head. The projection network on top of the frozen language representations is optimized using a contrastive loss, ensuring that image-text pairs are effectively matched in the latent space. This architecture allows efficient training on limited data while maintaining competitive performance.

Figure 4: Diagram of the ShareLock model architecture.
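The following is a minimal PyTorch sketch of this training setup: a learnable projection head on top of frozen LLM caption embeddings, optimized with a symmetric contrastive (InfoNCE) loss against frozen vision features. Module names, dimensions, and hyperparameters are illustrative assumptions, not our exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps frozen LLM caption embeddings into the frozen vision feature space."""
    def __init__(self, dim_text: int, dim_img: int, hidden: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_text, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim_img),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image-caption pairs."""
    img = F.normalize(img_feats, dim=-1)
    logits = img @ txt_feats.T / temperature              # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# One training step on precomputed (frozen) features:
head = ProjectionHead(dim_text=4096, dim_img=768)
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

img_feats = torch.randn(256, 768)     # frozen vision-encoder features (placeholder)
txt_feats = torch.randn(256, 4096)    # frozen LLM caption embeddings (placeholder)

loss = contrastive_loss(img_feats, head(txt_feats))
loss.backward()
opt.step()

Because both backbones stay frozen, their features can be computed once and cached, so each training step only runs the small projection head.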

In ShareLock, we leverage the strong visual alignment of LLM representations by combining them with a frozen vision model. This allows us to achieve performance on vision tasks that is competitive with OpenAI's CLIP, despite our models being trained with a fraction of the data and compute. We compare ShareLock trained on the CC12M dataset to CLIP and LiT models trained on the same data (for more datasets, see our paper). The following sections highlight ShareLock's performance on various vision tasks.

Classification Abilities

Thanks to the strong LLM representations, ShareLock excels at vision tasks like image classification. It achieves 62.0% accuracy on ImageNet-1k, surpassing CLIP (41.6%) and LiT (59.9%) trained on the same data. It also improves robustness on ImageNet-R and ImageNet-A, handling out-of-distribution images better despite less training data, even when compared to OpenAI's CLIP. While fine-grained tasks like Aircraft remain challenging in low-data regimes, ShareLock consistently outperforms models trained on the same data.

Figure 5: Image classification abilities of VLMs.

Multilingual Understanding

Most VLMs struggle with non-English languages because their training data is predominantly English. ShareLock overcomes this limitation by leveraging the multilingual strengths of LLMs. Even with fewer training samples, ShareLock significantly outperforms traditional models like CLIP in languages such as Chinese (38.7% vs. 1.4% accuracy) and Japanese (19.8% vs. 4.1%). This makes ShareLock especially powerful for low-resource languages, where high-quality multimodal data is scarce.

Figure 6: Multilingual generalization abilities of VLMs.

Compositional Reasoning

Understanding fine-grained linguistic differences remains a challenge for VLMs. While ShareLock improves image selection accuracy over OpenAI's CLIP (12.5 vs. 10.8), it still struggles with nuanced compositional reasoning, as seen in benchmarks like Winoground. This suggests that the conventional contrastive vision-language alignment on web-based captions may not be sufficient for more complex tasks.

Figure 7: Compositional reasoning on Winoground.
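For reference, Winoground's image score counts an example as correct only if each of its two captions is matched to its own image rather than the swapped one. A schematic sketch is below, where sim stands for any image-text similarity function (e.g., cosine similarity of ShareLock or CLIP embeddings); this is an illustration of the metric, not the benchmark's official code.

def image_score(examples, sim):
    """examples: iterable of (caption_0, caption_1, image_0, image_1) tuples."""
    correct = 0
    for c0, c1, i0, i1 in examples:
        # Both captions must prefer their own image over the swapped one.
        if sim(c0, i0) > sim(c0, i1) and sim(c1, i1) > sim(c1, i0):
            correct += 1
    return correct / len(examples)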

Data Scaling

Compared to prior CLIP-like models, ShareLock exhibits markedly more favorable scaling behavior in highly data-constrained training regimes, as depicted in the figure below. When trained from scratch, vanilla CLIP models require orders of magnitude more data to reach similar performance.

Figure 8: Scaling laws of various CLIP-like models.

🏆 Primary Benefits

⏱️ Efficient Performance

ShareLock achieves performance comparable to or better than CLIP while using:

  • Only 8.5M training samples (vs. 400M for CLIP)
  • Significantly fewer compute resources
  • Efficient feature precomputation (see the sketch below)
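Feature precomputation is what keeps training cheap: every image and caption is embedded once with the frozen backbones and cached, so only the tiny projection head touches data during training. A rough sketch follows, with vision_model, language_model, and the data loader as placeholders.

import torch

@torch.no_grad()
def precompute_features(loader, vision_model, language_model, path="features.pt"):
    """Embed the whole dataset once with the frozen backbones and cache to disk."""
    img_feats, txt_feats = [], []
    for images, captions in loader:
        img_feats.append(vision_model(images).cpu())      # frozen vision features
        txt_feats.append(language_model(captions).cpu())  # frozen LLM caption embeddings
    torch.save({"img": torch.cat(img_feats), "txt": torch.cat(txt_feats)}, path)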

🛡️ Robust Generalization

ShareLock demonstrates strong performance across:

  • Multiple languages
  • Out-of-distribution scenarios
  • Fine-grained classification tasks

📜 BibTeX

@article{ruthardt2024sharelock,
  title={Better Language Models Exhibit Higher Visual Alignment},
  author={Jona Ruthardt and Gertjan J. Burghouts and Serge Belongie and Yuki M. Asano},
  journal={Transactions on Machine Learning Research (TMLR)},
  year={2026}
}