🕵🏼‍♂️ ShareLock: An Ultra-Lightweight CLIP-like Vision-Language Model

1FunAI Lab, University of Technology Nuremberg
2Intelligent Imaging, TNO
3Department of Computer Science, University of Copenhagen

TL;DR

ShareLock is an ultra-lightweight vision-language model that achieves competitive multimodal performance by leveraging frozen features from state-of-the-art unimodal models. Trained on just 563k image-caption pairs, it achieves 51% zero-shot accuracy on ImageNet and outperforms existing methods in low-data regimes, with a total training time of 1 GPU hour.

Model Architecture

Figure 1: Diagram of the ShareLock model architecture.

ShareLock adopts a modular design that combines frozen vision and language models to extract high-quality unimodal features. These features are aligned in a shared embedding space through a lightweight, learnable projection head: a projection network on top of the frozen language representations is optimized with a contrastive loss so that matching image-text pairs are pulled together in the latent space. This design enables efficient training on limited data while maintaining competitive performance.
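
The sketch below illustrates this setup in PyTorch: frozen image and caption features are assumed to be precomputed and cached, and only a small text-side projection head is trained with a symmetric contrastive (InfoNCE) loss. All dimensions, hyperparameters, and the head architecture are illustrative assumptions, not the exact configuration used in the paper.

```python
# Minimal sketch of a ShareLock-style training step on cached, frozen features.
# Feature dimensions and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Lightweight MLP mapping frozen text features into the shared embedding space."""
    def __init__(self, in_dim: int, embed_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings for cosine similarity

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs."""
    logits = img_emb @ txt_emb.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)     # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Placeholder tensors standing in for precomputed features from the frozen backbones
# (e.g. a vision encoder producing 1024-d features and an LLM producing 4096-d features).
img_feats = torch.randn(256, 1024)
txt_feats = torch.randn(256, 4096)

text_head = ProjectionHead(in_dim=4096, embed_dim=1024)              # only this head is trained
optimizer = torch.optim.AdamW(text_head.parameters(), lr=1e-3)

optimizer.zero_grad()
loss = contrastive_loss(F.normalize(img_feats, dim=-1), text_head(txt_feats))
loss.backward()
optimizer.step()
```

Because the backbones stay frozen, features can be extracted once and reused across epochs, which is consistent with the reported total training cost of roughly one GPU hour.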

Results

Compared to prior CLIP-like models, ShareLock exhibits clearly more favorable properties in highly data-constrained training regimes, as depicted in Figure 2 below. Vanilla CLIP models trained from scratch require orders of magnitude more data to reach comparable performance. At the same time, ShareLock follows a similar improvement trajectory, suggesting comparable scaling characteristics and the ability to capitalize on the availability of larger datasets. These findings underline the effectiveness of leveraging features from state-of-the-art unimodal models in a multimodal setting.

Figure 2: Scaling laws of various CLIP-like models.
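
For reference, zero-shot classification with such a model can be sketched as follows. The prompt template, the `frozen_text_encoder` callable, and the tensor shapes are hypothetical placeholders, not the exact evaluation protocol behind the reported numbers.

```python
# Hypothetical zero-shot classification sketch: class names are turned into prompts,
# embedded by the frozen language model plus the trained projection head, and compared
# against frozen image features by cosine similarity.
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_feats, class_names, frozen_text_encoder, text_head):
    prompts = [f"a photo of a {name}" for name in class_names]
    txt_feats = frozen_text_encoder(prompts)        # frozen LLM features, e.g. (C, 4096)
    class_emb = text_head(txt_feats)                 # projected into the shared space, (C, D)
    img_emb = F.normalize(image_feats, dim=-1)       # frozen image features, (N, D)
    return (img_emb @ class_emb.t()).argmax(dim=-1)  # predicted class index per image
```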

BibTeX

@article{ruthardt2024sharelock,
  title={Do Better Language Models Have Crisper Vision?},
  author={Jona Ruthardt and Gertjan J. Burghouts and Serge Belongie and Yuki M. Asano},
  journal={arXiv preprint arXiv:2410.07173},
  year={2024}
}