SteerViT: Steerable Visual Representations

1University of Technology Nuremberg
2Carnegie Mellon University
3International Institute of Information Technology, Hyderabad
*Equal contribution  |  †Equal advising

TL;DR

We introduce Steerable Visual Representations, a new class of visual representations whose global and local features can be steered with natural language. Our SteerViT method turns any pretrained Vision Transformer into a query-aware visual encoder by injecting text directly into the layers of the visual encoder via lightweight gated cross-attention. The result is a vision-centric multimodal representation that can focus on the object, attribute, or abstraction you care about, while remaining useful for transfer.

Given an image and a text query, SteerViT produces prompt-conditioned local and global visual features by steering the vision encoder itself, rather than only fusing text after visual encoding.

Across conditional retrieval, targeted attention, personalized object discrimination, and industrial anomaly segmentation, SteerViT shows that language can guide what vision encodes without giving up the strengths of pretrained visual representations.

Why Steerable Visual Representations?

Compare vision encoders by the regions they attend to and the semantics captured by their global embeddings.

Side-by-side comparison (same query image for both models):
DINOv2 (query-agnostic baseline): CLS attention heatmap and top-4 retrievals.
SteerViT (prompt-steerable visual encoder): CLS attention heatmap and top-4 retrievals.

Pretrained ViTs such as DINOv2 provide strong generic visual features, but they are typically query-agnostic: without any extra input, they tend to encode the most prominent object or scene in the image. This is useful for general-purpose vision, but limiting for tasks that require focusing on a less salient object, a specific attribute, or a different level of semantic abstraction.

Existing multimodal systems do not fully solve this problem. Cross-modal encoders usually fuse text after visual encoding, so the visual backbone itself is still not steerable. Multimodal large language models (MLLMs) are more flexible, but their representations are often language-centric and come with much higher computational cost.

SteerViT takes a different route: instead of conditioning language on vision, it conditions vision on language. The goal is not just to produce the right answer for one downstream task, but to create a new class of prompt-aware visual representations that remain broadly useful.

Steerability versus representation quality
Figure 1: Steerability vs. representation quality.
While prior approaches often trade off these two properties, SteerViT pushes out the Pareto frontier, offering high steerability while retaining representation quality.

Core Contributions

Steerable Visual Representations

Language changes both where the model attends and what the global representation encodes, including the semantic granularity and grouping principle of the learned features.

Lightweight Conditioning of Frozen ViTs

SteerViT turns any pretrained ViT into a query-aware encoder through lightweight gated cross-attention, adding only 20M multimodal parameters.

Steering Without Sacrificing Utility

The resulting representations remain broadly useful, supporting retrieval, targeted attention, semantic control, and zero-shot transfer to new downstream domains.

How SteerViT Works

1. Start from any pretrained ViT.
All original parameters remain frozen, preserving the structure and strengths of the pretrained visual backbone.

2. Inject language into visual processing.
SteerViT interleaves lightweight cross-attention blocks within the ViT to enable visual tokens to attend to text tokens. A zero-initialized tanh gate ensures the model starts exactly as the original frozen ViT and only gradually incorporates language during training.

3. Training with referential grounding.
We use a patch-wise referential segmentation objective that requires the model to indicate which regions correspond to the prompt. This encourages the model to learn how to steer the visual representation toward the queried concept, rather than just adding extra information at the output level.
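The gated conditioning described in step 2 can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the released implementation: the module name, head count, and the interleaving loop are ours, and the projection details of the actual cross-attention blocks may differ.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Visual tokens attend to text tokens; a zero-initialized tanh gate
    makes the block an exact identity at the start of training, so the
    frozen ViT's behavior is preserved until the gate opens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 at initialization

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(self.norm(visual_tokens), text_tokens, text_tokens)
        return visual_tokens + torch.tanh(self.gate) * attended

# Conceptually, these blocks are interleaved with the frozen ViT layers
# (illustrative pseudocode, names assumed):
#   for vit_block, xattn in zip(vit.blocks, conditioning_blocks):
#       x = vit_block(x)
#       x = xattn(x, text_tokens)
```

Because the gate is zero at initialization, the conditioned model is numerically identical to the frozen backbone before training, which is what allows language to be incorporated gradually.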

SteerViT architecture diagram
Figure 2: SteerViT architecture.
Text is injected into a frozen ViT via lightweight gated cross-attention, trained with a patch-level grounding objective.

Steering Global Semantics with Text

To test whether global image features can really be redirected by text, we introduce CORE, a conditional retrieval benchmark built around small, scene-dependent objects placed into cluttered scenes.

SteerViT reaches 95.9% acc@1 on CORE, compared to 43.5% for DINOv2. This shows that text conditioning can shift the representation from the dominant scene concept (e.g., “kitchen”) toward the actual object of interest (e.g., “fruit bowl”).
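Once prompt-conditioned global embeddings are available, the retrieval protocol itself reduces to cosine-similarity ranking. A minimal sketch, with the encoder call abstracted away and the helper name our own:

```python
import numpy as np

def top_k_retrieval(query_emb: np.ndarray, gallery_embs: np.ndarray, k: int = 4) -> np.ndarray:
    """Return indices of the k gallery embeddings most cosine-similar
    to the query embedding (all embeddings L2-normalized first)."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:k]
```

The point of CORE is that what changes between DINOv2 and SteerViT is not this ranking step but the embeddings fed into it: conditioning on the prompt moves the query's nearest neighbors from scene matches to object matches.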

CORE benchmark results
Figure 3: CORE benchmark.
DINOv2 clusters by scene; SteerViT reorganizes the embedding space around the queried concept.
Qualitative CORE retrieval examples
Figure 4: Qualitative retrieval on CORE.
Even when the queried object is small and visually secondary, SteerViT retrieves images that contain the correct object rather than only matching the scene.

Text Enables Targeted Attention

Steerability is not only visible in nearest-neighbor retrieval. It also changes how the model aggregates information.

On the MOSAIC benchmark, where four images are stitched into a single composite image, SteerViT redirects the frozen [CLS] attention toward prompt-relevant regions. DINOv2 still focuses on the most salient objects, while SteerViT can attend to small or non-dominant targets.

Quantitatively, this raises attention localization from 14.3 to 42.5 PR-AUC.

MOSAIC attention maps
Figure 5: Prompt-guided attention steering.
The prompt changes where the global representation gathers evidence from.

Text Controls Semantic Granularity

Changing the prompt smoothly reshapes the projected embedding space.

SteerViT does not just change what the model attends to. It also changes how the visual embedding space is organized.

Broad prompts such as “animal” produce coarse semantic groupings, while more specific prompts such as “bird” sharpen the separation of individual concepts.

Text can even steer the space by compositional attributes rather than category. For example, prompting with “eye” groups together classes that share this property, even across semantic boundaries.

Zero-Shot Transfer to New Domains

The value of steerable visual representations is not only that they are controllable, but that they remain useful far beyond the training setup.

SteerViT transfers zero-shot to industrial anomaly segmentation, despite never being trained on those target domains. With prompts such as “the anomaly in the object”, we can repurpose the learned linear segmentation head to localize subtle defects.

On MVTec AD, SteerViT reaches 79.8 PRO, matching or approaching dedicated zero-shot anomaly segmentation methods.
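One way to picture repurposing the learned linear segmentation head: apply it to the prompt-conditioned patch features of a test image and reshape the logits to the patch grid. The function name, shapes, and interface below are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def anomaly_map(patch_feats: np.ndarray, w: np.ndarray, b: float, grid: tuple) -> np.ndarray:
    """patch_feats: (N, D) prompt-conditioned patch features for one image,
    obtained with a prompt like 'the anomaly in the object';
    w, b: weights of the learned linear grounding head.
    Returns per-patch anomaly logits reshaped to the (H, W) patch grid."""
    logits = patch_feats @ w + b
    return logits.reshape(grid)
```

Because the head was trained only for referential grounding, the zero-shot transfer hinges on the prompt steering the patch features themselves toward the defect.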

SteerViT anomaly segmentation examples
Figure 6: Zero-shot anomaly segmentation.
Prompt-conditioned segmentation maps are suitable for visual industrial inspection without task-specific training.

A Continuous Steerability–Quality Knob

The tanh-gated cross-attention layers do more than stabilize training. They also provide a useful post-hoc control knob at inference time.

By scaling the learned gates with a continuous steering factor, SteerViT can interpolate between the original frozen ViT behavior and the fully text-conditioned representation.

In practice, this yields a smooth Pareto frontier and allows one model to serve different operating regimes.
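Assuming the gate enters the residual stream as in the architecture description, the knob amounts to scaling the gated term by a factor alpha in [0, 1]. A minimal sketch with an assumed function name:

```python
import torch

def steered_residual(x: torch.Tensor, cross_attn_out: torch.Tensor,
                     gate: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """alpha = 0 reproduces the frozen ViT; alpha = 1 applies full text
    conditioning; intermediate values trade steerability against the
    original representation quality."""
    return x + alpha * torch.tanh(gate) * cross_attn_out
```

Because alpha only rescales an additive residual, it can be changed per query at inference time without retraining, which is what lets a single model serve different operating regimes.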

📜 BibTeX

@article{ruthardt2026steervit,
  title={Steerable Visual Representations},
  author={Jona Ruthardt and Manu Gaur and Deva Ramanan and Makarand Tapaswi and Yuki M. Asano},
  journal={arXiv},
  year={2026}
}