Stronger Semantic Encoders Can Harm Relighting Performance:
Probing Visual Priors via Augmented Latent Intrinsics

In Submission

1University of Amsterdam 2The University of Chicago 3Johns Hopkins University
Teaser figure

Stronger Semantic Encoders Can Harm Relighting Performance.
Left: Visual comparison on a scene with complex specular materials. The task is to relight the input image (top-left) under the target illumination (bottom-left), which requires moving the specular highlights from left to right, as indicated by the chrome sphere. Features from semantic encoders (CLIP, DINO) fail to reproduce realistic highlights; MAE features move the highlight plausibly but blur fine details such as text labels. Our method (top-right), which fuses features from RADIO (a pretrained model distilled from multiple vision encoders) with latent intrinsics, closely matches the ground truth.
Right: Quantitative analysis reveals a trade-off: across most encoders optimized for pure semantics, relighting quality (PSNR) is inversely correlated with recognition performance (ImageNet-1K linear-probing accuracy, as reported in the original papers). Our approach breaks this trend, achieving high performance on both tasks.
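For concreteness, the trade-off on the right can be quantified as a correlation between the two axes. Below is a minimal sketch, assuming per-encoder metrics collected by the reader; it illustrates the measurement only and is not the paper's evaluation code.

```python
# Minimal sketch: quantify the semantics-vs-relighting trade-off across a
# pool of encoders. Inputs are per-encoder metrics you collect yourself;
# the commented usage values are placeholders, not measurements.
from scipy.stats import pearsonr

def tradeoff_correlation(probe_acc, relight_psnr):
    """Pearson correlation between ImageNet-1K linear-probe accuracy and
    relighting PSNR across encoders. A strongly negative r is evidence of
    the trade-off described above."""
    r, p_value = pearsonr(probe_acc, relight_psnr)
    return r, p_value

# Usage (placeholder values, not measurements):
# r, p = tradeoff_correlation([68.0, 79.5, 84.1], [27.3, 25.9, 24.8])
```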

Abstract

Image-to-image relighting requires representations that disentangle scene properties from illumination. Recent methods rely on latent intrinsic representations, but these remain under-constrained and often fail on challenging materials such as metal and glass. A natural hypothesis is that stronger pretrained visual priors should resolve these failures. We find the opposite: features from top-performing semantic encoders often degrade relighting quality, revealing a fundamental trade-off between semantic abstraction and photometric fidelity. We study this trade-off and introduce Augmented Latent Intrinsics (ALI), which balances semantic context with dense photometric structure by fusing features from a pixel-aligned visual encoder into a latent-intrinsic framework, together with a self-supervised refinement strategy that mitigates the scarcity of paired real-world data. Trained only on unlabeled real-world image pairs, ALI achieves strong improvements in relighting, with the largest gains on complex specular materials.
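A minimal PyTorch sketch of the fusion described in the abstract follows. All module names, feature dimensions, and the illumination-conditioning mechanism are assumptions made for illustration; the actual ALI architecture may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AugmentedLatentIntrinsics(nn.Module):
    """Illustrative sketch: augment a latent-intrinsic representation with
    dense, pixel-aligned features from a frozen visual prior (e.g., RADIO).
    Layers and shapes are placeholders, not the paper's implementation."""

    def __init__(self, intrinsic_dim=256, prior_dim=1280, light_dim=64):
        super().__init__()
        # Latent-intrinsic branch: illumination-invariant scene properties.
        self.intrinsic_encoder = nn.Conv2d(3, intrinsic_dim, 3, padding=1)
        # Project the frozen prior's features to the intrinsic width.
        self.prior_proj = nn.Conv2d(prior_dim, intrinsic_dim, 1)
        # Fuse the two streams; the real model may use a richer mechanism.
        self.fuse = nn.Conv2d(2 * intrinsic_dim, intrinsic_dim, 1)
        # Map the target-illumination code to a per-channel modulation.
        self.light_proj = nn.Linear(light_dim, intrinsic_dim)
        self.decoder = nn.Conv2d(intrinsic_dim, 3, 3, padding=1)

    def forward(self, image, prior_feats, target_light):
        # image: B x 3 x H x W; prior_feats: B x prior_dim x h x w (dense
        # grid from the frozen encoder); target_light: B x light_dim.
        z = self.intrinsic_encoder(image)
        # Upsample the prior to pixel resolution so fusion stays pixel-aligned.
        p = F.interpolate(prior_feats, size=z.shape[-2:], mode="bilinear",
                          align_corners=False)
        z = self.fuse(torch.cat([z, self.prior_proj(p)], dim=1))
        # Condition on the target illumination (broadcast add as a stand-in).
        light = self.light_proj(target_light)[:, :, None, None]
        return self.decoder(z + light)
```

In this sketch the prior is kept frozen and merely upsampled and projected; the design point the abstract emphasizes is that the prior's features are dense and pixel-aligned, so they can be fused at spatial resolution rather than as a single global token.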