<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://pipeparodi.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://pipeparodi.com/" rel="alternate" type="text/html" /><updated>2026-06-05T12:19:23-07:00</updated><id>https://pipeparodi.com/feed.xml</id><title type="html">Felipe Parodi</title><subtitle>Computational Neuroethologist | PhD Candidate at UPenn | Research in AI, Deep Learning, and Primate Behavior | Specializing in Social Interaction Analysis</subtitle><author><name>Felipe Parodi</name></author><entry><title type="html">Awesome Active Inference: a community resource</title><link href="https://pipeparodi.com/blog/awesome-active-inference/" rel="alternate" type="text/html" title="Awesome Active Inference: a community resource" /><published>2026-05-27T00:00:00-07:00</published><updated>2026-05-27T00:00:00-07:00</updated><id>https://pipeparodi.com/blog/awesome-active-inference</id><content type="html" xml:base="https://pipeparodi.com/blog/awesome-active-inference/"><![CDATA[<div class="howtocv-links">
  <a href="https://github.com/felipe-parodi/awesome-active-inference" class="howtocv-link-btn"><i class="fab fa-github"></i> GitHub</a>
</div>

<p>Active inference is the claim that perception, action, and learning are all one process. An agent perceives and acts so as to minimize surprise, the gap between what it senses and what its model of the world predicted. Over about two decades Karl Friston and others have stretched this one idea to cover the cortex, psychiatric symptoms, robot control, and reinforcement learning.</p>

<figure>
  <img src="/assets/images/awesome-active-inference/smith-2022-generative-model-process.jpg" alt="Diagram showing the generative process and generative model coupled through observations and actions in a perception-action cycle" style="max-width: 520px;" />
  <figcaption>The whole setup in one picture. The world and the agent's model of it touch at only two points, what the agent observes and what it does. Everything in between is inference. From <a href="https://doi.org/10.1016/j.jmp.2021.102632">Smith, Friston &amp; Whyte, J. Math. Psychol. 2022</a> (CC BY-NC-ND 4.0).</figcaption>
</figure>

<p>I think it is one of the most interesting ideas in neuroscience, and one of the hardest to actually learn. The papers run across twenty years and several fields that barely cite each other, the math is heavy, and most introductions assume you have already read the other introductions. I bounced off it a few times before it stuck. <a href="https://github.com/felipe-parodi/awesome-active-inference"><strong>Awesome Active Inference</strong></a> is the reading list I wish someone had handed me at the start. It runs from textbooks and tutorials through the original free-energy papers, predictive coding, the discrete and continuous-time formulations, deep active inference, computational psychiatry, and the software libraries, in roughly the order I would read them now. There are two starting points at the top. Neuroscientists and machine-learning people need almost entirely different first papers, and the list says which.</p>

<p>The fastest way in for me was predictive coding, the idea that the cortex spends most of its effort predicting its own inputs and passing the errors upward. Rafal Bogacz’s 2017 tutorial writes this out as a small circuit you can actually simulate.</p>

<figure>
  <img src="/assets/images/awesome-active-inference/bogacz-2017-predictive-coding-network.jpg" alt="Schematic of a predictive coding neural network with prediction-error nodes and inhibitory and excitatory connections" style="max-width: 420px;" />
  <figcaption>Predictive coding as an actual circuit. Nodes send predictions down and prediction errors (ε) back up, using only local connections. From <a href="https://doi.org/10.1016/j.jmp.2015.11.003">Bogacz, J. Math. Psychol. 2017</a> (CC BY 4.0).</figcaption>
</figure>

<p>The other half is decision-making. In the discrete-state version an agent plans over a POMDP and trades off getting reward against getting information. Exploration drops out of the same quantity the agent is already minimizing. The continuous-time version connects to control theory and robotics. The same machinery serves as both a theory of what brains do and a recipe for building agents.</p>

<figure>
  <img src="/assets/images/awesome-active-inference/smith-2022-explore-exploit-pomdp.jpg" alt="Explore-exploit slot machine task framed as a POMDP with hidden state factors and outcome modalities" />
  <figcaption>A toy task from Smith et al.'s tutorial. Two slot machines, a hint you can pay for, and a choice about whether the information is worth the cost. This is what a POMDP looks like once it is concrete. From <a href="https://doi.org/10.1016/j.jmp.2021.102632">Smith, Friston &amp; Whyte, J. Math. Psychol. 2022</a> (CC BY-NC-ND 4.0).</figcaption>
</figure>

<p>Like the <a href="/blog/awesome-computational-primatology/">Awesome Computational Primatology</a> list, this is a public GitHub repo. If some paper, tutorial, lecture, or library is the thing that made active inference click for you, please add it.</p>]]></content><author><name>Felipe Parodi</name></author><category term="blog" /><category term="active-inference" /><category term="free-energy-principle" /><category term="predictive-coding" /><category term="computational-neuroscience" /><category term="machine-learning" /><category term="open-science" /><summary type="html"><![CDATA[Active inference is one of the most interesting ideas in neuroscience and one of the hardest to learn. This is the reading list I wish I'd had.]]></summary></entry><entry><title type="html">Zero-Ablation Overstates Register Function in Vision Transformers</title><link href="https://pipeparodi.com/blog/register-tokens-information-flow/" rel="alternate" type="text/html" title="Zero-Ablation Overstates Register Function in Vision Transformers" /><published>2026-03-04T00:00:00-08:00</published><updated>2026-03-04T00:00:00-08:00</updated><id>https://pipeparodi.com/blog/register-tokens-information-flow</id><content type="html" xml:base="https://pipeparodi.com/blog/register-tokens-information-flow/"><![CDATA[<div class="howtocv-links">
  <a href="https://github.com/felipe-parodi/howtocv" class="howtocv-link-btn"><i class="fab fa-github"></i> Code</a>
  <a href="#" class="howtocv-link-btn"><i class="fas fa-file-pdf"></i> Paper PDF</a>
  <a href="#citation" class="howtocv-link-btn"><i class="fas fa-quote-left"></i> Cite</a>
</div>

<p>Jonas and Kording (<a href="https://doi.org/10.1371/journal.pcbi.1005268">2017</a>) applied standard neuroscience analysis techniques to a microprocessor, lesioning individual transistors and measuring which were “necessary” for running Donkey Kong. The results were confident, publishable – and entirely misleading. Lesioning a component and observing what breaks reveals that the component was <em>involved in the circuit</em>, not that it <em>computed</em> the thing that broke.</p>

<p>We encountered an analogous problem with vision transformers.</p>

<p><strong>Zero-ablation</strong> – replacing token activations with zero vectors – is (was?) a widely used tool for probing token function in ViTs. When we zeroed register tokens in DINOv2+registers and DINOv3, classification dropped by up to 36.6 pp and segmentation by up to 30.9 pp (both in DINOv3). Registers appeared functionally indispensable. Yet when we replaced registers with dataset-mean activations, Gaussian noise, or even registers from <em>completely different images</em>, every task was preserved within 1 pp of baseline. The specific content of registers is dispensable; only their presence matters.</p>

<p>Registers do play a real structural role: they buffer dense features from CLS dependence (zeroing CLS collapses segmentation by 37 pp without registers but &lt;1 pp with them), and they compress patch geometry (effective rank 13.5 → 4.0). Their per-image content, however, is interchangeable. Zero-ablation overstated the story because zero vectors are out-of-distribution – the network never encountered them during training, and injecting them cascades disruption through every subsequent layer.</p>

<p>The remainder of this post describes each experiment and its implications.</p>

<hr />

<h2 id="background-vits-cls-and-registers">Background: ViTs, CLS, and Registers</h2>

<p>A Vision Transformer (ViT) divides an image into non-overlapping patches (typically 14×14 or 16×16 pixels), converts each patch into a vector, and prepends a learnable <strong>CLS token</strong>. All tokens interact through <strong>self-attention</strong> across multiple layers, where each token can attend to every other token to aggregate information. At the output, CLS serves as a global image summary (used for classification), while patch tokens retain spatial information (used for segmentation and correspondence). This global–spatial distinction underlies the experiments below.</p>

<figure class="howtocv-figure">
  <img src="/assets/images/howtocv/approach.svg" alt="Experimental approach: ViT architecture with CLS and register tokens, showing ablation conditions" style="max-width: 100%; width: 700px; background: white; border-radius: 4px; padding: 0.5rem;" />
  <figcaption><strong>Figure 1.</strong> Our setup. A ViT processes an image as CLS + register + patch tokens. We replace CLS or register outputs with zeros, dataset means, noise, or cross-image shuffled values, then measure impact on global tasks (classification, retrieval) and dense tasks (correspondence, segmentation).</figcaption>
</figure>

<p>Patch features are rich and spatially structured. Below, they are projected onto their first three principal components (mapped to RGB):</p>

<div id="pca-gallery" class="viz-gallery">
  <h4>Interactive: PCA patch features</h4>
  <div class="viz-controls">
    <span class="label">Image:</span>
    <button class="viz-img-btn active" data-img="img0">Fish</button>
    <button class="viz-img-btn" data-img="img1">Bird</button>
    <button class="viz-img-btn" data-img="img2">Dog</button>
    <button class="viz-img-btn" data-img="img3">Building</button>
    <button class="viz-img-btn" data-img="img4">Food</button>
  </div>
  <div class="viz-grid"></div>
</div>
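
<p>For readers who want to reproduce this kind of view, here is a minimal NumPy sketch of a PCA-to-RGB projection. It is an illustration under assumptions, not the code behind the gallery: the function name <code>pca_rgb</code>, the 16×16 patch grid, the 384-dimensional features, and the choice to fit PCA on a single image are all placeholders, and the random array stands in for real patch tokens.</p>

```python
import numpy as np

def pca_rgb(patch_feats: np.ndarray, grid_hw: tuple) -> np.ndarray:
    """Project (num_patches, dim) features onto 3 principal components, scaled to [0, 1]."""
    h, w = grid_hw
    centered = patch_feats - patch_feats.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T                      # (num_patches, 3)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-8)   # min-max scale each channel
    return rgb.reshape(h, w, 3)

rng = np.random.default_rng(0)
feats = rng.normal(size=(16 * 16, 384))  # hypothetical stand-in for one image's patch tokens
image = pca_rgb(feats, (16, 16))
print(image.shape, float(image.min()), float(image.max()))
```

<p>Fitting the components per image keeps the sketch short; fitting them once across many images would make colors comparable between images.</p>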

<h3 id="the-artifact-problem-and-register-tokens">The artifact problem and register tokens</h3>

<p><a href="https://arxiv.org/abs/2309.16588">Darcet et al. (2024)</a> found that large self-supervised ViTs produce <strong>high-norm artifact tokens</strong> in low-information regions – patches over sky, water, or uniform backgrounds develop anomalously large activations that distort downstream feature maps. Their fix: append 4 learnable <strong>register tokens</strong> to the input. Registers participate in attention at every layer, but their outputs are discarded after the final layer; they absorb the artifact computation and leave patch tokens clean.</p>

<p>DINOv3 (<a href="https://arxiv.org/abs/2508.10104">Simeoni et al., 2025</a>) builds on this with <strong>Gram anchoring</strong> – a training objective that encourages patches to preserve their pairwise spatial relationships. The combination of registers and Gram anchoring produces the current state-of-the-art for dense features. We set out to determine what functional role registers play in this configuration.</p>

<p>The norm heatmaps below illustrate the artifact problem: bright regions indicate high-norm patches. Compare DINOv2 (artifacts in uniform regions) with register-equipped models:</p>

<div id="norm-gallery" class="viz-gallery">
  <h4>Interactive: Patch norm heatmaps</h4>
  <div class="viz-controls">
    <span class="label">Image:</span>
    <button class="viz-img-btn active" data-img="img0">Fish</button>
    <button class="viz-img-btn" data-img="img1">Bird</button>
    <button class="viz-img-btn" data-img="img2">Dog</button>
    <button class="viz-img-btn" data-img="img3">Building</button>
    <button class="viz-img-btn" data-img="img4">Food</button>
  </div>
  <div class="viz-grid"></div>
</div>

<hr />

<h2 id="zero-ablation-results">Zero-Ablation Results</h2>

<p>We zeroed CLS or register tokens at every transformer layer and measured the downstream impact on four tasks:</p>

<table>
  <thead>
    <tr>
      <th>Task</th>
      <th>Type</th>
      <th>Metric</th>
      <th>What it measures</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Classification</strong></td>
      <td>Global</td>
      <td>Top-1 accuracy</td>
      <td>Can a linear classifier read object identity from CLS?</td>
    </tr>
    <tr>
      <td><strong>kNN retrieval</strong></td>
      <td>Global</td>
      <td>Recall@1</td>
      <td>Can CLS find the most similar image?</td>
    </tr>
    <tr>
      <td><strong>Correspondence</strong></td>
      <td>Dense</td>
      <td>Accuracy</td>
      <td>Can patches match the same object part across images?</td>
    </tr>
    <tr>
      <td><strong>Segmentation</strong></td>
      <td>Dense</td>
      <td>mIoU</td>
      <td>Can patches assign correct semantic labels?</td>
    </tr>
  </tbody>
</table>

<p>The interactive heatmap below summarizes the full ablation results. Toggle between raw accuracy and delta-from-baseline:</p>

<div id="heatmap-container" class="howtocv-interactive">
  <div class="howtocv-controls">
    <label class="howtocv-toggle">
      <input type="checkbox" id="heatmap-delta-toggle" />
      <span>Show as delta from full model</span>
    </label>
  </div>
  <div id="heatmap-chart"></div>
  <noscript>
    <p class="howtocv-fallback">Interactive chart requires JavaScript. Summary: zeroing CLS devastates DINOv2 across all tasks but barely affects DINOv2+reg and DINOv3 on dense tasks. Zeroing registers devastates DINOv3 across the board.</p>
  </noscript>
</div>

<h3 id="cls-zeroing-dense-tasks-are-buffered">CLS zeroing: dense tasks are buffered</h3>

<p>In DINOv2 (no registers), zeroing CLS is catastrophic across all tasks: classification drops from 73.2% to 0.1%, correspondence falls 15.9 pp, segmentation falls 37.1 pp.</p>

<p>With registers present, the pattern is markedly different. CLS zeroing still eliminates classification (the linear probe reads from CLS, so this is expected), but dense tasks are largely unaffected:</p>

<ul>
  <li>DINOv2+reg correspondence: 69.1% → 68.3% (−0.8 pp)</li>
  <li>DINOv3 segmentation: 78.5% → 78.5% (0.0 pp)</li>
</ul>

<p>Registers have absorbed the role CLS previously played for spatial features. Patch representations no longer depend on CLS.</p>

<h3 id="register-zeroing-everything-collapses">Register zeroing: everything collapses</h3>

<p>Zeroing registers produces large drops, especially in DINOv3:</p>

<ul>
  <li>Classification: 62.0% → 25.4% (−36.6 pp)</li>
  <li>Segmentation: 78.5% → 47.6% (−30.9 pp)</li>
  <li>Correspondence: 78.9% → 57.8% (−21.1 pp)</li>
</ul>

<p>Taken at face value, registers appear to carry critical information that the network depends on. This interpretation, however, does not hold up.</p>

<p>You can see the ablation effects directly in patch PCA features. Compare “Full” with “Zero CLS” (barely changes) and “Zero Registers” (collapses):</p>

<div id="ablation-gallery" class="viz-gallery">
  <h4>Interactive: Ablation PCA features</h4>
  <div class="viz-controls">
    <span class="label">Model:</span>
    <button class="viz-model-btn" data-model="dinov2_no_reg">DINOv2</button>
    <button class="viz-model-btn" data-model="dinov2_reg">DINOv2+reg</button>
    <button class="viz-model-btn active" data-model="dinov3_vits16">DINOv3</button>
  </div>
  <div class="viz-controls">
    <span class="label">Image:</span>
    <button class="viz-img-btn active" data-img="img0">Fish</button>
    <button class="viz-img-btn" data-img="img1">Bird</button>
    <button class="viz-img-btn" data-img="img2">Dog</button>
    <button class="viz-img-btn" data-img="img3">Building</button>
    <button class="viz-img-btn" data-img="img4">Food</button>
  </div>
  <div class="viz-grid"></div>
</div>

<hr />

<h2 id="register-content-is-fungible">Register Content is Fungible</h2>

<p>The problem is straightforward: a zero vector is something the network never encountered during training. Register tokens start as fixed learned embeddings, then are shaped by 12 layers of self-attention with the image’s patches. By the final layer, they occupy a characteristic activation distribution – specific means, variances, and covariance structure. The zero vector sits far outside this distribution.</p>

<p>Zeroing registers therefore does not simply remove information – it injects an input that is far out-of-distribution relative to what the network learned to process. This corrupts the attention computation, which corrupts the next layer’s input, which cascades through every remaining layer.</p>

<p>To test whether the drops reflect genuine content dependence or just distributional disruption, we ran three replacement controls:</p>

<ul>
  <li><strong>Mean substitution</strong>: Replace register outputs at each layer with the per-layer dataset-mean activation (calibrated on 5,000 images). Stays on-manifold, removes image-specific content.</li>
  <li><strong>Noise substitution</strong>: Replace with Gaussian noise matched in per-dimension mean and variance. Right marginal statistics, no learned structure.</li>
  <li><strong>Cross-image shuffling</strong>: Swap register activations across images in the batch, independently at each layer. These are <em>real</em> register values from <em>real</em> images – just the wrong images.</li>
</ul>
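
<p>The zeroing baseline and the three controls can be sketched on a single layer's register activations as follows. This is a minimal NumPy illustration, not the experiment code: the function names, the array shapes (8 images, 4 registers, 64 dimensions), and the synthetic calibration set are assumptions, and a real run would apply the replacement inside the forward pass at every layer. The cyclic shift used for shuffling is one simple way to guarantee that no image keeps its own registers.</p>

```python
import numpy as np

def zero_ablate(regs):
    """Replace every register activation with the zero vector."""
    return np.zeros_like(regs)

def mean_substitute(regs, dataset_mean):
    """Replace each image's registers with the dataset-mean activation."""
    return np.broadcast_to(dataset_mean, regs.shape).copy()

def noise_substitute(regs, mean, std, rng):
    """Draw Gaussian noise matched in per-dimension mean and variance."""
    return rng.normal(loc=mean, scale=std, size=regs.shape)

def shuffle_across_images(regs, rng):
    """Give each image the registers computed for a different image in the batch."""
    batch = regs.shape[0]
    shift = int(rng.integers(1, batch))   # cyclic shift in [1, batch - 1]
    return np.roll(regs, shift, axis=0)

rng = np.random.default_rng(0)
# Hypothetical stand-ins: (batch, num_registers, dim) activations for one layer.
regs = rng.normal(loc=2.0, scale=0.5, size=(8, 4, 64))
calib = rng.normal(loc=2.0, scale=0.5, size=(5000, 4, 64))  # calibration images
mu, sigma = calib.mean(axis=0), calib.std(axis=0)

for name, out in [
    ("zero", zero_ablate(regs)),
    ("mean", mean_substitute(regs, mu)),
    ("noise", noise_substitute(regs, mu, sigma, rng)),
    ("shuffle", shuffle_across_images(regs, rng)),
]:
    # The mean vector norm shows how far each replacement sits from the original scale.
    print(name, out.shape, round(float(np.linalg.norm(out, axis=-1).mean()), 2))
```

<p>Only the zero condition changes the scale of the activations outright; the other three keep the replacement close to the statistics of real register outputs.</p>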

<p>If models depend on register <em>content</em>, all three should degrade performance. If they depend only on register <em>presence</em>, plausible replacements should work fine.</p>

<p>All three preserve performance:</p>

<table>
  <thead>
    <tr>
      <th>Condition</th>
      <th>Classif. (v2+R / v3)</th>
      <th>Corr. (v2+R / v3)</th>
      <th>Seg. (v2+R / v3)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Full</td>
      <td>67.3 / 62.0</td>
      <td>69.1 / 78.9</td>
      <td>71.3 / 78.5</td>
    </tr>
    <tr>
      <td>Zero registers</td>
      <td>48.4 / 25.4</td>
      <td>64.3 / 57.8</td>
      <td>61.7 / 47.6</td>
    </tr>
    <tr>
      <td>Mean-sub</td>
      <td>67.0 / 62.1</td>
      <td>68.8 / 78.8</td>
      <td>71.6 / 78.6</td>
    </tr>
    <tr>
      <td>Noise-sub</td>
      <td>67.0 / 62.0</td>
      <td>68.7 / 78.7</td>
      <td>71.5 / 78.6</td>
    </tr>
    <tr>
      <td>Shuffle</td>
      <td>67.8 / 62.0</td>
      <td>68.5 / 78.6</td>
      <td>71.2 / 78.6</td>
    </tr>
  </tbody>
</table>

<p>Only zeroing causes catastrophic drops. Every plausible replacement preserves every task within ~1 pp.</p>

<p>The shuffle condition is the strongest test. By layer 11, registers have been shaped by 12 layers of attention with a specific image’s patches – they have been conditioned on that image’s content through the entire forward pass. Yet swapping in registers conditioned on <em>completely different images</em> does not degrade any task. Despite 12 layers of image-specific conditioning, the resulting register content is dispensable.</p>

<p><strong>CLS, by contrast, is not fungible.</strong> Mean-substituting CLS yields 0.1% classification – the same as zeroing. CLS content is genuinely image-specific and functionally necessary. The fungibility is specific to registers.</p>

<p><a href="https://arxiv.org/abs/2506.08010">Jiang et al. (2025)</a> showed that even <em>untrained</em> register tokens suffice for artifact removal. We extend this finding: even in models <em>trained with</em> registers, the per-image content they develop through 12 layers of attention is unnecessary for all standard downstream tasks.</p>

<hr />

<h2 id="why-zeroing-is-misleading">Why Zeroing is Misleading</h2>

<p>To see <em>why</em> only zeroing causes damage, we measured <strong>Jensen-Shannon divergence</strong> between full and ablated attention patterns at every layer.</p>

<p>Register zeroing causes cascading divergence that amplifies layer by layer: in DINOv3, JS divergence starts at 0.00 at layer 0 (identical input, no difference yet) and grows to 0.18 by layer 11. Mean-substitution stays below 0.005 at every layer. That’s a ~250x gap.</p>
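
<p>As a reference for the metric itself, here is a small NumPy sketch of Jensen-Shannon divergence between two sets of attention rows. It assumes natural logarithms (so the maximum is ln 2 ≈ 0.693) and averages over rows; the post does not state which log base or aggregation was used, and the logits below are synthetic.</p>

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) along the last axis."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum(axis=-1, keepdims=True)
    q = q / q.sum(axis=-1, keepdims=True)
    m = 0.5 * (p + q)

    def kl(a, b):
        # Entries with a == 0 contribute 0 by convention.
        ratio = np.where(a > 0, a / np.maximum(b, eps), 1.0)
        return np.sum(a * np.log(ratio), axis=-1)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))            # hypothetical attention logits: 6 queries, 10 keys
full = softmax(logits)
perturbed = softmax(logits + 0.1 * rng.normal(size=logits.shape))

print(js_divergence(full, full).mean())       # identical attention patterns
print(js_divergence(full, perturbed).mean())  # mildly perturbed patterns
print(js_divergence([1.0, 0.0], [0.0, 1.0]), np.log(2))  # disjoint patterns hit the upper bound
```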

<figure class="howtocv-figure">
  <img src="/assets/images/howtocv/fig_attention_rewiring.svg" alt="JS divergence under ablation across layers" style="max-width: 100%; width: 700px;" />
  <figcaption><strong>Figure 2.</strong> Why zeroing is misleading. (a) JS divergence vs. layer: register zeroing (solid) causes cascading divergence as the OOD zero vector compounds through layers; mean-substitution (dashed) preserves attention patterns. Lighter lines show ViT-B scale. (b) CLS attention redistribution when registers are zeroed.</figcaption>
</figure>

<p>Per-patch cosine similarity confirms this pattern. Under plausible replacements, each patch’s features have 0.95–0.999 cosine similarity to the unmodified condition – a genuine perturbation, but a small one. Under zeroing, cosine drops to ~0.6. The zero vector doesn’t just remove register content; it breaks the entire downstream computation.</p>
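
<p>The per-patch comparison is simple to state in code. The sketch below is illustrative only: the function name, shapes, and noise levels are assumptions chosen to show a small and a large perturbation, not values from the experiments.</p>

```python
import numpy as np

def per_patch_cosine(a, b, eps=1e-8):
    """Cosine similarity between matching rows of two (num_patches, dim) arrays."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / np.maximum(den, eps)

rng = np.random.default_rng(0)
full = rng.normal(size=(256, 384))                 # stand-in for unmodified patch features
small = full + 0.1 * rng.normal(size=full.shape)   # mild perturbation
large = full + 1.5 * rng.normal(size=full.shape)   # severe perturbation

print(per_patch_cosine(full, small).mean(), per_patch_cosine(full, large).mean())
```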

<figure class="howtocv-figure">
  <img src="/assets/images/howtocv/correspondence/qualitative.svg" alt="Correspondence under ablation" style="max-width: 100%; width: 900px;" />
  <figcaption><strong>Figure 3.</strong> Correspondence results. Top: full model (green = correct). Middle: zero CLS – matches preserved. Bottom: zero registers – spatial matching collapses.</figcaption>
</figure>

<hr />

<h2 id="what-holds-under-proper-controls">What Holds Under Proper Controls</h2>

<p>Not everything is an artifact of zeroing. Three findings hold up under proper controls.</p>

<h3 id="registers-buffer-dense-features-from-cls">Registers buffer dense features from CLS</h3>

<p>The CLS-zeroing asymmetry doesn’t depend on register ablation, so it’s a genuine architectural effect. Without registers, zeroing CLS collapses segmentation by 37.1 pp. With registers, the drop is &lt;1 pp. Registers have absorbed the global computation that patches used to get from CLS, freeing the patches for spatial encoding. This is the clearest evidence that registers reshape information flow.</p>

<h3 id="registers-compress-patch-geometry">Registers compress patch geometry</h3>

<p>Under the full (unablated) condition, adding registers reduces the effective rank of the patch-to-patch Gram matrix from 13.5 (DINOv2) to 8.7 (DINOv2+reg) – a 36% compression. DINOv3 compresses further to 4.0.</p>
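
<p>One common definition of effective rank is the exponential of the Shannon entropy of the normalized eigenvalues. The sketch below uses that definition as an assumption, since the post does not say which variant was used, and applies it to synthetic features rather than real patch tokens. The last line checks the 36% figure from the numbers above.</p>

```python
import numpy as np

def effective_rank(patch_feats, eps=1e-12):
    """exp(entropy) of the normalized eigenvalues of the patch-to-patch Gram matrix."""
    gram = patch_feats @ patch_feats.T              # (num_patches, num_patches)
    eig = np.clip(np.linalg.eigvalsh(gram), 0.0, None)
    p = eig / eig.sum()
    p = p[p > eps]                                  # drop numerically zero eigenvalues
    return float(np.exp(-np.sum(p * np.log(p))))

rng = np.random.default_rng(0)
# Hypothetical features: one set confined to 4 directions, one unconstrained.
low = rng.normal(size=(256, 4)) @ rng.normal(size=(4, 384))
high = rng.normal(size=(256, 384))
print(effective_rank(low), effective_rank(high))

# Relative compression from 13.5 (DINOv2) to 8.7 (DINOv2+reg), in percent.
print(round(100 * (13.5 - 8.7) / 13.5))  # 36
```

<p>Under this definition a matrix whose variance is spread evenly over k directions has effective rank k, and concentrating variance in fewer directions pulls the value down.</p>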

<figure class="howtocv-figure">
  <img src="/assets/images/howtocv/fig2_combined.svg" alt="Heatmap, effective rank, and eigenspectrum" style="max-width: 100%; width: 700px;" />
  <figcaption><strong>Figure 4.</strong> (a) Task × ablation delta heatmap. (b) Effective rank: registers compress patch geometry; DINOv3 exhibits the most compression. (c) Eigenspectrum in log scale – DINOv3 concentrates variance into fewer directions. All ViT-S.</figcaption>
</figure>

<p>DINOv3 simultaneously differs in patch size, positional encoding (RoPE), and distillation recipe, so we can’t attribute the extra compression solely to Gram anchoring. But the trend is clear: register-equipped models produce lower-dimensional, more structured patch representations.</p>

<h3 id="attention-routing-scales-with-register-dependence">Attention routing scales with register dependence</h3>

<p>CLS directs 17.9% of its last-layer attention to registers in DINOv2+reg and 29.1% in DINOv3. This tracks the increasing register-zeroing sensitivity we observed. The interactive below traces attention flow across all 12 layers:</p>

<div id="attention-container" class="howtocv-interactive">
  <h3 class="howtocv-section-title">Interactive: Attention flow across layers</h3>
  <p>Use the slider to see how attention mass redistributes across layers. Watch how patches progressively attend more to registers in DINOv3.</p>
  <div class="howtocv-controls">
    <div class="howtocv-btn-group" id="attention-model-select">
      <button class="howtocv-btn" data-model="dinov2_no_reg">DINOv2</button>
      <button class="howtocv-btn active" data-model="dinov2_reg">DINOv2+reg</button>
      <button class="howtocv-btn" data-model="dinov3_vits16">DINOv3</button>
    </div>
    <div class="howtocv-slider-row">
      <label for="attention-layer-slider">Layer:</label>
      <input type="range" id="attention-layer-slider" min="0" max="11" value="0" step="1" />
      <span id="attention-layer-label">0</span>
      <button class="howtocv-btn howtocv-btn-sm" id="attention-play-btn">&#9654; Play</button>
    </div>
  </div>
  <div id="attention-chart"></div>
  <noscript>
    <img src="/assets/images/howtocv/attention_flow_static.png" alt="Attention flow across layers" style="max-width: 100%;" />
  </noscript>
</div>

<h3 id="all-findings-replicate-at-vit-b-scale">All findings replicate at ViT-B scale</h3>

<p>We ran the full experiment suite with ViT-B backbones. Ablation delta patterns are nearly identical:</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Condition</th>
      <th>Classif.</th>
      <th>Corr.</th>
      <th>Seg.</th>
      <th>SPair</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>DINOv2-B</td>
      <td>Full</td>
      <td>78.7</td>
      <td>72.9</td>
      <td>72.0</td>
      <td>41.2</td>
    </tr>
    <tr>
      <td> </td>
      <td>Zero CLS</td>
      <td>0.1 (−78.6)</td>
      <td>58.9 (−14.0)</td>
      <td>46.1 (−25.9)</td>
      <td>21.3 (−19.9)</td>
    </tr>
    <tr>
      <td>DINOv2-B+reg</td>
      <td>Full</td>
      <td>74.5</td>
      <td>71.2</td>
      <td>72.3</td>
      <td>41.1</td>
    </tr>
    <tr>
      <td> </td>
      <td>Zero CLS</td>
      <td>0.1 (−74.4)</td>
      <td>70.4 (−0.8)</td>
      <td>72.3 (0.0)</td>
      <td>41.2 (+0.1)</td>
    </tr>
    <tr>
      <td> </td>
      <td>Zero Reg</td>
      <td>55.2 (−19.3)</td>
      <td>63.3 (−7.9)</td>
      <td>64.1 (−8.2)</td>
      <td>28.8 (−12.3)</td>
    </tr>
    <tr>
      <td>DINOv3-B</td>
      <td>Full</td>
      <td>73.3</td>
      <td>77.1</td>
      <td>83.4</td>
      <td>37.9</td>
    </tr>
    <tr>
      <td> </td>
      <td>Zero CLS</td>
      <td>0.1 (−73.2)</td>
      <td>79.5 (+2.4)</td>
      <td>82.8 (−0.6)</td>
      <td>37.8 (−0.1)</td>
    </tr>
    <tr>
      <td> </td>
      <td>Zero Reg</td>
      <td>36.8 (−36.5)</td>
      <td>61.3 (−15.8)</td>
      <td>59.6 (−23.8)</td>
      <td>19.1 (−18.8)</td>
    </tr>
  </tbody>
</table>

<p>DINOv3-B loses 36.5 pp of classification accuracy under register zeroing (vs. 36.6 pp at ViT-S), and the CLS-buffering asymmetry holds. Paired permutation tests (10,000 permutations) confirm: DINOv3 vs. DINOv2+reg register-zeroing sensitivity, <em>p</em> &lt; 0.001; CLS-buffering effect on segmentation, <em>p</em> &lt; 0.001; ViT-S vs. ViT-B register dependence consistency, <em>p</em> = 0.80 (not significant, as expected for scale replication).</p>
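
<p>For readers unfamiliar with the test, a paired permutation test can be sketched as a sign-flip test on paired differences. This is a generic illustration, not the analysis code: what constitutes a pair (per image, per class, or something else) is not stated in the post, and the data below are synthetic.</p>

```python
import numpy as np

def paired_permutation_test(x, y, n_perm=10_000, seed=0):
    """Two-sided sign-flip permutation test on the mean paired difference."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    observed = abs(diff.mean())
    # Under the null, each paired difference is equally likely to have either sign.
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    null = np.abs((signs * diff).mean(axis=1))
    # The add-one correction keeps the estimated p-value strictly positive.
    return float((1 + np.sum(null >= observed)) / (n_perm + 1))

rng = np.random.default_rng(1)
# Hypothetical paired accuracy drops for two models over 200 evaluation units.
drop_a = rng.normal(loc=0.37, scale=0.10, size=200)
drop_b = rng.normal(loc=0.19, scale=0.10, size=200)
same = drop_a + rng.normal(scale=0.10, size=200)

print(paired_permutation_test(drop_a, drop_b))  # clearly different means
print(paired_permutation_test(drop_a, same))    # same mean, independent noise
```

<p>With 10,000 permutations and the add-one correction, the smallest reportable value is 1/10,001, so a result of <em>p</em> &lt; 0.001 is within the resolution of the test.</p>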

<figure class="howtocv-figure">
  <img src="/assets/images/howtocv/fig7_scale_comparison.svg" alt="ViT-S vs ViT-B scale comparison" style="max-width: 100%; width: 700px;" />
  <figcaption><strong>Figure 5.</strong> Scale comparison. Solid = ViT-S, dashed = ViT-B. (a) Classification and (b) segmentation under ablation. The patterns are consistent across scales.</figcaption>
</figure>

<p>These three findings – CLS buffering, geometric compression, and attention routing – characterize the structural role registers play. They hold regardless of what specific activations occupy the register slots.</p>

<hr />

<details>
  <summary><strong>Register specialists (click to expand)</strong></summary>

  <p><strong>Caveat upfront:</strong> The substitution controls show that the decodable content described here is not functionally necessary. Class information is present in individual registers, but models don’t need it for any measured task. These patterns characterize representational structure, not functional dependence.</p>

  <h3 id="dinov2reg-r2-is-the-specialist">DINOv2+reg: R2 is the specialist</h3>

  <p>Register R2 stands apart. Its nearest-neighbor patches are dominated by dark, low-information regions – borders, shadows, uniform backgrounds. Its cosine similarity to other registers is just 0.11, far below the 0.87–0.90 range among R1/R3/R4. When R2 alone is zeroed, classification drops 4.9 pp; zeroing any other single register barely matters (&lt;0.2 pp). R2 handles artifact absorption. R1, R3, and R4 are semantic generalists – their nearest-neighbor patches include object parts and textures, and they carry comparable classification information (61–64% each).</p>

  <h3 id="dinov3-the-inversion">DINOv3: the inversion</h3>

  <p>DINOv3 inverts this pattern. R3 becomes the semantic specialist – probe accuracy of 50.5%, far above R1 (4.1%) and R2 (15.2%). R1, R2, and R4 match to low-level patches: ground textures, dark backgrounds, homogeneous regions. DINOv3’s training recipe, which includes Gram anchoring among other changes, appears to have reorganized how the network distributes computation across registers.</p>

  <figure class="howtocv-figure">
  <img src="/assets/images/howtocv/fig3_registers_combined.svg" alt="Per-register analysis" style="max-width: 100%; width: 700px;" />
  <figcaption><strong>Figure 6.</strong> (a) Per-register classification accuracy. (b–c) Pairwise cosine similarity – DINOv2+reg R2 is structurally distinct. (d–e) Per-register lesion effects (note: zeroing is an OOD intervention).</figcaption>
</figure>

  <div id="gallery-container" class="howtocv-interactive">
  <h3 class="howtocv-section-title">Interactive: Register nearest-neighbor gallery</h3>
  <p>Select a model and register to see which image patches are most similar to each register's learned representation.</p>
  <div class="howtocv-controls">
    <div class="howtocv-btn-group" id="gallery-model-select">
      <button class="howtocv-btn active" data-model="dinov2_reg">DINOv2+reg</button>
      <button class="howtocv-btn" data-model="dinov3_vits16">DINOv3</button>
    </div>
    <div class="howtocv-btn-group" id="gallery-reg-select">
      <button class="howtocv-btn active" data-reg="r1">R1</button>
      <button class="howtocv-btn" data-reg="r2">R2</button>
      <button class="howtocv-btn" data-reg="r3">R3</button>
      <button class="howtocv-btn" data-reg="r4">R4</button>
    </div>
  </div>
  <div id="gallery-grid"></div>
  <noscript>
    <img src="/assets/images/howtocv/register_nn_static.png" alt="Register nearest-neighbor vocabulary" style="max-width: 100%;" />
  </noscript>
</div>

  <p>Registers develop structured, differentiated representations – but as the controls in the main text show, none of this content is functionally necessary for downstream tasks.</p>

</details>

<details>
  <summary><strong>Temporal dynamics: when do registers matter? (click to expand)</strong></summary>

  <h3 id="attention-routing-precedes-semantic-content">Attention routing precedes semantic content</h3>

  <p>We traced two signals across all 12 layers: attention mass flowing to registers and classification information in each register (via linear probes). These two signals are dissociated.</p>

  <p>Patches start attending to registers from mid-layers onward, building gradually. But semantic content emerges abruptly at layers 10–11. All tokens carry near-random classification accuracy through layer 8 (&lt;6% for DINOv2+reg, &lt;14% for DINOv3). Then accuracy jumps sharply. The attention routing infrastructure gets built several layers before any semantic content appears.</p>

  <figure class="howtocv-figure">
  <img src="/assets/images/howtocv/fig4_attention.svg" alt="CLS attention distribution" style="max-width: 100%; width: 700px;" />
  <figcaption><strong>Figure 7.</strong> (a) CLS attention fraction per token type. DINOv2+reg: 17.9% to registers; DINOv3: 29.1%. (b) Per-register breakdown.</figcaption>
</figure>

  <div id="probe-container" class="howtocv-interactive">
  <h3 class="howtocv-section-title">Interactive: Layer-wise register probing</h3>
  <p>Drag the slider to see classification accuracy at each layer. Note the sharp jump at layers 10–11.</p>
  <div id="probe-chart"></div>
  <noscript>
    <img src="/assets/images/howtocv/layer_probe_static.png" alt="Layer-wise probing results" style="max-width: 100%;" />
  </noscript>
</div>

  <p>Individual registers follow different trajectories. DINOv3’s R1 peaks at layer 10 then <em>drops</em> at layer 11, despite receiving the most attention. This suggests a transient computation buffer – it temporarily holds information before passing it along, rather than accumulating a final answer. R3, by contrast, rises monotonically through layer 11, acting as a semantic accumulator.</p>

  <figure class="howtocv-figure">
  <img src="/assets/images/howtocv/fig5_layer_sweep.svg" alt="Layer-wise task performance" style="max-width: 100%; width: 700px;" />
  <figcaption><strong>Figure 8.</strong> (a) CLS classification across layers – near-random until layer 8, then rises steeply. (b) Correspondence peaks at mid-layers then declines, except DINOv3 which maintains 78.9% at layer 11.</figcaption>
</figure>

  <figure class="howtocv-figure">
  <img src="/assets/images/howtocv/fig_gram_dependence.svg" alt="Gram compression and register dependence across layers" style="max-width: 100%; width: 700px;" />
  <figcaption><strong>Figure 9.</strong> (a) Effective rank across layers: geometric compression is present by layer 6 in all register-equipped models. (b) CLS accuracy (Full vs. Zero-reg) across layers: register dependence emerges abruptly at layers 10–11, well after compression is established.</figcaption>
</figure>
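
  <p>Figure 9 reports effective rank without restating its definition in this section. One common choice, assumed for the sketch below, is the exponential of the Shannon entropy of the normalized singular values; whether the figure uses exactly this variant is not confirmed here. The function takes singular values as plain numbers (in practice they would come from an SVD of the feature or Gram matrix), and the name <code>effective_rank</code> is illustrative.</p>

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank: exp(H(p)) with p_i = s_i / sum(s).

    Assumption: this is one common definition, not necessarily the one
    used for the figure.
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values]
    # Zero singular values contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in probs if p != 0.0)
    return math.exp(entropy)


# A flat spectrum uses all directions equally; a peaked one is compressed.
flat = effective_rank([1.0, 1.0, 1.0, 1.0])
peaked = effective_rank([10.0, 0.1, 0.1, 0.1])
print(flat, peaked)
```

  <p>Under this definition, a lower value means the representation concentrates its variance in fewer directions, which is the sense of "geometric compression" in panel (a).</p>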

  <p>Attention routing to registers is established well before semantic content appears, consistent with registers serving as structural placeholders rather than content-specific processors.</p>

</details>

<details>
  <summary><strong>Cumulative lesions and negative controls (click to expand)</strong></summary>

  <h3 id="non-additive-effects">Non-additive effects</h3>

  <p>Zeroing registers one at a time produces modest drops, but zeroing all four causes a collapse far exceeding the sum of the individual effects. For DINOv2+reg, the individual deltas sum to −5.2 pp against a collective drop of −18.9 pp; for DINOv3, the sum is −7.0 pp against −36.6 pp. This is consistent with zeroing being a disproportionately destructive intervention that compounds across token positions: the OOD disruption from zeroing one register is modest, but zeroing all four creates a much larger distributional shift.</p>

  <figure class="howtocv-figure">
  <img src="/assets/images/howtocv/cumulative_lesion.svg" alt="Cumulative register lesion" style="max-width: 100%; width: 600px;" />
  <figcaption><strong>Figure 10.</strong> Cumulative register lesion: zeroing registers one at a time. Solid = actual, dashed = additive prediction. Both models show super-additive degradation.</figcaption>
</figure>
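
  <p>The size of the super-additive gap follows directly from the numbers quoted above. The short Python sketch below only redoes that arithmetic; the helper name <code>super_additivity</code> and the dictionary layout are illustrative choices, not part of the original analysis.</p>

```python
# Percentage-point deltas quoted in the text above.
results = {
    "DINOv2+reg": {"sum_individual": -5.2, "collective": -18.9},
    "DINOv3": {"sum_individual": -7.0, "collective": -36.6},
}


def super_additivity(sum_individual, collective):
    """Return (excess drop beyond the additive prediction, collective/additive ratio)."""
    excess = collective - sum_individual
    ratio = collective / sum_individual
    return excess, ratio


for name, r in results.items():
    excess, ratio = super_additivity(r["sum_individual"], r["collective"])
    print(f"{name}: excess = {excess:.1f} pp, ratio = {ratio:.1f}x")
```

  <p>The collective drop is roughly 3.6 times the additive prediction for DINOv2+reg and roughly 5.2 times for DINOv3.</p>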

  <h3 id="random-patch-negative-control">Random patch negative control</h3>

  <p>A natural question is whether zeroing any four tokens produces comparable damage. Zeroing four random patch tokens causes ≤1 pp drop – confirming the register effect is specific. But this specificity reflects registers’ distinct activation distribution (zeros are more OOD for registers than for patches), not necessarily unique functional content.</p>

  <figure class="howtocv-figure">
  <img src="/assets/images/howtocv/fig9_random_patch_control.svg" alt="Random patch control" style="max-width: 100%; width: 600px;" />
  <figcaption><strong>Figure 11.</strong> Negative control: zeroing 4 random patches has negligible effect vs. zeroing 4 registers.</figcaption>
</figure>

  <p>The overlays below show CLS and register attention maps for five example images under different conditions:</p>

  <div id="attn-gallery" class="viz-gallery">
  <h4>Interactive: Attention map overlays</h4>
  <div class="viz-controls">
    <span class="label">Image:</span>
    <button class="viz-img-btn active" data-img="img0">Fish</button>
    <button class="viz-img-btn" data-img="img1">Bird</button>
    <button class="viz-img-btn" data-img="img2">Dog</button>
    <button class="viz-img-btn" data-img="img3">Building</button>
    <button class="viz-img-btn" data-img="img4">Food</button>
  </div>
  <div class="viz-controls">
    <span class="label">Mode:</span>
    <button class="viz-mode-btn active" data-mode="cls">CLS attention</button>
    <button class="viz-mode-btn" data-mode="regs">Register attention</button>
  </div>
  <div class="viz-grid"></div>
</div>

  <p>The super-additive pattern and the register-specific sensitivity are both consistent with zeroing as a disproportionately destructive OOD intervention, not evidence of unique register content.</p>

</details>

<hr />

<h2 id="takeaways">Takeaways</h2>

<ol>
  <li>
    <p><strong>Don’t trust zero-ablation alone.</strong> Zeroing injects OOD inputs that cascade disruption, overstating functional dependence. Always pair it with replacement controls. Cross-image shuffling is the strongest test; mean-substitution is the simplest to implement.</p>
  </li>
  <li>
    <p><strong>Register slots matter, register content doesn’t</strong> (for standard frozen-feature tasks). The network has reorganized its computation around those slots. Any plausible activation works – dataset means, noise, wrong-image registers.</p>
  </li>
  <li>
    <p><strong>CLS content genuinely matters.</strong> Mean-substituting CLS also kills classification (0.1%). The fungibility is specific to registers, not an artifact of the controls being weak.</p>
  </li>
  <li>
    <p><strong>Registers buffer dense features from CLS dependence.</strong> This is a real architectural effect confirmed by the CLS-zeroing asymmetry (37 pp segmentation drop without registers vs. &lt;1 pp with them) – and it doesn’t depend on register ablation.</p>
  </li>
  <li>
    <p><strong>Scale-consistent.</strong> All findings replicate across ViT-S and ViT-B.</p>
  </li>
  <li>
    <p><strong>Open question.</strong> Our fungibility result covers standard frozen-feature evaluations – classification, correspondence, segmentation. Tasks requiring fine-grained register content (few-shot adaptation, generation) remain untested.</p>
  </li>
</ol>
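
<p>The replacement controls named in the first takeaway can be sketched on toy data. The Python below is an illustration under stated assumptions, not the authors' implementation: registers are represented as nested lists indexed <code>[image][register][dim]</code>, the function names are hypothetical, and in a real model these replacements would be applied to register activations inside the forward pass rather than to standalone lists.</p>

```python
import random


def zero_ablate(regs):
    """Replace every register vector with zeros (the OOD intervention)."""
    return [[[0.0] * len(vec) for vec in image] for image in regs]


def mean_substitute(regs):
    """Replace each register slot with its mean over all images."""
    n_images, n_regs, dim = len(regs), len(regs[0]), len(regs[0][0])
    means = [
        [sum(regs[i][r][d] for i in range(n_images)) / n_images for d in range(dim)]
        for r in range(n_regs)
    ]
    return [[list(means[r]) for r in range(n_regs)] for _ in range(n_images)]


def cross_image_shuffle(regs, seed=0):
    """Give each image the registers of a different image.

    Uses a cyclic shift by a random nonzero offset, so no image keeps
    its own registers. Requires at least two images.
    """
    n_images = len(regs)
    offset = random.Random(seed).randrange(1, n_images)
    return [[list(vec) for vec in regs[(i + offset) % n_images]]
            for i in range(n_images)]


# Toy batch: 3 images, 2 registers each, 2 dimensions per register.
regs = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
    [[9.0, 10.0], [11.0, 12.0]],
]
print(zero_ablate(regs)[0])
print(mean_substitute(regs)[0])
print(cross_image_shuffle(regs)[0])
```

<p>The design point is that mean substitution and cross-image shuffling keep the register slots filled with in-distribution activations, while zeroing does not; comparing task performance across the three separates slot dependence from content dependence.</p>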

<hr />

<h2 id="citation">Citation</h2>

<div class="language-bibtex highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">@article</span><span class="p">{</span><span class="nl">parodi2026zero</span><span class="p">,</span>
  <span class="na">title</span><span class="p">=</span><span class="s">{Zero-Ablation Overstates Register Function
         in {DINO} Vision Transformers}</span><span class="p">,</span>
  <span class="na">author</span><span class="p">=</span><span class="s">{Parodi, Felipe and Matelsky, Jordan K. and Segado, Melanie}</span><span class="p">,</span>
  <span class="na">year</span><span class="p">=</span><span class="s">{2026}</span><span class="p">,</span>
  <span class="na">note</span><span class="p">=</span><span class="s">{Manuscript}</span>
<span class="p">}</span>
</code></pre></div></div>

<div class="howtocv-footer">
  <p>Built with PyTorch, HuggingFace Transformers, and DINOv3. Interactive visualizations powered by Plotly.js and D3.js.</p>
</div>

<!-- Load interactive scripts -->
<script src="https://cdn.plot.ly/plotly-2.27.0.min.js"></script>

<script src="https://d3js.org/d3.v7.min.js"></script>

<script src="/assets/js/howtocv-heatmap.js"></script>

<script src="/assets/js/howtocv-probes.js"></script>

<script src="/assets/js/howtocv-attention.js"></script>

<script src="/assets/js/howtocv-gallery.js"></script>

<script src="/assets/js/howtocv-viz-gallery.js"></script>]]></content><author><name>Felipe Parodi</name></author><category term="blog" /><category term="vision-transformers" /><category term="interpretability" /><category term="DINOv3" /><category term="self-supervised-learning" /><summary type="html"><![CDATA[Zeroing register tokens suggests they are indispensable – but replacing them with noise, dataset means, or even registers from wrong images preserves every task. The zero vector is the problem, not the registers.]]></summary></entry><entry><title type="html">Awesome Computational Primatology: a community resource</title><link href="https://pipeparodi.com/blog/awesome-computational-primatology/" rel="alternate" type="text/html" title="Awesome Computational Primatology: a community resource" /><published>2024-03-11T00:00:00-07:00</published><updated>2024-03-11T00:00:00-07:00</updated><id>https://pipeparodi.com/blog/awesome-computational-primatology</id><content type="html" xml:base="https://pipeparodi.com/blog/awesome-computational-primatology/"><![CDATA[<div class="howtocv-links">
  <a href="https://kordinglab.com/awesome-computational-primatology/" class="howtocv-link-btn"><i class="fas fa-globe"></i> Website</a>
  <a href="https://github.com/KordingLab/awesome-computational-primatology" class="howtocv-link-btn"><i class="fab fa-github"></i> GitHub</a>
  <a href="https://huggingface.co/datasets/fparodi/awesome-computational-primatology" class="howtocv-link-btn">&#129303; HuggingFace</a>
</div>

<p>Understanding how primates move, communicate, and interact in their natural environments is one of the problems I care about most in biology. Since around 2011, researchers have built systems that detect primate faces, reconstruct 3D body pose from dozens of synchronized cameras, classify complex social behaviors, decode vocalizations, and generate realistic 3D avatars. The work now spans 14 topic areas, dozens of species from lemurs to great apes, and methods ranging from detection and pose estimation to facial action coding, hand tracking, species identification, and reinforcement learning.</p>

<figure>
  <img src="/assets/images/awesome-comp-primatology/loos-ernst-2013-chimp-faces.png" alt="Automated chimpanzee face detection showing detected faces and eyes marked in green across two field datasets" />
  <figcaption>Among the earliest automated chimpanzee face detection systems, with detected faces and eyes marked in green across zoo and field datasets. From <a href="https://doi.org/10.1186/1687-5281-2013-49">Loos &amp; Ernst, EURASIP J. Image Video Process. 2013</a>, CC-BY 2.0.</figcaption>
</figure>

<p>To help the community navigate this growing literature, we built <a href="https://kordinglab.com/awesome-computational-primatology/"><strong>Awesome Computational Primatology</strong></a> (<a href="https://github.com/KordingLab/awesome-computational-primatology">GitHub</a>, <a href="https://huggingface.co/datasets/fparodi/awesome-computational-primatology">HF</a>) — a curated, open registry of 97+ papers at this intersection, with an <a href="https://kordinglab.com/awesome-computational-primatology/">AI-powered chat assistant</a> for querying the corpus in natural language.</p>

<figure>
  <img src="/assets/images/awesome-comp-primatology/openmonkeystudio-2020-3d-pose.png" alt="OpenMonkeyStudio multi-camera 3D macaque pose estimation system" style="max-width: 400px;" />
  <figcaption>OpenMonkeyStudio reconstructs 13 body landmarks in 3D from 62 synchronized cameras, enabling markerless motion capture in freely moving macaques. From <a href="https://doi.org/10.1038/s41467-020-18441-5">Bala et al., Nat. Commun. 2020</a>, CC-BY 4.0.</figcaption>
</figure>

<figure>
  <img src="/assets/images/awesome-comp-primatology/chimact-2023-behavior.png" alt="ChimpACT dataset showing annotated chimpanzee video frames with pose and behavior labels" />
  <figcaption>ChimpACT provides 160,500 annotated frames for joint detection, tracking, pose estimation, and behavior recognition in chimpanzees. From <a href="https://doi.org/10.48550/arXiv.2310.16447">Ma et al., NeurIPS 2023</a>.</figcaption>
</figure>

<p>But the diversity of approaches also shows how far we have to go. No single method, dataset, or species captures the full complexity of primate behavior — and too many models and datasets stay siloed or invisible to researchers working on related problems. That is why resources like this matter: connecting work across species, modalities, and methods so we can see where the gaps are and where open tools already exist. If you work at this intersection — or want to — we would love your contributions. Add a paper, open-source a model, share a dataset. Solving behavior understanding in primates is not something any one lab will crack alone; it will take a community building bridges across all of these approaches, and I believe this generation of researchers is up for it.</p>

<figure>
  <img src="/assets/images/awesome-comp-primatology/primateface-2025-dataset.png" alt="PrimateFace dataset showing annotated face images with 68 facial landmarks across six primate superfamilies" />
  <figcaption>PrimateFace provides 260K+ annotated face images with 68 facial landmarks across six primate superfamilies, from lemurs to humans. From <a href="https://doi.org/10.1101/2025.08.12.669927">Parodi et al., bioRxiv 2025</a>, CC-BY 4.0.</figcaption>
</figure>]]></content><author><name>Felipe Parodi</name></author><category term="blog" /><category term="computational-primatology" /><category term="deep-learning" /><category term="computer-vision" /><category term="primate-behavior" /><category term="open-science" /><summary type="html"><![CDATA[A curated, open resource cataloging 97+ papers at the intersection of deep learning and primate research — from early face recognition to cross-species behavior understanding.]]></summary></entry></feed>