Gemma4 and Qwen3 were trained by different organizations on different data with different architectures. Their internal representations are 99.2% similar at matched depth. Neither model knew the other existed.
LarQL · Interpretability · CKA · Cross-Model · Mechanistic Interpretability · Universal Constants
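The 99.2% figure presumably comes from a representational-similarity score at matched depth; since the tags mention CKA, here is a minimal sketch of linear CKA between two models' activations on the same prompts, assuming that is the metric used. The `get_activations` helper and the layer choices are hypothetical, not part of the article's actual pipeline.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (n_samples, d)."""
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-style similarity: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    denominator = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return numerator / denominator

# Hypothetical usage: compare residual-stream activations from two models at
# layers sitting at the same relative depth (e.g. 50% of each model's layers).
# acts_a = get_activations(model_a, prompts, layer=la)   # shape (n_tokens, d_a)
# acts_b = get_activations(model_b, prompts, layer=lb)   # shape (n_tokens, d_b)
# print(linear_cka(acts_a, acts_b))
```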
Two natively-trained 1-bit language models, from two different organizations, converge on the same anomaly: the four-stage circuit that organizes every fp16 transformer simply isn't there. Both models still answer correctly. The structure is gone, but the behavior survived.
LarQL · Interpretability · Quantization · BitNet · Bonsai · Mechanistic Interpretability
A single rank-1 weight edit suppresses one learned fact while leaving the rest of the model intact. No fine-tuning. No retraining. Just a feature subtracted from one layer's gate matrix, with a receipt.
LarQL · Interpretability · Knowledge Editing · Unlearning · Mechanistic Interpretability
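The teaser above only sketches the mechanism (a feature direction subtracted from one layer's gate matrix), so here is a minimal illustration of what a rank-1 "project out a feature" edit can look like, assuming a unit-norm direction for the fact is already known. The module path `gate_proj` and the name `fact_direction` are assumptions for illustration, not the article's actual recipe.

```python
import torch

def rank1_fact_edit(W_gate, feature, scale=1.0):
    """Subtract a rank-1 update aligned with a learned feature direction from a
    gate/projection matrix, removing the contribution the layer reads from it.

    W_gate : (d_out, d_in) weight matrix of one layer's gate projection
    feature: (d_in,) direction associated with the fact to suppress
    """
    feature = feature / feature.norm()
    # Rank-1 update (W @ f) f^T; with scale=1, (W - (W f) f^T) f = 0,
    # so the edited matrix no longer responds to the feature direction.
    update = (W_gate @ feature).unsqueeze(1) @ feature.unsqueeze(0)  # (d_out, d_in)
    return W_gate - scale * update

# Hypothetical usage: edit one MLP gate matrix in place, no fine-tuning.
# layer = model.model.layers[k].mlp.gate_proj
# layer.weight.data = rank1_fact_edit(layer.weight.data, fact_direction)
```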
I've run LarQL on 9 models from 5 organizations, from a 360M toy to OpenAI's 120B MoE. Three numbers hold within ±15% across all of them. One pattern vanishes the moment you go to 1-bit weights.
LarQL · Interpretability · Transformers · Machine Learning · Mechanistic Interpretability