z-lab's DFlash drafter on our QLoRA fine-tune captured 92% of the published speedup with no retraining. ~15x faster, ~4x cheaper in prod.
DFlash, Speculative Decoding, Gemma 4, vLLM, Inference, H100, QLoRA, DFO
Read More about Speculative Decoding for Free: Pairing DFlash with our DFO-Tuned Gemma 4 31B

Two natively-trained 1-bit LLMs converge on the same activation anomaly without ever sharing weights. A note on convergence under pressure.
LarQL, Interpretability, CKA, Cross-Model, Mechanistic Interpretability, Universal Constants
Read More about The Two Models That Never Met. Both Measured at the Same Depth.

Two natively-trained 1-bit LLMs lose the four-stage circuit that organizes fp16 transformers. Behavior survives; structure dissolves.
LarQL, Interpretability, Quantization, BitNet, Bonsai, Mechanistic Interpretability
Read More about When the Circuit Dissolves

A 200-item RAG arena tied at the mean, but two LLM judges only agreed at Spearman ρ=0.55. They aren't measuring the same thing.
RAG-Arena, ScoredQA, RAG Routing, EXIT, LLM-as-Judge, Spearman, Evaluation, QLoRA
Read More about Inside the RAG Arena: When the Judges Don't Agree

A LarQL patch deletes 'Paris is the capital of France' from a frozen LLM. The receipt is a 100-byte JSON file with a SHA-256 checksum.
LarQL, Interpretability, Knowledge Editing, Unlearning, Mechanistic Interpretability
Read More about Deleting Paris from a Language Model

Every modern LLM converges on the same four-stage circuit. We map the architecture and what its absence in 1-bit models means.
LarQL, Interpretability, Transformers, Machine Learning, Mechanistic Interpretability
Read More about The Architecture Every Language Model Converges To