Below a critical scale (~3.5B parameters for Pythia), reasoning and truthfulness ANTICORRELATE: r = -0.989. Train the model to reason better, and it gets less truthful. This is the alignment tax. Above that scale, they COOPERATE. The tax vanishes. Not gradually: it flips.

But here's what matters for practitioners: the critical scale is a design parameter, not a constant. Three levers shift it:
Pretraining contributes ~10:1 over RLHF. The tax is not a property of small models; it's a property of how they were trained.

Where does the tax live? Not inside the model. 38/40 models have ZERO competing attention heads. The bottleneck is at the output projection: a dimensional compression artifact that wider models resolve.

Proof-of-concept intervention: adding a truth-direction vector at the bottleneck layer (quarter-depth) corrects 60% of misaligned outputs at tax scale. Zero retraining. Zero weight modification. Works on any open-weight Hugging Face model:

THE FRONTIER (Paper 2: "Growing Pains of Frontier Models")

At frontier scale (34 models, 10 labs), capabilities cooperate (r = +0.72). But cooperation varies systematically. The h-field, each model's deviation from the cooperative trend, reveals each lab's training philosophy:
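One way to read the h-field definition above is as a regression residual. The following is a minimal sketch, assuming the "cooperative trend" is an ordinary least-squares line through two benchmark scores across models. The post does not specify the fit, so the choice of OLS and the names `fit_trend`, `h_field`, and `mean_h_by_lab` are illustrative assumptions, and any scores passed in are the caller's own data.

```python
from collections import defaultdict
from statistics import mean


def fit_trend(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept (assumed trend)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def h_field(xs, ys):
    """Deviation of each model from the fitted cross-model trend."""
    slope, intercept = fit_trend(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def mean_h_by_lab(labs, hs):
    """Average deviation per lab; positive means above the trend."""
    grouped = defaultdict(list)
    for lab, h in zip(labs, hs):
        grouped[lab].append(h)
    return {lab: mean(vals) for lab, vals in grouped.items()}
```

Here `xs` and `ys` would be per-model scores on two benchmarks, and `labs` the matching lab labels. A lab whose models sit consistently above or below the trend shows up as a nonzero mean deviation.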
Per-lab coupling slopes vary 5x: Google converts each SWE-bench point into 1.15 GPQA points. DeepSeek converts at 0.23. The gap originates in pretraining, not RLHF.

The h-field is not just diagnostic: it tells you what to change. Pretraining shifts are permanent. Post-training excursions recover. Knowing which dominates determines whether to retrain or wait.

THE FRAMEWORK (connects both papers)

The same algebraic phase boundary works at every scale:
Half of all benchmarks now exhibit saturation (Akhtar et al., 2026). Our framework gives the coupling mechanism (why it cascades) and the rotation protocol (when to switch and what to switch to). 7 falsifiable predictions with timestamped pass/fail criteria. 5 post-cutoff releases fall within our 95% prediction interval (±16.2 pp).

TRY IT
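For a feel of the steering intervention described above (adding a truth-direction vector at a quarter-depth layer, with no weight changes), here is a minimal PyTorch sketch. It is an illustration under stated assumptions rather than the released steering tool: the `direction` tensor is assumed to be supplied by the caller (this post does not say how it is derived), and `alpha`, `add_steering_hook`, and `quarter_depth_index` are hypothetical names and parameters.

```python
import torch
from torch import nn


def quarter_depth_index(num_layers: int) -> int:
    """Index of the layer at roughly one quarter of the stack's depth."""
    return max(0, num_layers // 4)


def add_steering_hook(layer: nn.Module, direction: torch.Tensor, alpha: float = 4.0):
    """Add alpha * unit(direction) to the layer's output on every forward pass.

    No weights are modified; calling .remove() on the returned handle
    restores the original behaviour. `alpha` is a hypothetical strength knob.
    """
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        # Some transformer blocks return a tuple whose first item is the hidden states.
        if isinstance(output, tuple):
            hidden = output[0]
            shift = alpha * unit.to(device=hidden.device, dtype=hidden.dtype)
            return (hidden + shift,) + output[1:]
        shift = alpha * unit.to(device=output.device, dtype=output.dtype)
        return output + shift

    # Returning a value from a forward hook replaces the module's output.
    return layer.register_forward_hook(hook)
```

With a Pythia checkpoint loaded via `transformers`, the decoder blocks should be reachable as `model.gpt_neox.layers` (worth verifying against your installed version), so the hook would be attached to `model.gpt_neox.layers[quarter_depth_index(len(model.gpt_neox.layers))]` before calling `generate`.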
Built on EleutherAI's Pythia. Independently confirmed by AI2's OLMo. Everything is open: code, data, dashboard, steering tool. Happy to answer questions.