We measured how AI capabilities INTERACT as models scale. Below ~3.5B parameters (for Pythia), reasoning and truthfulness fight. Above it, they cooperate. The transition is engineerable. (2 papers + interactive dashboard + 7 falsifiable predictions)

THE FINDING (Paper 1: "Lying Is Just a Phase")

Below a critical scale (~3.5B for Pythia), reasoning and truthfulness ANTICORRELATE: r = -0.989. Train the model to reason better, and it gets less truthful. This is the alignment tax.

Above that scale, they COOPERATE. The tax vanishes. Not gradually — it flips.
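
For anyone who wants to check the sign of the coupling on their own model family, here is a minimal sketch of the Pearson correlation behind an "r = ..." number. The scores below are hypothetical placeholders, not data from the paper, and the paper's exact choice of benchmarks and model grouping is not reproduced here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# HYPOTHETICAL scores for four small models, for illustration only.
reasoning = [30.0, 35.0, 41.0, 48.0]
truthfulness = [26.0, 24.5, 23.0, 21.0]

r = pearson_r(reasoning, truthfulness)
print(f"r = {r:+.3f}")  # a negative r means the two capabilities trade off
```

A negative r across models below the critical scale is the "tax" regime; a positive r is the cooperative one.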

But here's what matters for practitioners: the critical scale is a design parameter, not a constant. Three levers shift it:

Data curation: Phi at 1B achieves coupling characteristic of a 10B web-trained model. In this comparison, data quality substitutes for roughly 10x model scale.
Width: Normalizing by model width flips the correlation for ALL tested families.
Architecture: Gemma-4 at 4B matches the coupling of standard-trained models at 13B+.

Pretraining outweighs RLHF by roughly 10:1 in setting the coupling. The tax is not a property of small models; it's a property of how they were trained.

Where does the tax live? Not in the attention heads: 38/40 models have ZERO competing attention heads. The bottleneck is at the output projection, a dimensional compression artifact that wider models resolve.

Proof-of-concept intervention: Adding a truth-direction vector at the bottleneck layer (quarter-depth) corrects 60% of misaligned outputs at tax scale. Zero retraining. Zero weight modification. Works on any open-weight HuggingFace model:

git clone https://github.com/adilamin89/cape-scaling.git
cd cape-scaling
python cli/cape_steer.py --model EleutherAI/pythia-410m --prompt "The real reason..."
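
To make the idea concrete without the repo, here is a toy sketch of activation steering in plain Python. It is not the code in cape_steer.py. Two assumptions are mine: that the truth direction is the difference between mean activations on true and false statements (a common recipe, not confirmed by the post), and that "quarter-depth" means layer index n_layers // 4. All function names are hypothetical.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def truth_direction(true_acts, false_acts):
    """ASSUMPTION: unit vector from the mean 'false' to the mean 'true' activation."""
    mean_true = mean_vector(true_acts)
    mean_false = mean_vector(false_acts)
    diff = [t - f for t, f in zip(mean_true, mean_false)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("true and false activations have identical means")
    return [d / norm for d in diff]

def steer(hidden, direction, alpha):
    """Add alpha * direction to one hidden-state vector; no weights change."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def quarter_depth(n_layers):
    """ASSUMPTION: the 'quarter-depth' layer is index n_layers // 4."""
    return n_layers // 4

# Toy 2-dimensional activations, for illustration only.
true_acts = [[1.0, 0.0], [3.0, 0.0]]
false_acts = [[0.0, 0.0], [0.0, 0.0]]

direction = truth_direction(true_acts, false_acts)
steered = steer([0.5, 0.5], direction, alpha=2.0)
layer = quarter_depth(24)  # e.g. a 24-layer model
print(direction, steered, layer)
```

In a real model the same addition would be applied to the residual stream at the chosen layer during the forward pass, for example through a forward hook, with alpha tuned per model.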

THE FRONTIER (Paper 2: "Growing Pains of Frontier Models")

At frontier scale (34 models, 10 labs), capabilities cooperate (r = +0.72). But cooperation varies systematically. The h-field — each model's deviation from the cooperative trend — reveals each lab's training philosophy:

Lab	h-field	Interpretation
Google	+5.5	Reasoning-rich, consistent across ALL releases
OpenAI	+3.1	Balanced, steady ascent
DeepSeek	+1.9	Reversed from +11.2 to -4.7 (pretraining pivot)
Anthropic	-6.9	Oscillates, with coding excursions that recover within one release
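
A rough sketch of how an h-field like this can be computed, under my own assumptions: the cooperative trend is an ordinary least-squares line of GPQA on SWE-bench across all models, h is each model's residual from that line, and a lab's value is the mean residual over its models. The papers may define the trend differently. Lab names and scores below are hypothetical.

```python
from collections import defaultdict

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def h_field(xs, ys):
    """ASSUMPTION: h is the residual of each model from the fitted trend."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

def mean_by_lab(labs, values):
    """Average a per-model quantity within each lab."""
    groups = defaultdict(list)
    for lab, value in zip(labs, values):
        groups[lab].append(value)
    return {lab: sum(vs) / len(vs) for lab, vs in groups.items()}

# HYPOTHETICAL labs and scores, for illustration only.
labs = ["lab_a", "lab_a", "lab_b", "lab_b"]
swe = [40.0, 60.0, 50.0, 70.0]
gpqa = [52.0, 70.0, 55.0, 73.0]

h = h_field(swe, gpqa)
per_lab = mean_by_lab(labs, h)
print(per_lab)  # positive: more reasoning than the trend predicts
```

Under this reading, a positive h means a model scores higher on reasoning than its coding score predicts, and a negative h means the opposite.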

Per-lab coupling slopes vary 5x: Google converts each SWE-bench point into 1.15 GPQA points. DeepSeek converts at 0.23. The gap originates in pretraining, not RLHF.

The h-field is not just diagnostic — it tells you what to change. Pretraining shifts are permanent. Post-training excursions recover. Knowing which dominates determines whether to retrain or wait.

THE FRAMEWORK (connects both papers)

The same algebraic phase boundary works at every scale:

At base: TQA_c = √((a/b)·HS) classifies each model as tax or cooperative
At frontier: GPQA_c = √(0.513·SWE) does the same
At the next transition: IFEval_c = √(0.97·GPQA) — and two frontier models already fall below this boundary
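
The three boundaries share one form, critical = sqrt(k * driver), so a single helper covers them. This is a sketch, not the dashboard's code. The coefficients 0.513 and 0.97 come from the lines above; the base-model ratio a/b is not given in this post, so it stays a parameter. Which side of the boundary counts as "tax" is my assumption (below the boundary), and the input scores are hypothetical and must use whatever scale the papers use.

```python
import math

FRONTIER_COUPLING = 0.513   # GPQA_c = sqrt(0.513 * SWE)
NEXT_COUPLING = 0.97        # IFEval_c = sqrt(0.97 * GPQA)

def critical_score(coupling, driver_score):
    """Phase boundary: sqrt(coupling * driver_score)."""
    if coupling < 0 or driver_score < 0:
        raise ValueError("coupling and driver_score must be non-negative")
    return math.sqrt(coupling * driver_score)

def classify(score, coupling, driver_score):
    """ASSUMPTION: a score below the boundary is 'tax', otherwise 'cooperative'."""
    boundary = critical_score(coupling, driver_score)
    return "cooperative" if score >= boundary else "tax"

# HYPOTHETICAL scores, for illustration only.
swe = 50.0
boundary = critical_score(FRONTIER_COUPLING, swe)
print(f"GPQA_c = {boundary:.3f}")
print(classify(4.0, FRONTIER_COUPLING, swe))
print(classify(6.0, FRONTIER_COUPLING, swe))
```

For base models, call critical_score(a / b, hs) with the fitted a and b from Paper 1.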

Half of all benchmarks now exhibit saturation (Akhtar et al., 2026). Our framework gives the coupling mechanism (why it cascades) and the rotation protocol (when to switch and what to switch to).

7 falsifiable predictions with timestamped pass/fail criteria. 5 post-cutoff releases fall within our 95% prediction interval (±16.2 pp).

TRY IT

Interactive dashboard — enter your model's scores, get its phase: zehenlabs.com/cape/
Steering CLI — correct misaligned outputs on any open model: github.com/adilamin89/cape-scaling
Paper 1 — "Lying Is Just a Phase" (base models, ODE, mechanism): arXiv:2605.18838
Paper 2 — "Growing Pains of Frontier Models" (frontier, h-field, predictions): arXiv:2605.18840
Blog with steering demo: zehenlabs.com/blog/

Built on EleutherAI's Pythia. Independently confirmed by AI2's OLMo.

Everything is open — code, data, dashboard, steering tool. Happy to answer questions.

submitted by /u/adil89amin