Step 3.7 Flash open weights dropped TODAY and the agent reliability numbers are actually interesting

Read this release today. Some crazy numbers.

The tau2-bench number is 98% across all difficulty levels. That is the one that got me because usually these releases post a strong easy score and then quietly die at hard difficulty. This one... claims it holds.

For multi-step agent work, that actually matters more than most benchmarks. A model that drifts on step 4 of a 6-step chain is a debugging nightmare regardless of what its SWE score looks like.
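Quick back-of-envelope on why per-step reliability dominates in chains. This is only a sketch: it assumes every step succeeds independently with the same probability, which real agent runs do not, and the 98% in the release is a benchmark score, not a measured per-step rate. `chain_success` is a made-up helper name.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Chance that every step in a chain succeeds.

    Assumption: steps are independent and share one success rate.
    """
    return per_step ** steps


if __name__ == "__main__":
    # Compare a few hypothetical per-step rates on a 6-step chain.
    for p in (0.90, 0.95, 0.98):
        print(f"per-step {p:.2f} -> 6-step chain {chain_success(p, 6):.3f}")
    # per-step 0.90 -> 6-step chain 0.531
    # per-step 0.95 -> 6-step chain 0.735
    # per-step 0.98 -> 6-step chain 0.886
```

Under those assumptions, going from 90% to 98% per step takes a 6-step chain from roughly a coin flip to succeeding almost 9 times out of 10.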

Raw capability is mid: Toolathlon at 49.5, GDPval at 45.8. So this is clearly a reliability play, not a frontier capability play. Depending on your use case that is either fine or a dealbreaker.

  • 198B sparse MoE
  • 11B active
  • 400 TPS
  • 256K context
  • Apache 2.0
  • runs locally on M4 Max and DGX Spark
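Rough math on the "runs locally" claim, as a sketch only. With a sparse MoE, all 198B parameters have to sit in memory even though only 11B are active per token, so the weight footprint is driven by the total count. The bit widths below are my assumption (the post does not say which quantization the local runs use), `weight_memory_gb` is a made-up helper, and this ignores KV cache, activations, and quantization overhead.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Counts weights only: no KV cache, activations, or quantization metadata.
    """
    bytes_total = total_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # 198B total parameters, per the spec list; bit widths are assumptions.
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: ~{weight_memory_gb(198, bits):.0f} GB")
    # 16-bit weights: ~396 GB
    #  8-bit weights: ~198 GB
    #  4-bit weights: ~99 GB
```

So a single-box local run presumably means something around 4-bit, with not much headroom left for a long context.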

Has anyone actually put this through agent evals, or am I just reading the release card?

submitted by /u/Skid_gates_99