Working on an LLM gateway (Bifrost, open source: https://github.com/maxim-ai/bifrost), I ran into an interesting problem: how do you route requests across multiple LLM providers when failures happen gradually?
Traditional load balancing assumes binary states – up or down. But LLM API degradations are messy. A region starts timing out, some routes spike in errors, latency drifts up over minutes. By the time it's a full outage, you've already burned through retries and user patience.
Static configs don't cut it. You can't pre-model which provider/region/key will degrade and how.
The challenge: build adaptive routing that learns from live traffic and adjusts in real time, with <10µs overhead per request. It had to sit on the hot path without becoming the bottleneck.
Why Go made sense:
- Needed lock-free scoring updates across concurrent requests
- EWMA (exponentially weighted moving averages) for smoothing signals without allocations
- Microsecond-level latency requirements ruled out Python/Node
- Wanted predictable GC pauses under high RPS
How it works: each route gets a continuously updated score based on live signals – error rates, token-adjusted latency outliers (we call it TACOS lol), utilization, recovery momentum. Traffic goes to the top-scoring candidates, with lightweight exploration mixed in so routing doesn't overfit to a single route.
When it detects rate-limit hits (TPM/RPM), it remembers them and allocates just enough traffic to stay under the limits going forward, with automatic fallback to healthy routes when degradation happens.
Result: <10µs overhead, handles 5K+ RPS, adapts to provider issues without manual intervention.
Running in production now. Curious if others have tackled similar real-time scoring/routing problems in Go where performance was critical?