Working on an LLM gateway (Bifrost, open source: https://github.com/maxim-ai/bifrost), I ran into an interesting problem: how do you route requests across multiple LLM providers when failures happen gradually?
Traditional load balancing assumes binary states – up or down. But LLM API degradations are messy. A region starts timing out, some routes spike in errors, latency drifts up over minutes. By the time it's a full outage, you've already burned through retries and user patience.
Static configs don't cut it. You can't pre-model which provider/region/key will degrade and how.
The challenge: build adaptive routing that learns from live traffic and adjusts in real time, with <10µs overhead per request. It had to sit on the hot path without becoming the bottleneck.
Why Go made sense:
- Needed lock-free scoring updates across concurrent requests
- EWMA (exponentially weighted moving averages) for smoothing signals without allocations
- Microsecond-level latency requirements ruled out Python/Node
- Wanted predictable GC pauses under high RPS
How it works: each route gets a continuously updated score based on live signals – error rates, token-adjusted latency outliers (we call it TACOS lol), utilization, recovery momentum. Traffic goes to the top-scoring candidates, with lightweight exploration mixed in so routing doesn't overfit to a single route.
When it detects rate-limit hits (TPM/RPM), it remembers them and allocates just enough traffic to stay under the limits going forward, with automatic fallback to healthy routes when degradation happens.
Result: <10µs overhead, handles 5K+ RPS, adapts to provider issues without manual intervention.
Running in production now. Curious if others have tackled similar real-time scoring/routing problems in Go where performance was critical?