The Problem: One API is a Single Point of Failure
Two months ago, a 503 error from an AI API provider killed user sessions mid-conversation. The developer’s app relied solely on OpenAI’s GPT-4 for real-time responses. When the outage hit, requests timed out, then failed. The manual fix—updating code and redeploying—took an hour. That’s an hour of downtime.
The Naive First Attempt
The first fix was a simple try-except fallback: try OpenAI, if it fails, try Anthropic. Here’s the code:
```python
import openai
import anthropic


def generate_response(prompt):
    try:
        return openai.ChatCompletion.create(model="gpt-4", messages=[{"role": "user", "content": prompt}])
    except Exception:
        # Primary failed for any reason: fall through to the secondary provider.
        try:
            return anthropic.complete(prompt=prompt, model="claude-v1")
        except Exception:
            raise Exception("Both providers failed")
```
This approach had four flaws:
- No retries for transient errors.
- Fixed fallback order—if OpenAI is down, Anthropic takes all load, but what if it also fails?
- No timeouts: a slow provider could hang the entire system.
- No insight into failure rates.
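Of these, the missing timeout is the easiest to close on its own. A minimal sketch, assuming each provider call is an async function: `asyncio.wait_for` bounds how long a single call may run. The names `call_with_timeout`, `slow_provider`, and `fast_provider` are hypothetical and used only for illustration.

```python
import asyncio


async def call_with_timeout(provider_call, prompt: str, timeout_s: float = 10.0) -> str:
    """Await provider_call(prompt); raise asyncio.TimeoutError past timeout_s."""
    # asyncio.wait_for cancels the underlying call when the deadline passes.
    return await asyncio.wait_for(provider_call(prompt), timeout=timeout_s)


async def slow_provider(prompt: str) -> str:
    # Hypothetical stand-in for a provider that hangs.
    await asyncio.sleep(5)
    return "too late"


async def fast_provider(prompt: str) -> str:
    # Hypothetical stand-in for a healthy provider.
    return f"echo: {prompt}"


async def demo() -> tuple:
    ok = await call_with_timeout(fast_provider, "hi", timeout_s=0.5)
    try:
        await call_with_timeout(slow_provider, "hi", timeout_s=0.05)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    return ok, timed_out
```

A timed-out call then looks like any other failure to the caller, so it can feed the same fallback path.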
The Solution: Weighted Multi-Provider Router
The developer built a Python library with three mechanisms:
- Weighted selection – Assign weights to providers (e.g., 3 for GPT-4, 1 for Claude, 1 for a free model). Each request picks a provider at random in proportion to its weight, so traffic is distributed proportionally over time. If a provider fails repeatedly, its weight is temporarily reduced.
- Exponential backoff with jitter – Retry failed requests with increasing delays, randomized to avoid thundering herd.
- Circuit breaker – If a provider fails 3 times in 60 seconds, stop sending requests to it for a cooldown period.
Here’s the core implementation:
```python
import asyncio
import random
import time
from typing import Awaitable, Callable, List


class AIProvider:
    def __init__(self, name: str, weight: int, callable: Callable[[str], Awaitable[str]]):
        self.name = name
        self.weight = weight
        self.callable = callable  # async function: prompt -> response text
        self.failures = 0  # consecutive failures since the last success
        self.last_failure_time = 0.0
        self.circuit_open = False


class MultiProviderRouter:
    def __init__(self, providers: List[AIProvider], circuit_breaker_threshold: int = 3, circuit_breaker_timeout: int = 60):
        self.providers = providers
        self.circuit_breaker_threshold = circuit_breaker_threshold
        self.circuit_breaker_timeout = circuit_breaker_timeout
        self._reset_tasks = set()  # keep references so reset tasks are not garbage-collected

    def _select_provider(self) -> AIProvider:
        # Weighted random choice among providers whose circuit is closed.
        available = [p for p in self.providers if not p.circuit_open]
        if not available:
            raise RuntimeError("All providers are in circuit breaker mode")
        total_weight = sum(p.weight for p in available)
        r = random.uniform(0, total_weight)
        cumulative = 0
        for p in available:
            cumulative += p.weight
            if r <= cumulative:
                return p
        return available[-1]

    async def call(self, prompt: str, max_retries: int = 3) -> str:
        last_error = None
        for attempt in range(max_retries):
            provider = self._select_provider()
            try:
                result = await provider.callable(prompt)
                provider.failures = 0
                return result
            except Exception as e:
                last_error = e
                provider.failures += 1
                provider.last_failure_time = time.time()
                if provider.failures >= self.circuit_breaker_threshold and not provider.circuit_open:
                    provider.circuit_open = True
                    task = asyncio.create_task(self._reset_circuit(provider))
                    self._reset_tasks.add(task)
                    task.add_done_callback(self._reset_tasks.discard)
                if attempt < max_retries - 1:
                    # Exponential backoff with jitter; skipped after the final attempt.
                    delay = (2 ** attempt) + random.random()
                    await asyncio.sleep(delay)
        raise RuntimeError("All retries exhausted") from last_error

    async def _reset_circuit(self, provider: AIProvider) -> None:
        # Close the circuit again once the cooldown has elapsed.
        await asyncio.sleep(self.circuit_breaker_timeout)
        provider.circuit_open = False
        provider.failures = 0
```
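Note that the router code counts consecutive failures and has no time window. To apply a "3 failures in 60 seconds" rule literally, failures could be tracked in a sliding window instead. The following is a sketch under that assumption; the class name `FailureWindow` and the injectable `clock` parameter are hypothetical and not part of the original library.

```python
import time
from collections import deque
from typing import Callable, Deque


class FailureWindow:
    """Tracks failure timestamps and reports when too many fall inside a window."""

    def __init__(self, threshold: int = 3, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.clock = clock  # injectable so tests can control time
        self._failures: Deque[float] = deque()

    def record_failure(self) -> bool:
        """Record one failure; return True if the breaker should open."""
        now = self.clock()
        self._failures.append(now)
        # Drop failures that are older than the window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        return len(self._failures) >= self.threshold
```

In the router, one `FailureWindow` per provider could take the place of the `failures` counter, with `circuit_open` set whenever `record_failure()` returns True.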
To use the router, wrap each provider's API call in an async function and pass the wrappers to MultiProviderRouter:
```python
import asyncio


async def call_openai(prompt: str) -> str:
    # your real implementation
    ...


async def call_anthropic(prompt: str) -> str:
    ...


router = MultiProviderRouter([
    AIProvider("openai", weight=3, callable=call_openai),
    AIProvider("anthropic", weight=2, callable=call_anthropic),
    # AIProvider("local", weight=1, callable=call_local_small_model),
])


async def main():
    # `await` is only valid inside a coroutine, so wrap the call.
    result = await router.call("Explain quantum entanglement like I'm 5")
    print(result)


asyncio.run(main())
```
The developer also added Prometheus metrics (counters and histograms) to track success/failure rates and adjust weights based on real data.
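How the metrics feed back into the weights is not shown in the post. One possible approach, sketched here with plain success/failure counts standing in for the Prometheus counters, scales each provider's base weight by its observed success rate. The function name `adjusted_weights` and the floor of 1 are assumptions made for illustration.

```python
from typing import Dict, Tuple


def adjusted_weights(base_weights: Dict[str, int],
                     counts: Dict[str, Tuple[int, int]],
                     min_weight: int = 1) -> Dict[str, int]:
    """Scale each base weight by the provider's observed success rate.

    counts maps provider name -> (successes, failures).
    Providers with no recorded traffic keep their base weight.
    """
    result = {}
    for name, base in base_weights.items():
        successes, failures = counts.get(name, (0, 0))
        total = successes + failures
        rate = successes / total if total else 1.0
        # Round to the nearest integer but never drop below min_weight,
        # so a struggling provider still receives occasional probe traffic.
        result[name] = max(min_weight, round(base * rate))
    return result
```

Keeping a minimum weight is a design choice: a provider whose weight drops to zero would never get the traffic needed to show it has recovered.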
Lessons Learned
- Quality vs. cost: Weighting GPT-4 higher kept quality high, but during slowdowns, the router used cheaper models, saving money. Trade-off: occasional lower-quality responses during outages.
- Circuit breaker tuning: Too sensitive (low threshold) switches too often, losing context. Too lenient keeps hitting a dead provider. Settled on 3 failures in 60 seconds.
- Idempotency: The router doesn’t guarantee exactly-once delivery. If a request times out but actually succeeded, downstream may get a duplicate. Handle that on your end.
- Debugging is harder: When a response looks weird, you need to check which provider served it. Added an X-Provider header to responses.
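For the idempotency point, one common way to handle duplicates is to attach a client-generated request ID and cache results under that ID. The following is a minimal in-memory sketch; `IdempotentCaller` and the `request_id` parameter are hypothetical and not part of the router described here.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class IdempotentCaller:
    """Caches results by request ID so a repeated request is not processed twice."""

    def __init__(self, call: Callable[[str], Awaitable[str]]):
        self._call = call
        self._results: Dict[str, str] = {}

    async def call(self, request_id: str, prompt: str) -> str:
        # A repeated request ID returns the stored result instead of
        # triggering a second provider call.
        if request_id in self._results:
            return self._results[request_id]
        result = await self._call(prompt)
        self._results[request_id] = result
        return result
```

This only deduplicates sequential repeats within one process. A production version would likely need expiry, a shared store, and handling for two concurrent requests carrying the same ID.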
What to Do Differently Next Time
Start with a simple fallback and add metrics before building the full router. The circuit breaker and weights came from seeing real failure patterns. Also consider using a hosted service that does this for you (e.g., ai.interwestinfo.com, though the author hasn’t used it). The technique is the same whether you build or buy.
Currently, the router handles 10,000+ requests a day with zero manual intervention. During a 6-hour outage, users barely noticed because the router silently switched to Anthropic, then to a local model.
The Real Takeaway
Resilience isn’t about eliminating failures—it’s about surviving them gracefully. A smart fallback strategy is cheap to implement and pays for itself the first time your primary API goes down. Don’t wait until your phone buzzes with angry users.
What’s your backup plan for AI API failures? Share your setup—simple fallback, multi-provider, or something totally different?






