Production-grade Python library for routing across every major LLM provider. Define your model catalog in YAML; chat, stream, call tools, and get structured output against OpenAI, Anthropic, Bedrock, Vertex, Gemini, Groq, and 12 more — without rewriting your code.
```bash
pip install ai5labs-relay
```

The features OSS gateways treat as "v2" — observability, audit logging, PII redaction, governance — ship in v0.1 of Relay.
Connect any Model Context Protocol server (GitHub, Slack, Postgres, Playwright) and use its tools against any provider — including providers without native MCP support.
Define your JSON Schema once; it compiles to OpenAI strict and Anthropic shapes today, with Gemini, Bedrock, and Cohere shapes ready for v0.2 native adapters. Mastra-style instruction injection covers unsupported keywords.
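For a feel of what the compilation step produces, here is a toy sketch targeting OpenAI's published strict `response_format` shape (the function name and rewrite rules are illustrative, not Relay's internals):

```python
def compile_for_openai(name: str, schema: dict) -> dict:
    """Toy compiler for OpenAI's strict response_format shape.

    Top-level objects only, for brevity; a real compiler recurses into
    nested objects and rewrites unsupported keywords.
    """
    strict = dict(schema)
    if strict.get("type") == "object":
        # Strict mode requires additionalProperties: false and every
        # property listed under `required`.
        strict["additionalProperties"] = False
        strict["required"] = list(strict.get("properties", {}))
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": strict},
    }

print(compile_for_openai("invoice", {
    "type": "object",
    "properties": {"vendor": {"type": "string"}, "total": {"type": "number"}},
}))
```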
Pass a Pydantic model; get a validated instance back. Works against every provider. Auto-retry on validation failure.
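A minimal sketch of that flow, assuming an instructor-style `response_model=` keyword and that the call returns the validated instance directly (both are assumptions about Relay's exact API):

```python
import asyncio

from pydantic import BaseModel

from relay import Hub


class Invoice(BaseModel):
    vendor: str
    total_usd: float


async def main() -> None:
    async with Hub.from_yaml("models.yaml") as hub:
        # response_model= is an assumed spelling; check the Relay docs.
        invoice = await hub.chat(
            "smart",
            messages=[{"role": "user", "content": "ACME invoice, total $1,200."}],
            response_model=Invoice,
        )
        print(invoice.vendor, invoice.total_usd)  # validated Invoice instance


asyncio.run(main())
```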
Every response carries `source`, `confidence`, and `fetched_at`. Live AWS Pricing / Azure Retail / OpenRouter, falling back to a maintained snapshot.
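Reading that provenance might look roughly like this (the three field names come from the description above; the `resp.cost.*` attribute path is an assumption):

```python
import asyncio

from relay import Hub


async def main() -> None:
    async with Hub.from_yaml("models.yaml") as hub:
        resp = await hub.chat("fast", messages=[{"role": "user", "content": "hi"}])
        # source / confidence / fetched_at are documented above;
        # the exact attribute layout shown here is assumed.
        print(resp.cost_usd)
        print(resp.cost.source)       # live feed vs. maintained snapshot
        print(resp.cost.confidence)
        print(resp.cost.fetched_at)


asyncio.run(main())
```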
Distinguish rate-limit, context-window, content-policy, and auth errors. Each gets the right behavior: retry, fall back, or fail fast.
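Consuming the taxonomy might look like this sketch (the exception class names are assumptions; only the four categories above come from Relay):

```python
import asyncio

from relay import Hub
# Illustrative names: Relay's actual exception classes may differ.
from relay.errors import (
    AuthError,
    ContentPolicyError,
    ContextWindowError,
    RateLimitError,
)


async def ask(hub: Hub, prompt: str, retries: int = 3) -> str:
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(retries):
        try:
            return (await hub.chat("smart", messages=messages)).text
        except RateLimitError:
            await asyncio.sleep(2**attempt)  # transient: back off and retry
        except ContextWindowError:
            # Fall back to an alias with a larger window (illustrative).
            return (await hub.chat("fast", messages=messages)).text
        except (ContentPolicyError, AuthError):
            raise  # permanent: retrying won't help, fail fast
    raise RuntimeError("rate-limited on every attempt")
```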
Native `gen_ai.*` spans, token-usage histograms, cost histograms. Works with Datadog, Honeycomb, Langfuse, Arize, Phoenix out of the box.
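Because the spans follow the OpenTelemetry `gen_ai.*` semantic conventions, a stock OTel pipeline picks them up. A minimal console-exporter setup, assuming Relay emits through the global tracer provider:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Plain OpenTelemetry SDK setup. Swap ConsoleSpanExporter for your
# vendor's OTLP exporter (Datadog, Honeycomb, Langfuse, ...) and the
# same gen_ai.* spans flow there.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
```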
Regex / Presidio redaction before the prompt leaves your process. Structured audit events to pluggable sinks (file, S3, Splunk, callback).
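As a sketch of the regex path (patterns and replacement format here are minimal examples, not Relay's shipped rules):

```python
import re

# The kind of in-process regex pass applied before a prompt is sent.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text


print(redact("Reach Jane at jane@example.com, SSN 123-45-6789."))
# Reach Jane at [REDACTED:email], SSN [REDACTED:ssn].
```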
Hub-level exact-match cache plus Anthropic prompt-cache passthrough via CacheHint markers. Compose them; users decide.
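Usage might look roughly like this; `CacheHint` is named above, but the import path and placement semantics are assumptions:

```python
from relay import Hub
from relay.cache import CacheHint  # import path is an assumption

messages = [
    {"role": "system", "content": long_system_prompt},
    CacheHint(),  # assumed marker: cache everything above this point
    {"role": "user", "content": "First question about the document"},
]
resp = await hub.chat("smart", messages=messages)
```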
`relay models compare sonnet 4o flash`. `relay models recommend --task code --budget cheap`. Pick the right model with public benchmarks side-by-side.
Your model catalog lives in version-controlled YAML. Aliases like `smart` and `fast` in code; the actual provider, model id, credentials, and routing strategy live in one file your team can review.
```yaml
# models.yaml
version: 1
models:
  fast:   { target: groq/llama-3.3-70b-versatile, credential: $env.GROQ_API_KEY }
  smart:  { target: anthropic/claude-sonnet-4-5, credential: $env.ANTHROPIC_API_KEY }
  vision: { target: openai/gpt-4o-mini, credential: $env.OPENAI_API_KEY }
groups:
  default:
    strategy: fallback
    members: [smart, fast]
```

```python
from relay import Hub

async with Hub.from_yaml("models.yaml") as hub:
    resp = await hub.chat(
        "smart",
        messages=[{"role": "user", "content": "What is 2+2?"}],
    )
    print(resp.text, resp.cost_usd)
```

`pip install` to first call in 60 seconds. Pydantic-typed responses, async-native, with cost provenance attached. Strict typing under `mypy --strict`. Tested on Python 3.10–3.13.
Tool-call argument fragments are merged by index, not id — fixing the LiteLLM #20711 bug out of the gate. Hypothesis property tests verify the invariant; a toy version of the merge rule follows the streaming example below.
```python
async for ev in hub.stream("smart", messages=[...]):
    if ev.type == "text_delta":
        print(ev.text, end="", flush=True)
    elif ev.type == "thinking_delta":  # Anthropic extended thinking
        ...
    elif ev.type == "end":
        print(f"\n[{ev.response.latency_ms:.0f}ms, "
              f"${ev.response.cost_usd:.4f}]")
```
```python
from relay import Hub
from relay.mcp import MCPManager

mcp = MCPManager()
await mcp.add_stdio(
    "github",
    command="npx",
    args=["-y", "@modelcontextprotocol/server-github"],
)

hub = Hub.from_yaml("models.yaml")
hub.attach_mcp(mcp)

# Tools from any MCP server work against ANY provider
tools = await hub.mcp_tools()
resp = await hub.chat("smart", messages=[...], tools=tools)
```

Other gateways force you to pick MCP-aware providers. Relay translates MCP tool schemas into each provider's native shape, so a GitHub MCP server works against Bedrock Claude as easily as against OpenAI.
The catalog ships with the library — 434 models, with public benchmark scores where the provider has published them. Rankings below are the 10 frontier models with full scores, sorted by composite quality index.
| Model | Quality | In / Out per 1M |
|---|---|---|
| openai/o1 | 85 | $15.00 / $60.00 |
| anthropic/claude-opus-4-5 | 80 | $15.00 / $75.00 |
| google/gemini-2.5-pro | 80 | $1.25 / $10.00 |
| openai/o3-mini | 78 | $1.10 / $4.40 |
| deepseek/deepseek-reasoner | 76 | $0.55 / $2.19 |
| anthropic/claude-sonnet-4-5 | 73 | $3.00 / $15.00 |
| deepseek/deepseek-chat | 72 | $0.32 / $0.89 |
| xai/grok-3 | 72 | $3.00 / $15.00 |
| openai/gpt-4o | 71 | $2.50 / $10.00 |
| anthropic/claude-3-5-sonnet-20241022 | 70 | $3.00 / $15.00 |
Sourced from each provider's published numbers; verify before quoting. Browse all 434 models →
Static, rule-based recommender ships free in the library. Per-query semantic routing — looking at the prompt and choosing the model automatically — is a hosted-gateway feature.
Filter the catalog by task, budget, and required capabilities. Deterministic, offline, no LLM in the loop — same answer every time for the same constraints.
```bash
# Top 5 cheap code models, JSON for an agent
relay models recommend \
  --task code --budget cheap \
  --limit 5 --json
```

```python
# In Python
from relay.catalog import get_catalog

top = sorted(
    (r for r in get_catalog().values() if r.benchmarks),
    key=lambda r: r.benchmarks.quality_index or 0,
    reverse=True,
)[:5]
```

Free forever. Apache-2.0. Runs offline against the catalog snapshot.
Looks at the actual prompt and picks the model from its content — intent, length, structured-output requirements, tool-use patterns, language. Re-evaluated as new models ship.
Whichever model you pick, Relay shouldn't slow it down. Three runs against an identical mock backend, vs raw httpx and LiteLLM:
| Metric | Relay | LiteLLM | Verdict |
|---|---|---|---|
| Cold start (import) | 110–152 ms | 1,304–2,078 ms | 5–19× faster |
| Streaming TTFT p50 | 13.4–14.6 ms | 15.4–18.6 ms | ~13–27% faster |
| Chat overhead p50 | 2.2–3.1 ms | 3.3–13.4 ms | Tied / occasionally faster |
| Chat overhead p99 stability | 19–23 ms range | 23–41 ms range | Consistent tail |
Single machine, single Python version, 1000 chat req @ concurrency 20, 50 ms mock backend. Run it yourself — full methodology + raw numbers in BENCHMARKS.md.
The library is free forever. The hosted gateway is in design-partner mode — join the waitlist for early access.
Apache-2.0. The full feature set. Run it in your own infrastructure with your own provider keys.
BYOK proxy with multi-tenant ops, plus a per-query semantic router that picks the model from the prompt.
VPC deployment, SOC 2 attestation, BAA / DPA paperwork, 24/7 SLA, custom features.
We're onboarding 10 design partners for the first cohort. Free during the program; influence the roadmap.
We'll only email you about Relay updates. Unsubscribe with one click.