Production-grade Python library for routing across every major LLM provider. Define your model catalog in YAML; chat, stream, call tools, and get structured output against OpenAI, Anthropic, Bedrock, Vertex, Gemini, Groq, and 12 more — without rewriting your code.
```bash
pip install ai5labs-relay
```

The features OSS gateways treat as "v2" — observability, audit logging, PII redaction, governance — ship in v0.1 of Relay.
Connect any Model Context Protocol server (GitHub, Slack, Postgres, Playwright) and use its tools against any provider — including providers without native MCP support.
Define your JSON Schema once; it compiles to OpenAI strict and Anthropic shapes today, with Gemini, Bedrock, and Cohere shapes ready for v0.2 native adapters. Mastra-style instruction injection covers unsupported keywords.
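For a feel of what the compilation step produces, here is a toy sketch targeting OpenAI's published strict `response_format` shape (the function name and rewrite rules are illustrative, not Relay's internals):

```python
def compile_for_openai(name: str, schema: dict) -> dict:
    """Toy compiler for OpenAI's strict response_format shape.

    Top-level objects only, for brevity; a real compiler recurses into
    nested objects and rewrites unsupported keywords.
    """
    strict = dict(schema)
    if strict.get("type") == "object":
        # Strict mode requires additionalProperties: false and every
        # property listed under `required`.
        strict["additionalProperties"] = False
        strict["required"] = list(strict.get("properties", {}))
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": strict},
    }

print(compile_for_openai("invoice", {
    "type": "object",
    "properties": {"vendor": {"type": "string"}, "total": {"type": "number"}},
}))
```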
Pass a Pydantic model; get a validated instance back. Works against every provider. Auto-retry on validation failure.
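A minimal sketch of that flow, assuming an instructor-style `response_model=` keyword and that the call returns the validated instance directly (both are assumptions about Relay's exact API):

```python
import asyncio

from pydantic import BaseModel

from relay import Hub


class Invoice(BaseModel):
    vendor: str
    total_usd: float


async def main() -> None:
    async with Hub.from_yaml("models.yaml") as hub:
        # response_model= is an assumed spelling; check the Relay docs.
        invoice = await hub.chat(
            "smart",
            messages=[{"role": "user", "content": "ACME invoice, total $1,200."}],
            response_model=Invoice,
        )
        print(invoice.vendor, invoice.total_usd)  # validated Invoice instance


asyncio.run(main())
```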
Every response carries `source`, `confidence`, and `fetched_at`. Live AWS Pricing / Azure Retail / OpenRouter, falling back to a maintained snapshot.
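Reading that provenance might look roughly like this (the three field names come from the description above; the `resp.cost.*` attribute path is an assumption):

```python
import asyncio

from relay import Hub


async def main() -> None:
    async with Hub.from_yaml("models.yaml") as hub:
        resp = await hub.chat("fast", messages=[{"role": "user", "content": "hi"}])
        # source / confidence / fetched_at are documented above;
        # the exact attribute layout shown here is assumed.
        print(resp.cost_usd)
        print(resp.cost.source)       # live feed vs. maintained snapshot
        print(resp.cost.confidence)
        print(resp.cost.fetched_at)


asyncio.run(main())
```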
Distinguish rate-limit, context-window, content-policy, and auth errors. Each gets the right behavior: retry, fall back, or fail fast.
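Consuming the taxonomy might look like this sketch (the exception class names are assumptions; only the four categories above come from Relay):

```python
import asyncio

from relay import Hub
# Illustrative names: Relay's actual exception classes may differ.
from relay.errors import (
    AuthError,
    ContentPolicyError,
    ContextWindowError,
    RateLimitError,
)


async def ask(hub: Hub, prompt: str, retries: int = 3) -> str:
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(retries):
        try:
            return (await hub.chat("smart", messages=messages)).text
        except RateLimitError:
            await asyncio.sleep(2**attempt)  # transient: back off and retry
        except ContextWindowError:
            # Fall back to an alias with a larger window (illustrative).
            return (await hub.chat("fast", messages=messages)).text
        except (ContentPolicyError, AuthError):
            raise  # permanent: retrying won't help, fail fast
    raise RuntimeError("rate-limited on every attempt")
```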
Native `gen_ai.*` spans, token-usage histograms, cost histograms. Works with Datadog, Honeycomb, Langfuse, Arize, Phoenix out of the box.
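Because the spans follow the OpenTelemetry `gen_ai.*` semantic conventions, a stock OTel pipeline picks them up. A minimal console-exporter setup, assuming Relay emits through the global tracer provider:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Plain OpenTelemetry SDK setup. Swap ConsoleSpanExporter for your
# vendor's OTLP exporter (Datadog, Honeycomb, Langfuse, ...) and the
# same gen_ai.* spans flow there.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
```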
Regex / Presidio redaction before the prompt leaves your process. Structured audit events to pluggable sinks (file, S3, Splunk, callback).
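As a sketch of the regex path (patterns and replacement format here are minimal examples, not Relay's shipped rules):

```python
import re

# The kind of in-process regex pass applied before a prompt is sent.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text


print(redact("Reach Jane at jane@example.com, SSN 123-45-6789."))
# Reach Jane at [REDACTED:email], SSN [REDACTED:ssn].
```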
Hub-level exact-match cache plus Anthropic prompt-cache passthrough via CacheHint markers. Compose them; users decide.
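Usage might look roughly like this; `CacheHint` is named above, but the import path and placement semantics are assumptions:

```python
from relay import Hub
from relay.cache import CacheHint  # import path is an assumption

messages = [
    {"role": "system", "content": long_system_prompt},
    CacheHint(),  # assumed marker: cache everything above this point
    {"role": "user", "content": "First question about the document"},
]
resp = await hub.chat("smart", messages=messages)
```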
`relay models compare sonnet 4o flash`. `relay models recommend --task code --budget cheap`. Pick the right model with public benchmarks side-by-side.
Your model catalog lives in version-controlled YAML. Aliases like `smart` and `fast` in code; the actual provider, model id, credentials, and routing strategy live in one file your team can review.
```yaml
# models.yaml
version: 1
models:
  fast:   { target: groq/llama-3.3-70b-versatile, credential: $env.GROQ_API_KEY }
  smart:  { target: anthropic/claude-sonnet-4-5, credential: $env.ANTHROPIC_API_KEY }
  vision: { target: openai/gpt-4o-mini, credential: $env.OPENAI_API_KEY }
groups:
  default:
    strategy: fallback
    members: [smart, fast]
```

```python
from relay import Hub

async with Hub.from_yaml("models.yaml") as hub:
    resp = await hub.chat(
        "smart",
        messages=[{"role": "user", "content": "What is 2+2?"}],
    )
    print(resp.text, resp.cost_usd)
```

`pip install` to first call in 60 seconds. Pydantic-typed responses, async-native, with cost provenance attached. Strict typing under `mypy --strict`. Tested on Python 3.10–3.13.
Tool-call argument fragments are merged by index, not id — fixing the LiteLLM #20711 bug out of the gate. Hypothesis property tests verify the invariant; a toy version of the merge rule follows the streaming example below.
```python
async for ev in hub.stream("smart", messages=[...]):
    if ev.type == "text_delta":
        print(ev.text, end="", flush=True)
    elif ev.type == "thinking_delta":  # Anthropic extended thinking
        ...
    elif ev.type == "end":
        print(f"\n[{ev.response.latency_ms:.0f}ms, "
              f"${ev.response.cost_usd:.4f}]")
```
```python
from relay import Hub
from relay.mcp import MCPManager

mcp = MCPManager()
await mcp.add_stdio(
    "github",
    command="npx",
    args=["-y", "@modelcontextprotocol/server-github"],
)

hub = Hub.from_yaml("models.yaml")
hub.attach_mcp(mcp)

# Tools from any MCP server work against ANY provider
tools = await hub.mcp_tools()
resp = await hub.chat("smart", messages=[...], tools=tools)
```

Other gateways force you to pick MCP-aware providers. Relay translates MCP tool schemas into each provider's native shape, so a GitHub MCP server works against Bedrock Claude as easily as against OpenAI.
The catalog ships with the library — 434 models, with public benchmark scores where the provider has published them. Rankings below are the 10 frontier models with full scores, sorted by composite quality index.
| Model | Quality | In / Out per 1M |
|---|---|---|
| openai/o1 | 85 | $15.00 / $60.00 |
| anthropic/claude-opus-4-5 | 80 | $15.00 / $75.00 |
| google/gemini-2.5-pro | 80 | $1.25 / $10.00 |
| openai/o3-mini | 78 | $1.10 / $4.40 |
| deepseek/deepseek-reasoner | 76 | $0.55 / $2.19 |
| anthropic/claude-sonnet-4-5 | 73 | $3.00 / $15.00 |
| deepseek/deepseek-chat | 72 | $0.32 / $0.89 |
| xai/grok-3 | 72 | $3.00 / $15.00 |
| openai/gpt-4o | 71 | $2.50 / $10.00 |
| anthropic/claude-3-5-sonnet-20241022 | 70 | $3.00 / $15.00 |
Sourced from each provider's published numbers; verify before quoting. Browse all 434 models →
Static, rule-based recommender ships free in the library. Per-query semantic routing — looking at the prompt and choosing the model automatically — is a hosted-gateway feature.
Filter the catalog by task, budget, and required capabilities. Deterministic, offline, no LLM in the loop — same answer every time for the same constraints.
```bash
# Top 5 cheap code models, JSON for an agent
relay models recommend \
  --task code --budget cheap \
  --limit 5 --json
```

```python
# In Python
from relay.catalog import get_catalog

top = sorted(
    (r for r in get_catalog().values() if r.benchmarks),
    key=lambda r: r.benchmarks.quality_index or 0,
    reverse=True,
)[:5]
```

Free forever. Apache-2.0. Runs offline against the catalog snapshot.
Looks at the actual prompt and picks the model from its content — intent, length, structured-output requirements, tool-use patterns, language. Re-evaluated as new models ship.
Whichever model you pick, Relay shouldn't slow it down. Three runs against an identical mock backend, vs raw httpx and LiteLLM:
| Metric | Relay | LiteLLM | Verdict |
|---|---|---|---|
| Cold start (import) | 110–152 ms | 1,304–2,078 ms | 5–19× faster |
| Streaming TTFT p50 | 13.4–14.6 ms | 15.4–18.6 ms | ~13–27% faster |
| Chat overhead p50 | 2.2–3.1 ms | 3.3–13.4 ms | Tied / occasionally faster |
| Chat overhead p99 stability | 19–23 ms range | 23–41 ms range | Consistent tail |
Single machine, single Python version, 1000 chat req @ concurrency 20, 50 ms mock backend. Run it yourself — full methodology + raw numbers in BENCHMARKS.md.
The library is free forever. The hosted gateway is in design-partner mode — join the waitlist for early access.
Apache-2.0. The full feature set. Run it in your own infrastructure with your own provider keys.
BYOK proxy with multi-tenant ops, plus a per-query semantic router that picks the model from the prompt.
VPC deployment, SOC 2 attestation, BAA / DPA paperwork, 24/7 SLA, custom features.
We're onboarding 10 design partners for the first cohort. Free during the program; influence the roadmap.
We'll only email you about Relay updates. Unsubscribe with one click.