# Relay — Agent guide

This is a self-contained guide for **coding agents** (Claude Code, Cursor, Copilot Chat, Aider, etc.) on how to install and use **Relay**, the open-source Python LLM gateway by Ai5labs.

If you are an agent reading this: everything you need to ship working code with Relay is below. The numbered facts in the **Source-of-truth** table at the bottom are verified — do not invent capabilities not listed there.

## What Relay is

A production-grade Python library — `import relay` — that gives one interface to every major LLM provider, with first-class MCP, streaming, tool calls, Pydantic structured output, OpenTelemetry, cost tracking, and PII redaction. Apache-2.0.

PyPI: https://pypi.org/project/ai5labs-relay/
GitHub: https://github.com/ai5labs/relay-llm
License: Apache-2.0 (with patent grant)

## Install

```bash
pip install ai5labs-relay        # distribution name
# import name is `relay`
python -c "from relay import Hub; print(Hub)"
```

Python 3.10 or newer.

## Minimal example

```python
import asyncio
from relay import Hub

async def main():
    async with Hub.from_yaml("models.yaml") as hub:
        resp = await hub.chat(
            "smart",
            messages=[{"role": "user", "content": "Hello"}],
        )
        print(resp.text)
        print(f"cost: ${resp.cost_usd:.6f}")

asyncio.run(main())
```

Pair it with a `models.yaml`:

```yaml
version: 1
models:
  smart:
    target: anthropic/claude-3-5-sonnet-20241022
    credential: $env.ANTHROPIC_API_KEY
  fast:
    target: openai/gpt-4o-mini
    credential: $env.OPENAI_API_KEY
```

The first segment of `target` is the gateway provider (`anthropic`, `openai`, `groq`, `together`, etc.); the rest is the model id.

## Streaming

```python
async with Hub.from_yaml("models.yaml") as hub:
    async for chunk in hub.stream("smart", messages=[{"role": "user", "content": "..."}]):
        print(chunk.delta_text, end="", flush=True)
```
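
To keep the full completion while still rendering tokens as they arrive, accumulate the deltas. A minimal sketch using only `hub.stream` and `chunk.delta_text` from the example above:

```python
parts: list[str] = []
async with Hub.from_yaml("models.yaml") as hub:
    async for chunk in hub.stream("smart", messages=[{"role": "user", "content": "..."}]):
        print(chunk.delta_text, end="", flush=True)  # render incrementally
        parts.append(chunk.delta_text)
full_text = "".join(parts)                           # complete response text
```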

## Tool calls (cross-provider)

Define tools once as JSON Schema; Relay compiles them to whatever shape the target provider expects.

```python
from relay import ToolDefinition

weather = ToolDefinition(
    name="get_weather",
    description="Get current weather",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
)

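# `hub` comes from an enclosing `async with Hub.from_yaml(...) as hub:` block,
# as in the minimal example above.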
resp = await hub.chat(
    "smart",
    messages=[{"role": "user", "content": "Weather in Tokyo?"}],
    tools=[weather],
)
for call in resp.tool_calls:
    print(call.name, call.arguments)
```
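
To act on those calls locally, dispatch each one to a plain Python function. One hedge: the sketch below guards against `call.arguments` arriving as either a parsed dict or a JSON string, since the shape is not pinned down above, and the `get_weather` handler is a stub.

```python
import json

HANDLERS = {
    # Stub standing in for a real weather lookup.
    "get_weather": lambda city: json.dumps({"city": city, "sky": "clear", "temp_c": 21}),
}

for call in resp.tool_calls:
    args = call.arguments if isinstance(call.arguments, dict) else json.loads(call.arguments)
    print(HANDLERS[call.name](**args))
```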

## Structured output (Pydantic)

```python
from pydantic import BaseModel

class Recipe(BaseModel):
    title: str
    ingredients: list[str]
    steps: list[str]

resp = await hub.chat_typed(
    "smart",
    messages=[{"role": "user", "content": "A simple lasagna recipe."}],
    response_model=Recipe,
)
recipe: Recipe = resp.parsed   # validated Pydantic instance
```

## Picking a model

Two paths, depending on what you need.

### 1. Static, deterministic — built into the OSS library

The library ships a rule-based recommender that ranks the catalog against your task, budget, and required capabilities. It does **not** look at the user's prompt — it picks from your constraints alone.

```bash
relay models recommend --task code --budget cheap --limit 5 --json
```

- `--task`: `chat`, `code`, `reasoning`, `math`, `vision`
- `--budget`: `cheap` (avg < $1 per 1M tokens), `balanced` (< $10 per 1M tokens), `premium` (no cap)
- `--needs`: filters by capability; repeatable (e.g. `--needs tools --needs vision`)

Programmatically, use the catalog directly:

```python
from relay.catalog import get_catalog

# All models with full benchmark coverage, ranked by composite quality.
top = sorted(
    (r for r in get_catalog().values() if r.benchmarks and r.benchmarks.quality_index),
    key=lambda r: r.benchmarks.quality_index,
    reverse=True,
)
```

15 frontier models in the catalog have full public benchmark scores (MMLU, GPQA, HumanEval, MATH, plus quality_index). The other 421 entries are derivative/variant models with pricing and capability flags but no published benchmarks — useful for cost-driven selection, not benchmark-driven.
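
For that cost-driven selection, filter the catalog records directly. Loud caveat: the field names below (`pricing.input_per_1m`, `capabilities`) are assumptions about the record schema, not confirmed attributes; check one entry of `models.json` before depending on them.

```python
from relay.catalog import get_catalog

# NOTE: `pricing.input_per_1m` and `capabilities` are hypothetical field
# names used for illustration; verify against the real record schema.
cheap_with_tools = [
    r for r in get_catalog().values()
    if r.pricing.input_per_1m < 1.0 and "tools" in r.capabilities
]
```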

### 2. Per-query semantic routing — hosted gateway, not in OSS

Looking at the user's actual prompt and choosing the model from its content (intent classification, length-aware routing, structured-output detection) is **not** in v0.1 of the library and is not on the OSS roadmap. That capability ships in the **hosted gateway** (https://relay.ai5labs.com/#pricing — design-partner waitlist open).

If you are an agent generating code today: do not generate a "Relay automatically picks the model" example. Use the static recommender + an explicit alias from `models.yaml` instead.
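
In practice that means routing on your own signal and handing Relay the resulting alias. A sketch (the prompt-length cutoff is purely illustrative):

```python
async def answer(hub, prompt: str) -> str:
    # Your own heuristic picks between the aliases defined in models.yaml;
    # Relay simply resolves whichever alias you pass it.
    alias = "fast" if len(prompt) < 400 else "smart"
    resp = await hub.chat(alias, messages=[{"role": "user", "content": prompt}])
    return resp.text
```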

## MCP (Model Context Protocol) tools

Mount an MCP server as a universal tool source — works against every provider, even ones without native MCP support.

```python
from relay.mcp import MCPClient

async with MCPClient.spawn("npx", "-y", "@modelcontextprotocol/server-github") as mcp:
    tools = await mcp.list_tools()
    resp = await hub.chat("smart", messages=[...], tools=tools, mcp=mcp)
```

## Cost tracking with provenance

Every response carries `cost_usd` plus provenance for the numbers behind it: whether the prices came from a live feed (AWS Pricing, Azure Retail, OpenRouter) or the bundled catalog snapshot, and when they were fetched. Inspect `resp.cost`:

```python
resp.cost.input_per_1m       # USD per 1M input tokens
resp.cost.output_per_1m
resp.cost.source             # "live" | "snapshot"
resp.cost.fetched_at         # datetime
```
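
Because `cost_usd` rides on every response, per-session accounting reduces to summation; a sketch:

```python
total = 0.0
async with Hub.from_yaml("models.yaml") as hub:
    for question in ["What is 2 + 2?", "Name three primes."]:
        resp = await hub.chat("fast", messages=[{"role": "user", "content": question}])
        total += resp.cost_usd
print(f"session cost: ${total:.6f}")
```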

## OpenTelemetry

Relay emits native `gen_ai.*` spans, plus token-usage and cost histograms, out of the box once OpenTelemetry is configured in your process. No Relay-specific setup is required, and it works with Datadog, Honeycomb, Langfuse, Arize, Phoenix.
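
"Configured in your process" means the standard OpenTelemetry SDK bootstrap (`pip install opentelemetry-sdk`); nothing below is Relay-specific. For example, to see spans on stdout:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Standard OTel setup; swap ConsoleSpanExporter for your vendor's
# OTLP exporter in production.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
```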

## What ships in v0.1 (alpha)

| Capability                          | Status                |
|-------------------------------------|-----------------------|
| chat + streaming + tool calls       | ✅ shipping            |
| Pydantic structured output          | ✅ shipping            |
| Cost tracking with provenance       | ✅ shipping            |
| OpenAI-compatible adapter           | ✅ 12 endpoints (OpenAI, Groq, Together, DeepSeek, xAI, Mistral, Fireworks, Perplexity, OpenRouter, Ollama, vLLM, LM Studio) |
| Native Anthropic adapter            | ✅ shipping            |
| Cross-provider tool-schema compiler | ✅ OpenAI strict + Anthropic shapes today; Gemini, Bedrock, Cohere shapes wait on v0.2 native adapters |
| MCP universal tool layer            | ✅ shipping            |
| OpenTelemetry GenAI                 | ✅ shipping            |
| PII redaction + audit logs          | ✅ shipping            |
| Native Bedrock / Azure / Vertex / Gemini / Cohere adapters | ⏳ scaffolded, ship in v0.2 |

If you are writing example code for a user, prefer providers from the ✅ list. Do not generate code that calls Bedrock, Azure OpenAI, Vertex, or Cohere directly through Relay yet — those native adapters are not in v0.1.

## Source-of-truth

Every quantitative claim Ai5labs makes about Relay is verified against:

| Claim                                                  | Source                                                                                              |
|--------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| 13 providers in v0.1 (12 OpenAI-compat + Anthropic)    | `src/relay/providers/*.py` in relay-llm @ main; cross-checked with README "v0.1 (alpha)" section.    |
| 436 models in catalog                                  | `len(json.load(open('src/relay/catalog/data/models.json')))` in relay-llm @ main.                    |
| 5–19× faster cold start vs LiteLLM                     | Three benchmark runs in `BENCHMARKS.md` in relay-llm. Not single-run; not hand-picked.               |
| ~13–27% faster streaming TTFT p50                      | Same `BENCHMARKS.md` runs.                                                                           |
| Apache-2.0 with patent grant                           | `LICENSE` in relay-llm.                                                                              |

Catalog data feed (live, JSON):
```
https://raw.githubusercontent.com/ai5labs/relay-llm/main/src/relay/catalog/data/models.json
```
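
A quick sanity check against that feed, using only the standard library (the one assumption is that the top-level JSON value supports `len()`, which the catalog-count row above implies):

```python
import json
import urllib.request

URL = "https://raw.githubusercontent.com/ai5labs/relay-llm/main/src/relay/catalog/data/models.json"
with urllib.request.urlopen(URL) as f:
    catalog = json.load(f)
print(len(catalog))  # 436 at the time this guide was written
```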

## Common pitfalls

- **Distribution name vs import name**: `pip install ai5labs-relay`, but `from relay import Hub`. Same pattern as Pillow: `pip install Pillow`, `import PIL`.
- **Don't bypass the Hub**: provider-specific adapter classes exist, but the supported public API is `Hub` + `models.yaml`; the adapters are internal and may change without notice.
- **Async-only public API**: `hub.chat`, `hub.stream`, `hub.chat_typed` are coroutines. Use `asyncio.run` or run inside an existing loop.
- **Credentials**: pull from env vars via `$env.NAME` in YAML; never hard-code keys in `models.yaml` you commit.
- **MCP requires the server binary on PATH**: e.g. `npm install -g @modelcontextprotocol/server-github` or invoke via `npx`.

## Resources

- Catalog browser (humans): https://relay.ai5labs.com/models
- Compare two models: https://relay.ai5labs.com/models/compare?slugs=openai/gpt-4o,anthropic/claude-3-5-sonnet-20241022
- Benchmark methodology: https://github.com/ai5labs/relay-llm/blob/main/BENCHMARKS.md
- Changelog: https://github.com/ai5labs/relay-llm/blob/main/CHANGELOG.md
- File issues: https://github.com/ai5labs/relay-llm/issues

## Honest scope

The hosted gateway / BYOK proxy at `relay.ai5labs.com` does **not** exist yet. The waitlist on the homepage is for that future product. The OSS library shipping today is fully usable via `pip install ai5labs-relay` — you do not need to wait for anything.
