InferWeave is a lightweight inference gateway that routes requests across LLM providers. Swap models without changing code. Add fallback chains in one line. Drop-in compatible with the OpenAI SDK.
MIT Licensed · Changelog · Launching Q3 2026
from openai import OpenAI

# Just change the base URL. That's it.
client = OpenAI(
    base_url="https://api.inferweave.cloud/v1",
    api_key="iw_sk_..."  # your InferWeave key
)

response = client.chat.completions.create(
    model="gpt-4o",  # or "claude-4-sonnet", "gemini-2.5-pro"
    messages=[{"role": "user", "content": "Hello"}],
    # InferWeave-specific: automatic fallback
    extra_body={
        "fallback": ["claude-4-sonnet", "gemini-2.5-pro"],
        "budget": "$0.02"  # max cost per request
    }
)
Not another wrapper. InferWeave runs as a stateless proxy with sub-5ms overhead. Configure routing rules, deploy, and forget.
Define ordered fallback lists per request. If GPT-4o returns a 5xx or times out, the request automatically retries against your next provider. Configurable via the fallback parameter or in your inferweave.yaml.
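Because retries happen inside the gateway, client code should only see an error once the whole chain is exhausted. Below is a minimal sketch of guarding against that case; it assumes InferWeave surfaces the final provider's failure through the standard OpenAI SDK exceptions, which isn't confirmed above.

import os
import openai
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferweave.cloud/v1",
    api_key=os.environ["INFERWEAVE_KEY"]
)

try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
        extra_body={"fallback": ["claude-4-sonnet", "gemini-2.5-pro"]}
    )
    print(response.choices[0].message.content)
except (openai.APIStatusError, openai.APITimeoutError) as err:
    # Assumption: raised only after every model in the chain has failed
    print(f"All providers in the chain failed: {err}")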
Set a per-request budget with "budget": "$0.01". InferWeave picks the cheapest model that meets your latency and quality constraints. Route simple classification tasks to Haiku, complex reasoning to Opus.
Every request is logged with the model used, latency (TTFB and total), input/output tokens, and cost. Export via OpenTelemetry or query the built-in dashboard. No sampling: every request, every field.
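The same numbers are available in application code: any OpenAI-compatible response already carries the served model and token counts. A minimal sketch, assuming only those standard SDK fields; the x-inferweave-cost header it also reads is hypothetical and not documented above.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferweave.cloud/v1",
    api_key=os.environ["INFERWEAVE_KEY"]
)

# with_raw_response exposes headers alongside the parsed completion
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)
response = raw.parse()

print(response.model)                    # model actually served
print(response.usage.prompt_tokens)      # input tokens
print(response.usage.completion_tokens)  # output tokens
print(raw.headers.get("x-inferweave-cost"))  # hypothetical cost header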
Swap base_url to api.inferweave.cloud/v1 and you're done. Works with the official OpenAI Python/Node SDKs, LangChain, LlamaIndex, and any client that speaks the OpenAI chat completions format.
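As an example of the drop-in claim, here is a minimal LangChain sketch; it assumes the langchain-openai package and an INFERWEAVE_KEY environment variable, neither of which appears elsewhere on this page.

import os
from langchain_openai import ChatOpenAI

# Point LangChain's OpenAI chat model at the InferWeave endpoint
llm = ChatOpenAI(
    model="gpt-4o",
    base_url="https://api.inferweave.cloud/v1",
    api_key=os.environ["INFERWEAVE_KEY"]
)

print(llm.invoke("Hello").content)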
Configuration lives in a single YAML file. No vendor lock-in, no proprietary SDK.
Create an inferweave.yaml in your project root. Declare models, fallback chains, budgets, and retry policies.
Run infer deploy to push your config. Your routing rules are live in under 2 seconds, globally distributed across edge nodes.
Change your base_url to our endpoint. Existing code works unchanged. Monitor everything from the dashboard or via OTLP export.
# inferweave.yaml
version: 1

models:
  primary: gpt-4o
  fallback:
    - claude-4-sonnet
    - gemini-2.5-pro

routing:
  strategy: cost-optimized
  max_budget_per_request: $0.03
  timeout_ms: 30000
  retries: 2

observability:
  export: otlp
  endpoint: https://otel.yourinfra.dev
  sample_rate: 1.0  # 100% of requests

keys:
  openai: ${OPENAI_API_KEY}
  anthropic: ${ANTHROPIC_API_KEY}
  google: ${GOOGLE_API_KEY}
Two common patterns: streaming with fallback, and cost-aware model selection via the routing config.
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.inferweave.cloud/v1",
  apiKey: process.env.INFERWEAVE_KEY,
});

// Example prompt (placeholder)
const prompt = "Explain what an inference gateway does.";

const stream = await client.chat.completions.create({
  model: "claude-4-sonnet",
  stream: true,
  messages: [
    { role: "user", content: prompt }
  ],
  // If Claude is down, fall back to GPT-4o
  extra_body: {
    fallback: ["gpt-4o", "gemini-2.5-pro"]
  }
});

for await (const chunk of stream) {
  process.stdout.write(
    chunk.choices[0]?.delta?.content ?? ""
  );
}
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.inferweave.cloud/v1",
    api_key=os.environ["INFERWEAVE_KEY"]
)

# Example support ticket to classify (placeholder)
ticket = "My invoice was charged twice this month."

# Let InferWeave pick the cheapest model
# that can handle classification
response = client.chat.completions.create(
    model="auto",  # cost-optimized selection
    messages=[{
        "role": "user",
        "content": f"Classify: {ticket}"
    }],
    extra_body={
        "budget": "$0.002",
        "prefer": "low-latency"
    }
)

# Response includes which model was used
print(response.model)
# => "claude-4-haiku" (cheapest that fit)
You pay for InferWeave routing + your underlying model costs (passed through at cost, no markup).