Multi-Model API Cost Governance with the Inference Router

Jun 23, 2026 07:00 AM - 1 day ago 1263

Introduction

An conclusion router is middleware that sits betwixt your exertion and your model-serving layer, directing each LLM API telephone to a exemplary due for the task alternatively than sending everything to the aforesaid endpoint. The problem it solves is simply a billing artifact astir SaaS backends create by default: erstwhile a azygous frontier exemplary handles each request, elemental classification calls astatine 94 input tokens and analyzable reasoning calls astatine 3,411 output tokens salary rates group for the harder task. The inexpensive task subsidizes the costly 1 successful the incorrect direction.

The DigitalOcean Inference Router, portion of the Inference Engine launched successful April 2026, is presently successful nationalist preview.

This tutorial builds a moving router pinch 3 task policies crossed a SaaS support backend: a low-cost classifier path, a quality-sensitive customer Q&A path, and a reasoning path. By the end, you will person a router invocable complete the modular OpenAI chat completions endpoint, per-request costs signals readable from the consequence header, and a convention pinning shape for the Q&A way that keeps KV-cache lukewarm crossed multi-turn conversations.

Key Takeaways

  • A SaaS backend that routes each petition done 1 frontier exemplary pays frontier rates connected classification calls that a 20-billion-parameter open-source exemplary handles arsenic well, astatine up to 36 times the per-request costs based connected confirmed unrecorded runs.
  • The DigitalOcean Inference Router uses a mixture-of-experts (MoE) classifier to lucifer each incoming punctual against the natural-language task descriptions you configure. A action argumentation connected the matched task past picks a circumstantial exemplary from that task’s pool. There is nary ordered norm evaluation.
  • Two abstracted credentials are required and some must beryllium to the aforesaid DigitalOcean team: a Personal Access Token ($DIGITALOCEAN_TOKEN) for the power level that creates and manages routers, and a Model Access Key ($MODEL_ACCESS_KEY) pinch an sk-do- prefix for conclusion invocation. A squad mismatch causes invocation nonaccomplishment moreover erstwhile some credentials are individually valid.
  • Manual Ranking requires nary selection_policy section successful the create body. List bid successful models[] is the ranking expression, confirmed via unrecorded API telephone connected June 16, 2026.
  • During nationalist preview, the Router itself is free. You salary only for the models that service each request. Router invocation adds a small, sub-second routing overhead per request.

Prerequisites

Before pursuing this tutorial, you need:

  • A DigitalOcean relationship astatine Tier 3 aliases higher. Claude Sonnet 4.6 and GPT-5 are commercialized models that require Tier 3+ access. Tier 1 and Tier 2 accounts are constricted to open-source models. Check your tier and limits.
  • A DigitalOcean Personal Access Token (PAT) pinch constitute entree to the Inference API, stored arsenic $DIGITALOCEAN_TOKEN. This credential is for the power level only: creating and managing routers. Do not usage it for conclusion calls.
  • A Model Access Key (MAK) from the Inference console, stored arsenic $MODEL_ACCESS_KEY. This credential carries an sk-do- prefix and is utilized exclusively for conclusion invocation.
  • Both the PAT and the MAK must beryllium to the aforesaid DigitalOcean team. A squad mismatch causes invocation failures sloppy of really the router is configured. This is the astir communal setup correction erstwhile moving crossed aggregate DO accounts.
  • curl for the control-plane calls successful this tutorial, and Python pinch the openai package (pip instal openai) for the invocation examples.
  • Familiarity pinch the OpenAI chat completions API format.

Model slugs utilized successful this tutorial: openai-gpt-oss-20b, llama3.3-70b-instruct, anthropic-claude-4.6-sonnet, openai-gpt-5. Verify existent readiness successful the model catalog.

What Is an Inference Router?

An conclusion router is simply a middleware constituent that receives LLM API requests and directs each 1 to an due exemplary based connected the task type and configured action policy, without immoderate routing logic successful your exertion code. Your exertion sends each petition to a azygous endpoint pinch "model": "router:<your-router-name>", and the router handles dispatch.

How the Inference Router Fits into the DigitalOcean Inference Engine

The Inference Engine bundles serverless inference, dedicated inference, and the Inference Router nether a unified API surface. The Router is the furniture that governs which exemplary wrong the Inference Engine serves each incoming request. It is invoked astatine the conclusion endpoint (https://inference.do-ai.run/v1/chat/completions) utilizing the modular OpenAI chat completions format. The Inference Router how-to documents the afloat parameter group for router creation and management. The available models list shows existent exemplary slugs and which tiers tin entree them.

Serverless Inference vs. Dedicated Inference: What Gets Routed and Why

Serverless conclusion runs connected shared infrastructure pinch per-request billing and automatic scaling, including scale-to-zero. Dedicated conclusion runs connected reserved compute pinch predictable latency and fixed capacity per endpoint. The Inference Router tin target both. For workloads requiring guaranteed capacity and accordant latency SLAs, the Speed Optimization aliases Manual Ranking action policies fto you designate a dedicated exemplary arsenic the first prime successful a task pool. See serverless vs. dedicated vs. batch inference for the capacity and costs tradeoffs betwixt deployment types.

How Inference Routing Works

The Inference Router is simply a semantic router, not a rule-based dispatcher. Each incoming punctual is evaluated by an MoE classifier model, which matches it against the natural-language custom_task.description fields you specify erstwhile creating the router. The matched task’s action argumentation past picks a circumstantial exemplary from that task’s pool. There is nary ordered information and nary “first matching task wins” behavior.

Task Matching and Selection Policies

Task matching is semantic: the MoE classifier sounds the afloat punctual and matches it to the closest task description. The value of your task descriptions determines lucifer accuracy. Write custom_task.description values arsenic descriptive task definitions, not arsenic labels aliases class names.

The Router supports 4 action policies:

Policy API Expression Use When
Cost Efficiency "selection_policy": { "prefer": "cheapest" } Multi-model pool; minimize walk per request
Speed Optimization "selection_policy": { "prefer": "fastest" } Multi-model pool; minimize time-to-first-token
Manual Ranking No selection_policy field. List bid successful models[] is the ranking expression. Quality-sensitive way requiring deterministic exemplary preference
Optimal "task_slug": "<preset>" DO-defined preset task types only; not disposable for civilization tasks

For Manual Ranking, omitting the selection_policy section wholly is the correct API expression. This was confirmed via a unrecorded create telephone connected June 16, 2026: the API echoes models backmost successful database order, and that bid is the policy.

Routing Approaches: Static, Semantic, and Cost-Based

Three chopped approaches picture really conclusion routers dispatch requests:

Static routing dispatches based connected fixed attributes of the request, specified arsenic URL path, a header value, aliases a petition field. It requires nary classifier and adds minimal latency, but it cannot accommodate to punctual content. A billing mobility and a bug study sent to the aforesaid endpoint look identical from a fixed router’s perspective. Use this attack erstwhile workloads are already segmented by exertion logic aliases definitive exemplary parameters.

Semantic routing evaluates the contented of the punctual to find task type earlier dispatching. The DigitalOcean Inference Router uses this attack done its MoE classifier. The classifier adds a small, sub-second routing overhead, but it enables dispatch decisions based connected existent task contented alternatively than a proxy signal. Research from EMNLP 2024 recovered that prompt-content-based routing tin amended query ratio by astir 40%, trim costs by 30%, and amended output value by 10%, depending connected workload creation (Stripelis et al., TensorOpera Router: A Multi-Model Router for Efficient LLM Inference).

Cost-aware move routing extends semantic routing pinch real-time pricing signals, existent exemplary availability, aliases value scores to prime models dynamically based connected existent conditions alternatively than fixed configuration. This is much analyzable to instrumentality and support but applicable for backends that span aggregate providers aliases request to respond to per-model pricing changes without router reconfiguration.

Open-source alternatives for teams building extracurricular DigitalOcean see the vLLM Semantic Router, which uses signal-driven semantic classification to way requests crossed exemplary pools, and llm-d-router, which provides KV-cache and load-aware routing for self-hosted Kubernetes-based serving stacks.

Request Flow from API Call to Model Response

When your exertion sends a petition to https://inference.do-ai.run/v1/chat/completions pinch "model": "router:cost-governance-demo":

  1. The Inference Router passes the punctual to its MoE classifier.
  2. The classifier matches the punctual against your configured custom_task.description values and returns the closest task.
  3. If a task matches, the task’s action argumentation picks a exemplary from that task’s models[] excavation and forwards the request.
  4. If nary task matches, aliases the matched exemplary is unavailable, the petition goes to fallback_models[].
  5. The consequence returns successful modular OpenAI chat completions format, pinch the summation of the x-model-router-selected-route consequence header identifying which task matched.

Request travel sketch showing really an incoming API telephone passes done the MoE classifier, matches a task policy, applies a action argumentation to prime a exemplary from the pool, and returns a consequence successful OpenAI format pinch the x-model-router-selected-route header. Unmatched requests way to fallback_models..png)

The exemplary section successful the consequence assemblage shows which exemplary served the request. These 2 fields together springiness you per-request routing attribution without opening the Analyze dashboard.

The Cost Problem This Solves

As benchmarked successful Metrics that Matter pinch Serverless Inference, the costs of a azygous completed reply swings astir 230 times crossed the exemplary catalog, driven almost wholly by exemplary choice, not supplier pricing differences.

The generalization taxation is what you salary erstwhile a backend uses 1 frontier exemplary for each tasks. At the token counts confirmed successful the unrecorded runs for this tutorial, a azygous classification telephone (94 successful / 80 out) costs $0.00004070 connected openai-gpt-oss-20b. The aforesaid telephone sent to Claude Sonnet 4.6 costs $0.00148200, a 36x premium. Sent to GPT-5, it costs $0.00091750, a 22.5x premium. Those multiples use to each classification petition successful your postulation volume.

At 700,000 classification requests per month, the quality betwixt routing categorize postulation to openai-gpt-oss-20b and hardcoding Claude Sonnet 4.6 is $28.49 vs. $1,037.40 per month. The routing architecture beneath captures that redeeming without changing a statement of exertion code.

Architecture: Three Paths, Three Model Tiers

This tutorial implements a three-path router reflecting the task-complexity building of a SaaS support backend.

Classifier path. Incoming support tickets are first classified into categories: billing, bug, how-to, aliases account. This is simply a short-input, categorical-output task. openai-gpt-oss-20b is the superior exemplary because it is the cheapest action successful the excavation and produces correct categorical labels connected this task type. llama3.3-70b-instruct is the fallback. Selection policy: prefer: cheapest.

Customer Q&A path. Multi-turn, user-facing questions require accordant value crossed sessions and reliable behaviour connected domain-specific content. anthropic-claude-4.6-sonnet is the superior model, llama3.3-70b-instruct is the fallback. Selection policy: Manual Ranking (Sonnet listed first, selection_policy section omitted). The reasoning for Manual Ranking complete prefer: fastest is that the fastest exemplary successful the excavation could beryllium the weaker 1 connected immoderate fixed run, and this way is quality-sensitive. Manual Ranking gives deterministic Sonnet-unless-unavailable behavior.

Reasoning path. Complex multi-step reasoning tasks warrant GPT-5’s per-request costs because the task worth is high. Note that GPT-5’s input complaint ($1.25/M) is little than Claude Sonnet 4.6’s ($3.00/M). The reasoning way costs much because it generates substantially much output tokens (3,411 successful the confirmed unrecorded tally vs. 292 for Q&A), and output tokens predominate costs connected reasoning paths sloppy of header rate.

Router fallback. llama3.3-70b-instruct catches prompts that the MoE classifier does not lucifer to immoderate configured task. This prevents unmatched requests from returning an correction and routes them to a tin open-source exemplary astatine debased cost.

 categorize to openai-gpt-oss-20b astatine $0.05/$0.45 per cardinal tokens, customer-qa to anthropic-claude-4.6-sonnet astatine $3.00/$15.00, and reasoning to openai-gpt-5 astatine $1.25/$10.00, pinch llama3.3-70b-instruct arsenic the router-level fallback for unmatched requests.

Before configuring task tiers for your ain backend, measurement your existent telephone distribution. A backend wherever 80% of postulation is reasoning requests will spot a different costs floor plan than 1 wherever 80% is classification. Run a value information connected your task categories utilizing Router Evaluation successful the Playground to corroborate that cheaper models nutrient acceptable output earlier routing unrecorded postulation to them.

Setting Up the Inference Router

Creating the Router via the API

Send a POST petition to the control-plane endpoint utilizing your PAT. The petition beneath creates each 3 task policies successful a azygous call:

curl -s -X POST "https://api.digitalocean.com/v2/gen-ai/models/routers" \ -H "Authorization: Bearer $DIGITALOCEAN_TOKEN" \ -H "Content-Type: application/json" \ -d '{ "name": "cost-governance-demo", "description": "Three-path router for SaaS support: classify, Q-and-A, reasoning", "policies": [ { "custom_task": { "name": "classify", "description": "Classify a customer support connection into precisely 1 of the pursuing categories: billing, bug, how-to, aliases account. The connection is simply a short matter from a support ticket." }, "models": ["openai-gpt-oss-20b", "llama3.3-70b-instruct"], "selection_policy": { "prefer": "cheapest" } }, { "custom_task": { "name": "customer-qa", "description": "Answer a customer-facing mobility astir the product, relationship settings, subscription plans, aliases work behavior. The mobility whitethorn beryllium portion of a multi-turn speech pinch a user." }, "models": ["anthropic-claude-4.6-sonnet", "llama3.3-70b-instruct"] }, { "custom_task": { "name": "reasoning", "description": "Perform analyzable multi-step reasoning, mathematical analysis, aliases architectural information that requires tracing done aggregate steps and producing a elaborate mentation pinch intermediate conclusions." }, "models": ["openai-gpt-5"] } ], "fallback_models": ["llama3.3-70b-instruct"] }'

Three notes connected this body:

The customer-qa argumentation has nary selection_policy field. That is correct and intentional. Omitting the section activates Manual Ranking: the Router tries models successful database order. Claude Sonnet 4.6 is listed first and is tried first.

The custom_task.description values are inputs to the MoE classifier model. Write them arsenic task definitions, not show labels. Vague descriptions for illustration “general questions” overlap pinch astir incoming prompts and degrade routing accuracy.

The PAT ($DIGITALOCEAN_TOKEN) goes successful this request. The MAK ($MODEL_ACCESS_KEY) is not utilized here.

Output

{ "model_router": { "uuid": "11f16981-ed77-8bd0-aee4-4e013e2ddde4", "name": "cost-governance-demo", "description": "Three-path router for SaaS support: classify, Q-and-A, reasoning", "regions": ["all"], "config": { "policies": [ { "custom_task": { "name": "classify", "description": "Classify a customer support connection into precisely 1 of the pursuing categories: billing, bug, how-to, aliases account. The connection is simply a short matter from a support ticket." }, "models": ["openai-gpt-oss-20b", "llama3.3-70b-instruct"], "selection_policy": { "prefer": "cheapest" } }, { "custom_task": { "name": "customer-qa", "description": "Answer a customer-facing mobility astir the product, relationship settings, subscription plans, aliases work behavior. The mobility whitethorn beryllium portion of a multi-turn speech pinch a user." }, "models": ["anthropic-claude-4.6-sonnet", "llama3.3-70b-instruct"] }, { "custom_task": { "name": "reasoning", "description": "Perform analyzable multi-step reasoning, mathematical analysis, aliases architectural information that requires tracing done aggregate steps and producing a elaborate mentation pinch intermediate conclusions." }, "models": ["openai-gpt-5"] } ], "fallback_models": ["llama3.3-70b-instruct"] }, "created_at": "2026-06-16T12:50:18Z", "updated_at": "2026-06-16T12:50:18Z" } }

Confirm 3 things successful the response: uuid is coming (save it if you request it for API-based edits aliases cleanup later), the customer-qa argumentation assemblage shows nary selection_policy field, and fallback_models contains llama3.3-70b-instruct. If you person an HTTP 400 pinch "model router sanction already exists", the sanction cost-governance-demo is already taken successful your team. Use a different name.

Creating the Router via the Control Panel

The DigitalOcean console provides a ocular router creation travel astatine AI/ML > Inference > My Routers > Create Router. For each task policy, you specify the task name, task description, exemplary pool, and action argumentation successful shape fields. The power sheet does not require curl aliases a PAT.

For the Manual Ranking argumentation connected the Q&A path, time off the action argumentation dropdown unset. Model bid successful the excavation database determines ranking. The router is instantly invocable by sanction aft creation.

Routers created done either the API aliases the console are editable afterward from the My Routers paper utilizing Edit Router, up to 3 models per task pool.

Verifying Routing Behavior

To corroborate the router was created and is queryable, retrieve it by the UUID returned successful the create response:

curl -s "https://api.digitalocean.com/v2/gen-ai/models/routers/11f16981-ed77-8bd0-aee4-4e013e2ddde4" \ -H "Authorization: Bearer $DIGITALOCEAN_TOKEN"

Output

{ "model_router": { "name": "cost-governance-demo", "config": { "policies": [ { "custom_task": { "name": "classify" }, "models": ["openai-gpt-oss-20b", "llama3.3-70b-instruct"], "selection_policy": { "prefer": "cheapest" } }, { "custom_task": { "name": "customer-qa" }, "models": ["anthropic-claude-4.6-sonnet", "llama3.3-70b-instruct"] }, { "custom_task": { "name": "reasoning" }, "models": ["openai-gpt-5"] } ], "fallback_models": ["llama3.3-70b-instruct"] } } }

A 200 consequence pinch the afloat configuration confirms the router is registered and queryable. If you person a 404, the UUID does not lucifer a router successful your team, aliases the PAT belongs to a different squad than the 1 wherever the router was created.

Invoking the Router from Your Backend

All conclusion calls usage the MAK ($MODEL_ACCESS_KEY), not the PAT. The exemplary section is "model": "router:<your-router-name>".

The 3 requests beneath show 1 punctual per task way utilizing curl.

Classifier path:

curl -s "https://inference.do-ai.run/v1/chat/completions" \ -H "Authorization: Bearer $MODEL_ACCESS_KEY" \ -H "Content-Type: application/json" \ -d '{ "model": "router:cost-governance-demo", "messages": [ { "role": "user", "content": "Classify this support connection into 1 of: billing, bug, how-to, account. Message: I was charged doubly this month." } ] }'

Output

{ "id": "chatcmpl-...", "object": "chat.completion", "model": "openai-gpt-oss-20b", "choices": [ { "message": { "role": "assistant", "content": "billing" }, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 94, "completion_tokens": 80, "total_tokens": 174 } }

"model": "openai-gpt-oss-20b" confirms prefer: cheapest selected the lower-cost model. Token counts of 94 successful / 80 retired lucifer the verified unrecorded tally from June 16, 2026. The completion_tokens: 80 count for a one-word reply is expected: openai-gpt-oss-20b is simply a reasoning-style unfastened exemplary that emits soul reasoning tokens counted arsenic completion tokens earlier outputting the visible label. The reply "billing" is the visible output; the remaining tokens are the model’s soul process.

Customer Q&A path:

curl -s "https://inference.do-ai.run/v1/chat/completions" \ -H "Authorization: Bearer $MODEL_ACCESS_KEY" \ -H "Content-Type: application/json" \ -d '{ "model": "router:cost-governance-demo", "messages": [ { "role": "user", "content": "How do I reset my password if I nary longer person entree to my registered email?" } ], "max_completion_tokens": 512 }'

Output

{ "id": "chatcmpl-...", "object": "chat.completion", "model": "anthropic-claude-4.6-sonnet", "choices": [ { "message": { "role": "assistant", "content": "To reset your password without entree to your registered email, please interaction our support squad directly..." }, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 24, "completion_tokens": 292, "total_tokens": 316 } }

"model": "anthropic-claude-4.6-sonnet" confirms Manual Ranking sent the petition to the first-listed model. Token counts 24 successful / 292 retired lucifer the June 16 verified run.

Reasoning path:

curl -s "https://inference.do-ai.run/v1/chat/completions" \ -H "Authorization: Bearer $MODEL_ACCESS_KEY" \ -H "Content-Type: application/json" \ -d '{ "model": "router:cost-governance-demo", "messages": [ { "role": "user", "content": "A work has 3 limitations pinch 99.9%, 99.95%, and 99.99% uptime. Walk done the mathematics for the mixed availability, past explicate really adding a redundant lawsuit of the weakest dependency changes it." } ], "max_completion_tokens": 4096 }'

Output

{ "id": "chatcmpl-...", "object": "chat.completion", "model": "openai-gpt-5", "choices": [ { "message": { "role": "assistant", "content": "Combined readiness is calculated by multiplying the individual uptimes: 0.999 × 0.9995 × 0.9999 = 0.99840..." }, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 53, "completion_tokens": 3411, "total_tokens": 3464 } }

"model": "openai-gpt-5" confirms the reasoning task matched and GPT-5 served it. The token count 53 successful / 3,411 retired matches the June 16 verified run. The max_completion_tokens present is group to 4096. Setting it to 1024 connected this aforesaid reasoning punctual returned "content": null pinch "finish_reason": "length" successful the confirmed unrecorded test: GPT-5 consumed the afloat fund connected soul reasoning steps and had nary tokens near for the visible answer. Set a minimum fund of 4,096 tokens connected immoderate reasoning path, and set upward for longer expected outputs.

GPT-5 useful done the Inference Router complete /v1/chat/completions without immoderate typical handling connected your end. The Router manages the dispatch transparently. However, GPT-5 whitethorn reply correctly while declining to expose soul reasoning steps successful the consequence body. Do not constitute exertion logic that depends connected chain-of-thought output being present.

The Python balanced for the classifier way utilizing the OpenAI SDK:

import os from openai import OpenAI client = OpenAI( api_key=os.environ["MODEL_ACCESS_KEY"], base_url="https://inference.do-ai.run/v1/" ) response = client.chat.completions.create( model="router:cost-governance-demo", messages=[ { "role": "user", "content": "Classify this support connection into 1 of: billing, bug, how-to, account. Message: I was charged doubly this month." } ] ) print(response.choices[0].message.content) print(f"Served by: {response.model}")

Output

billing Served by: openai-gpt-oss-20b

response.model shows which exemplary served the request. This is your per-request costs audit awesome without opening the dashboard.

Reading the Cost Signal per Request

The x-model-router-selected-route consequence header identifies which task argumentation matched. response.model identifies which exemplary served. Together, they springiness you per-request routing attribution.

To publication the consequence header successful Python, usage .with_raw_response:

import os from openai import OpenAI client = OpenAI( api_key=os.environ["MODEL_ACCESS_KEY"], base_url="https://inference.do-ai.run/v1/" ) raw = client.chat.completions.with_raw_response.create( model="router:cost-governance-demo", messages=[ { "role": "user", "content": "Classify this support connection into 1 of: billing, bug, how-to, account. Message: I was charged doubly this month." } ] ) response = raw.parse() matched_task = raw.headers.get("x-model-router-selected-route") serving_model = response.model usage = response.usage print(f"Matched task: {matched_task}") print(f"Served by: {serving_model}") print(f"Tokens in/out: {usage.prompt_tokens} / {usage.completion_tokens}")

Output

Matched task: classify Served by: openai-gpt-oss-20b Tokens in/out: 94 / 80

x-model-router-selected-route: categorize confirms the punctual matched the classifier task, not the fallback. If this header returns fallback alternatively of a named task, the punctual did not lucifer immoderate configured task description. That is the superior awesome for routing mismatches and triggers the debugging steps successful the Observability section.

Multi-Model API Cost Governance

Mapping Tasks to Cost Outcomes

Per-request costs for each path, utilizing token counts from the June 16, 2026 unrecorded runs and prices verified successful May 2026:

Path Model Tokens (in/out) Per-request cost
Classify openai-gpt-oss-20b 94 / 80 $0.00004070
Customer Q&A anthropic-claude-4.6-sonnet 24 / 292 $0.00445200
Reasoning openai-gpt-5 53 / 3,411 $0.03417625

Cost arithmetic (verified):

  • Classify: (94 × $0.05 + 80 × $0.45) / 1,000,000 = $0.00004070
  • Q&A: (24 × $3.00 + 292 × $15.00) / 1,000,000 = $0.00445200
  • Reasoning: (53 × $1.25 + 3,411 × $10.00) / 1,000,000 = $0.03417625

All prices are taxable to change. Current pricing astatine https://docs.digitalocean.com/products/inference/details/pricing/.

Using Model Tiers to Control Inference Spend

The reasoning way costs astir 840 times much per petition than the classifier way ($0.03417625 vs. $0.00004070). Routing requests to due tiers based connected task complexity keeps mean costs proportional to task value.

The GPT-5 costs building is worthy knowing precisely: GPT-5’s input complaint ($1.25/M) is little than Claude Sonnet 4.6’s ($3.00/M). The reasoning way costs much than the Q&A way because reasoning generates astir 12 times much output tokens (3,411 vs. 292 successful the unrecorded runs), and GPT-5’s output complaint ($10.00/M) applies to those tokens. Cost connected reasoning paths is dominated by output token volume. This is the aforesaid move documented successful the 230x cost-per-answer dispersed successful the Metrics that Matter pinch Serverless Inference benchmark: reasoning models make reasoning tokens billed arsenic output earlier the visible reply begins.

Cost Comparison Table: Routed vs. Hardcoded Frontier

This array uses a postulation divided of 700,000 categorize requests, 250,000 Q&A requests, and 50,000 reasoning requests per month, based connected the per-request costs confirmed successful the June 16, 2026 unrecorded runs:

Configuration Classify/month Q&A/month Reasoning/month Total/month Routing saves
Routed (tiered) $28.49 $1,113.00 $1,708.81 $2,850.30 baseline
Hardcode Claude Sonnet 4.6 $1,037.40 $1,113.00 $2,566.20 $4,716.60 39.6% ($1,866.30)
Hardcode Claude Opus 4.7 $1,729.00 $1,855.00 $4,277.00 $7,861.00 63.7% ($5,010.70)
Hardcode GPT-5 $642.25 $737.50 $1,708.81 $3,088.56 7.7% ($238.26)

Against a Claude Sonnet 4.6 baseline, routing saves 39.6% per period ($1,866.30 astatine this postulation volume). Against Claude Opus 4.7, savings are 63.7% ($5,010.70). Against GPT-5 arsenic the hardcoded model, savings are 7.7% ($238.26), because the Q&A way routed to Sonnet ($1,113.00) costs much than Q&A hardcoded connected GPT-5 ($737.50). The savings against GPT-5 travel almost wholly from the classifier path.

The 39.6% and 63.7% savings are delicate to postulation composition. A backend wherever reasoning postulation constitutes a larger stock of requests will spot smaller savings percentages, because the routing determination connected the reasoning way is simply a single-model pool: the Router adds overhead without offering a cheaper replacement connected that path. The savings travel from the categorize and Q&A tier separation.

Multi-Model Orchestration Patterns

Pattern 1: Complexity-Based Routing

The three-path router built successful this tutorial is complexity-based routing: tasks are segmented by the cognitive complexity of the required output, and each tier uses a exemplary sized for that task. The classifier way uses a 20B-parameter model, the Q&A way uses a frontier chat model, and the reasoning way uses a reasoning-optimized model.

This shape fits backends wherever task types are chopped and separable by punctual content. If your backend handles archive summarization, entity extraction, and codification procreation arsenic chopped telephone types, specify 3 task policies pinch descriptions that bespeak those output types, and the MoE classifier will way accordingly. Keep each task explanation circumstantial to its output format, because the classifier performs amended erstwhile task descriptions picture chopped output types alternatively than overlapping taxable areas.

Pattern 2: Fallback for Availability and Cost Guardrails

The fallback_models section provides a information nett for unmatched prompts. In the router created above, llama3.3-70b-instruct catches immoderate punctual the classifier does not lucifer to a named task. This besides serves arsenic a costs guardrail: unclassified requests way to the open-source fallback alternatively than silently hitting a commercialized frontier model.

The Q&A policy’s Manual Ranking building provides a 2nd fallback furniture wrong the argumentation itself. If Claude Sonnet 4.6 is unavailable, the Router falls backmost to llama3.3-70b-instruct earlier reaching the world fallback_models array. This gives the Q&A way graceful degradation without a full outage.

For backends that request strict per-request costs caps, 1 shape is to adhd a budget-checking furniture successful exertion middleware earlier the router call. If the estimated costs of a reasoning petition exceeds a threshold, the exertion reframes the petition arsenic a Q&A task aliases declines it earlier it reaches the Router. The Router itself does not enforce per-request walk limits.

Pattern 3: Session Pinning and Cache Economics

For multi-turn Q&A sessions, convention pinning keeps consequent requests successful a convention connected the aforesaid exemplary that served the first request. This prevents a speech from switching models mid-session and keeps the KV-cache lukewarm for that session’s prefix, reducing redundant input token processing.

To pin a session, walk the X-Model-Affinity header pinch a unchangeable convention identifier connected each petition successful the session. The first telephone routes usually done the MoE classifier. Subsequent calls pinch the aforesaid convention ID are served by the aforesaid exemplary without re-running the classifier. You verify pinning by reference the x-model-affinity header echoed successful the consequence and confirming that response.model is accordant crossed calls.

The OpenAI Python SDK does not expose this header natively. Use extra_headers to nonstop it and .with_raw_response to publication the echo:

import os from openai import OpenAI client = OpenAI( api_key=os.environ["MODEL_ACCESS_KEY"], base_url="https://inference.do-ai.run/v1/" ) session_id = "session-test-001" # usage a unchangeable identifier per personification convention successful production raw = client.chat.completions.with_raw_response.create( model="router:cost-governance-demo", messages=[ {"role": "user", "content": "How do I update my billing address?"} ], extra_headers={"X-Model-Affinity": session_id} ) response = raw.parse() affinity_echo = raw.headers.get("x-model-affinity") print(f"Served by: {response.model}") print(f"Affinity header echoed: {affinity_echo}") print(f"Tokens: {response.usage.prompt_tokens} successful / {response.usage.completion_tokens} out")

Output

Served by: anthropic-claude-4.6-sonnet Affinity header echoed: session-test-001 Tokens: 15 successful / 245 out

x-model-affinity: session-test-001 echoed successful the consequence confirms the Router accepted the convention identifier. Sending a 2nd petition pinch the aforesaid session_id returns the aforesaid response.model worth without routing overhead. If the echo is absent, the header was dropped aliases the endpoint does not support affinity for the matched model.

Prompt caching connected Anthropic models yields important input-cost savings connected sessions pinch agelong repeated prefixes, since cache sounds are billed astatine astir 10% of the modular input price. For OpenAI models connected DO, automatic punctual caching applies to prompts of 1,024 tokens aliases much astatine 50% disconnected the input price. Open-source models connected DO do not yet support punctual caching. Routing a convention to a different exemplary resets the cache, truthful convention pinning and cache economics are coupled: pinning is the system that keeps the cache active.

Observability and Debugging

The Analyze Dashboard

The Analyze dashboard for the Inference Router is accessible astatine AI/ML > Inference > Analyze. It shows exemplary lucifer complaint and fallback complaint crossed your router’s traffic. The features reference documents the disposable metrics. The serverless conclusion metrics reference covers per-request costs and latency signals.

 categorize 76.4%, customer-qa 22.8%, reasoning 0.8%. Model distribution shows openai-gpt-oss-20b serving the mostly of traffic.

Model lucifer complaint is the percent of requests that matched a named task policy. A complaint beneath 90% usually intends task descriptions are excessively generic aliases overlap significantly. Fallback complaint is the percent of requests that matched nary task and went to fallback_models. A precocious fallback complaint intends incoming prompts are not represented successful your task configuration.

The Playground’s Router Evaluation tab provides LLM-as-a-Judge scoring connected Completeness, Correctness, Tokens Used, and Latency. Use it to corroborate that cheaper-model routing connected the classifier way does not degrade output value comparative to a single-model baseline earlier routing unrecorded traffic.

Validating Task Match Quality

To cheque whether a circumstantial punctual routes to the expected task, nonstop it and inspect x-model-router-selected-route:

import os from openai import OpenAI client = OpenAI( api_key=os.environ["MODEL_ACCESS_KEY"], base_url="https://inference.do-ai.run/v1/" ) raw = client.chat.completions.with_raw_response.create( model="router:cost-governance-demo", messages=[{"role": "user", "content": "My invoice shows a copy charge"}] ) matched_task = raw.headers.get("x-model-router-selected-route") print(f"Matched task: {matched_task}")

Output

Matched task: classify

classify is the expected lucifer for a billing complaint. If the header returns customer-qa aliases fallback instead, the task descriptions for categorize and customer-qa overlap excessively much. The classifier matched the Q&A explanation alternatively of the categorize one, aliases it recovered nary lucifer astatine all. Make the categorize explanation much circumstantial to categorical output and the Q&A explanation much circumstantial to prose reply output, past re-test pinch the aforesaid prompt.

Common Misconfigurations

Task descriptions that are excessively generic. A explanation for illustration “answer questions astir the product” overlaps pinch astir each punctual successful a support backend. Write descriptions that bespeak the circumstantial output type: “Answer a customer-facing mobility astir relationship settings, subscription plans, aliases billing history, wherever the consequence requires retrieving aliases explaining account-specific information.” Specificity to the output format helps the classifier separate tasks reliably.

Task descriptions that overlap. If categorize and customer-qa descriptions are semantically similar, the classifier will way ambiguous prompts inconsistently betwixt them. Classify produces a class label; customer-qa produces a prose answer. Describe those different output types successful the description, not conscionable the taxable area.

Missing fallback_models. A router pinch nary fallback_models returns an correction for unmatched prompts. Configure astatine slightest 1 fallback model, preferably a tin open-source exemplary pinch debased per-request cost.

Pool size exceeding the limit. Each task excavation holds a maximum of 3 models. The Router will cull create aliases edit requests that transcend this limit. See limits and quotas for existent constraints.

Comparing Inference Routing Approaches

Rule-Based Routing

Rule-based routing dispatches requests based connected structural properties: URL path, a header value, a JSON field, aliases the worth of the exemplary parameter itself. It requires nary classifier, adds nary latency from contented evaluation, and is wholly deterministic. The limitation is that it cannot accommodate to punctual content. A petition pinch routing logic baked into the URL way cannot alteration behaviour erstwhile the punctual contented changes, because the dispatch determination was made earlier the punctual was read.

Rule-based routing is the correct starting constituent erstwhile workloads are already segmented by exertion logic, specified arsenic abstracted endpoints for different task types aliases an definitive task parameter passed by the client.

Semantic Routing

Semantic routing evaluates punctual contented to find task type earlier dispatching. The DigitalOcean Inference Router implements semantic routing done its MoE classifier. The classifier adds a small, sub-second routing overhead per request, but it enables dispatch decisions based connected existent task contented alternatively than a proxy signal, which intends routing is transparent to the exertion and does not require changes erstwhile task patterns evolve.

Research from EMNLP 2024 connected the TensorOpera Router recovered that prompt-content-based routing tin execute a 40% betterment successful query efficiency, a 30% costs reduction, and a 10% value summation comparative to single-model deployment astatine akin costs (Stripelis et al., TensorOpera Router: A Multi-Model Router for Efficient LLM Inference).

Cost-Aware Dynamic Routing

Cost-aware move routing extends semantic routing pinch real-time signals: unrecorded pricing, existent exemplary latency measurements, aliases per-model value scores from a continuous information pipeline. The router selects the optimal exemplary astatine invocation clip based connected existent conditions, not fixed configuration. This is much elastic but importantly much analyzable to instrumentality and operate.

For teams moving astatine ample standard crossed aggregate providers, cost-aware move routing tin retrieve savings that semantic routing misses erstwhile pricing changes aliases exemplary readiness varies. For astir SaaS backends connected a azygous provider, semantic routing pinch well-tuned task descriptions achieves the mostly of the costs use astatine overmuch little operational complexity.

Decision Table: Choosing the Right Strategy

Workload Recommended Strategy Reason
Tasks already segmented by endpoint aliases petition field Rule-based No classifier overhead; afloat deterministic
Mixed-complexity tasks done a azygous endpoint Semantic (DO Inference Router) Per-prompt dispatch; nary app codification changes required
Multi-provider, high-volume, price-sensitive backend Cost-aware dynamic Adapts to real-time pricing and readiness changes
Single-model workload, nary meaningful task variation No router Router adds overhead without routing benefit
Compliance situation requiring auditable exemplary selection Rule-based aliases nary router Semantic routing is probabilistic, not afloat explainable

When to Use This Pattern and When Not To

Use the Inference Router when:

Your backend handles requests pinch chopped task-complexity tiers that disagree meaningfully successful exemplary requirements. A backend wherever 70% of requests are classification calls and 10% are analyzable reasoning tasks is simply a beardown fit. Cost savings standard pinch the proportionality of cheap-task traffic, and the Router captures them without modifying exertion code.

Your backend runs agentic pipelines pinch aggregate sequential exemplary calls astatine different complexity levels. Each telephone successful the pipeline routes independently based connected its punctual content, and convention pinning keeps multi-turn supplier sessions connected the aforesaid exemplary for cache consistency. For a deeper look astatine agentic workload patterns, spot inference routing and exemplary task matching.

You want costs governance without routing logic successful your exertion layer. The Router handles dispatch; your exertion sees a azygous endpoint and a accordant consequence format.

Do not usage the Inference Router when:

Your workload is azygous successful task complexity. A backend wherever each petition requires the aforesaid exemplary provides nary routing benefit, and the sub-second overhead adds latency pinch nary costs savings.

You are successful a compliance aliases audit situation that requires deterministic, auditable exemplary action connected each request. Semantic routing is probabilistic: classification accuracy is precocious but not guaranteed, and the dispatched exemplary depends connected punctual contented successful ways that are not afloat explainable from the petition alone.

The models you request are extracurricular DigitalOcean’s catalog. The Inference Router routes to models disposable wrong the Inference Engine only. For workloads that span aggregate providers, a self-hosted semantic router aliases cost-aware move router is the due architecture.

You request a break-even study betwixt serverless and dedicated conclusion earlier committing to a routing architecture. See dedicated vs. serverless conclusion astatine scale for the capacity and costs crossover points. The information privateness implications of conclusion routing are covered successful the data privateness documentation.

Troubleshooting

Invocation fails pinch an authentication correction contempt individually valid credentials.

The MAK ($MODEL_ACCESS_KEY) and PAT ($DIGITALOCEAN_TOKEN) beryllium to different DigitalOcean teams. Both must beryllium to the aforesaid team, aliases invocation fails sloppy of router configuration. In the pre-draft verification runs for this tutorial, each early probe failures were caused by this mismatch, not by immoderate Router aliases GPT-5 limitation. Check your squad rank successful the DO console. Regenerate credentials nether the correct squad and shop them arsenic the correct situation variables earlier retrying.

A commercialized exemplary (Claude aliases GPT-5) returns an entree correction connected invocation.

Your relationship is beneath Tier 3. Claude Sonnet 4.6 and GPT-5 require Tier 3+ accounts. Tier 1 and Tier 2 accounts are constricted to open-source models. Check your tier and upgrade earlier retrying. Reconfigure the router pinch open-source-only pools if an upgrade is not feasible.

The reasoning way returns "content": null pinch "finish_reason": "length".

The max_completion_tokens fund is excessively small. GPT-5 and different reasoning models walk a information of the token fund connected soul reasoning steps earlier generating the visible answer. Setting max_completion_tokens: 1024 connected the reasoning punctual successful this tutorial returned content: null successful the confirmed unrecorded test. The aforesaid punctual completed successfully astatine max_completion_tokens: 4096. Set a minimum of 4,096 tokens connected immoderate reasoning path, and set upward if your prompts expect longer elaborate outputs.

The router create telephone returns HTTP 400 pinch "model router sanction already exists".

The sanction is already taken successful your team. The afloat correction shows "id": "invalid_argument" and "message": "rpc error: codification = InvalidArgument desc = exemplary router sanction already exists". Choose a different sanction for the caller router.

x-model-router-selected-route consistently returns fallback.

No task explanation matched the incoming prompt. Common causes: task descriptions are excessively generic, descriptions overlap significantly, aliases the punctual type is not represented successful your configuration. Check the Analyze dashboard for fallback complaint trends, past make each task explanation much circumstantial and chopped from the others. Test individual prompts utilizing the .with_raw_response shape from the “Reading the Cost Signal per Request” conception to corroborate which task they lucifer earlier routing unrecorded traffic.

Cleaning Up

To delete the router created successful this tutorial, nonstop a DELETE petition utilizing the UUID from the create response:

curl -s -X DELETE \ "https://api.digitalocean.com/v2/gen-ai/models/routers/<your-router-uuid>" \ -H "Authorization: Bearer $DIGITALOCEAN_TOKEN"

During nationalist preview the Router is free, truthful location is nary billing effect from leaving it running. Delete it erstwhile you nary longer request it to support your router database cleanable earlier the Router reaches GA pricing.

FAQ

What Is the DigitalOcean Inference Router?

An conclusion router is simply a middleware constituent that receives LLM API requests and directs each 1 to an due exemplary based connected the task type and configured action policy. It sits betwixt the customer exertion and the model-serving layer, enabling move exemplary action without changes to exertion code. The DigitalOcean Inference Router uses a semantic MoE classifier to lucifer each punctual against the natural-language task descriptions you specify erstwhile creating the router.

How Does the DigitalOcean Inference Router Differ from a Standard API Gateway?

A modular API gateway handles authentication, complaint limiting, and petition forwarding to a fixed backend. The DigitalOcean Inference Router adds semantic task matching: an MoE classifier sounds each incoming prompt, matches it against your configured task descriptions, and dispatches it to the due exemplary pool. The action argumentation connected that excavation past picks the circumstantial exemplary by cost, speed, aliases definitive ranking, without immoderate routing logic required successful your exertion code.

Can the Inference Router Reduce LLM API Costs?

Yes. By routing simpler tasks to smaller, little costly models and reserving frontier models for high-complexity tasks, the Router reduces mean costs per petition crossed a mixed workload. At a postulation divided of 700,000 categorize requests, 250,000 Q&A requests, and 50,000 reasoning requests per month, the routed configuration costs $2,850.30 vs. $4,716.60 for a Claude Sonnet 4.6 baseline, a 39.6% reduction. A uniform-complexity workload sees minimal savings; backends pinch a precocious stock of classification aliases summarization requests spot worldly costs reduction.

What Task-Matching and Selection Policies Does the DigitalOcean Inference Router Support?

The Router supports 4 action policies: Cost Efficiency (prefer: cheapest), Speed Optimization (prefer: fastest), Manual Ranking (models tried successful the bid you list, nary selection_policy section required), and Optimal (DO-defined preset task types only, not disposable for civilization tasks). Task matching is semantic: an MoE classifier matches each incoming punctual against the natural-language custom_task.description fields you supply erstwhile creating the router. There is nary ordered norm evaluation. Refer to Inference Router how-to guide for the existent afloat database of supported parameters.

What Is the Difference Between Serverless Inference and Dedicated Inference successful the Context of Routing?

Serverless conclusion runs connected shared infrastructure pinch per-request billing and automatic scaling, including scale-to-zero. Dedicated conclusion runs connected reserved compute pinch predictable latency and fixed capacity. The DigitalOcean Inference Router routes requests to some serverless and dedicated conclusion backends. With Speed Optimization aliases Manual Ranking action policies, dedicated models go selectable for workloads requiring guaranteed capacity and accordant latency SLAs. See the Inference documentation for existent configuration options for each deployment type.

Does the Inference Router Support Agentic Workloads?

Yes. DigitalOcean designed the Inference Router arsenic portion of the April 2026 Inference Engine motorboat specifically to support agentic workload scaling. Agentic workloads typically impact aggregate sequential exemplary calls pinch varying complexity, which benefits from per-request routing to appropriately sized exemplary tiers. Use X-Model-Affinity convention pinning to support multi-turn supplier sessions connected the aforesaid exemplary for KV-cache consistency.

How Do I Debug a Router Configuration That Is Not Matching arsenic Expected?

Open the Analyze dashboard and cheque the exemplary lucifer complaint and fallback rate. A precocious fallback complaint intends the Router is not matching incoming prompts to immoderate configured task, usually because custom_task.description fields are excessively generic aliases overlap excessively overmuch pinch each other. Make each task explanation narrower and much distinct, past re-test utilizing the x-model-router-selected-route header connected individual prompts. Also corroborate that the MODEL_ACCESS_KEY utilized for invocation belongs to the aforesaid DO squad arsenic the PAT utilized to create the router; a squad mismatch causes invocation nonaccomplishment sloppy of routing configuration.

Is the DigitalOcean Inference Router Compatible pinch the OpenAI API Format?

Yes. The Router is invoked via the OpenAI-compatible endpoint astatine https://inference.do-ai.run/v1/chat/completions. Set "model": "router:<your-router-name>" successful the petition body. The consequence follows the modular OpenAI chat completions format, pinch the summation of the x-model-router-selected-route header indicating which task argumentation matched. Confirm existent compatibility specifications astatine Inference Router how-to guide.

Does GPT-5 Work Through the Inference Router complete Chat Completions?

Yes, confirmed. Despite GPT-5 requiring the Responses API erstwhile called straight connected OpenAI’s platform, the DigitalOcean Inference Router serves GPT-5 transparently complete /v1/chat/completions. No typical handling is required successful the customer application. GPT-5 whitethorn reply correctly while declining to expose soul reasoning steps; do not constitute exertion logic that depends connected chain-of-thought output being coming successful the consequence body.

What Account Tier Is Required to Use Claude aliases GPT-5 successful a Router Policy?

Commercial models from Anthropic and OpenAI require a Tier 3 aliases higher DigitalOcean account. Tier 1 and Tier 2 accounts person entree to open-source models only. Verify your tier successful the DO console earlier configuring router task policies that see Claude aliases GPT-5 exemplary slugs.

Conclusion

This tutorial built a three-path Inference Router for a SaaS support backend, configured task policies utilizing Cost Efficiency, Manual Ranking, and single-model selection, invoked each way done the OpenAI-compatible chat completions endpoint, and publication per-request costs signals from the consequence header and exemplary field. The monthly costs comparison shows a 39.6% savings against a Claude Sonnet 4.6 baseline and 63.7% against Claude Opus 4.7 astatine a postulation divided of 700,000 classify, 250,000 Q&A, and 50,000 reasoning requests per month. The confirmed per-request costs from the June 16, 2026 unrecorded runs are $0.00004070 connected the classifier path, $0.00445200 connected Q&A, and $0.03417625 connected reasoning.

With this router successful place, you tin govern LLM conclusion walk astatine the exemplary action furniture without modifying exertion code, publication the dispatched exemplary and matched task connected each consequence for per-request costs attribution, and use convention pinning connected the Q&A way to support KV-cache lukewarm crossed multi-turn conversations. Adding sum for caller task types requires only a caller task argumentation pinch a well-written description; the exertion endpoint stays the same.

For a deeper look astatine really to take which metrics to measurement crossed your conclusion stack earlier configuring exemplary tiers, spot Metrics that Matter pinch Serverless Inference. For workloads that require predictable latency SLAs astatine fixed capacity alternatively than serverless autoscaling, spot dedicated vs. serverless conclusion astatine scale. For the related tutorial connected building a cost-aware AI support API pinch the Inference Router, spot Cost-Aware AI Support API pinch the Inference Router. The Inference Router is successful nationalist preview astatine clip of writing; cheque the Inference Engine merchandise page for GA status.

Creative CommonsThis activity is licensed nether a Creative Commons Attribution-NonCommercial- ShareAlike 4.0 International License.

More