Routing and reliability are core parts of how Agumbe AI Gateway turns raw model access into production-ready infrastructure. When your application sends a request to the gateway, the system does more than forward that request to a provider. It first resolves the model, checks the request type, determines which route candidates are available, applies policy and usage controls, and then executes the request with reliability rules such as retries, fallbacks, timeout handling, and circuit breakers. For customers integrating Agumbe into real applications, this matters because production AI systems need more than model connectivity. They need predictable behavior when providers are slow, rate-limited, temporarily unavailable, or when platform teams want to change routing strategy without rewriting application code.

Why this matters

Direct model integrations often begin simply. A service chooses one model, makes one request, and returns one response. That works for early development, but it becomes harder to operate as traffic grows. Teams eventually need answers to practical questions such as:
  • What happens if the selected model is unavailable?
  • How do we switch providers without changing application code?
  • How do we prefer one model but fall back to another?
  • How do we limit latency for one workload but allow longer execution for another?
  • How do we stop repeatedly sending traffic to an unstable upstream?
Agumbe AI Gateway addresses these concerns through a routing layer and a reliability layer. Together, these layers let you keep your application contract stable while improving execution behavior centrally.

Two important ideas

There are two related but distinct concepts to understand.

Routing

Routing answers the question: which model should this request go to? This includes:
  • resolving aliases into concrete model targets
  • validating whether the model supports the requested endpoint
  • selecting one or more route candidates
  • choosing ordering when multiple candidates exist

Reliability

Reliability answers the question: how should the gateway behave when execution is slow or fails? This includes:
  • request timeout handling
  • retries
  • fallbacks
  • weighted candidate selection
  • circuit breakers
Routing decides where a request should go. Reliability decides how the gateway behaves while trying to fulfill it.

How routing works

When a request reaches the gateway, the model field is resolved before the provider call is made. At a high level, the flow looks like this:
  1. the gateway reads the requested model value
  2. it checks whether that value is an alias or a direct model ID
  3. it resolves the request into a canonical model target
  4. it validates that the model supports the requested endpoint
  5. it builds a route plan
  6. it executes that plan against one or more route candidates
This means your application does not need to know how model routing is implemented internally. It sends a stable request, and the gateway handles the rest.
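The flow above can be sketched as a small Python function. This is an illustration only: the function names, alias table, and catalog shape are hypothetical and not part of the Agumbe API.

```python
# Illustrative sketch of model resolution and route-plan building.
# ALIASES and CATALOG are hypothetical stand-ins for the gateway's
# internal alias mapping and model catalog.

ALIASES = {"smart-default": "@anthropic/claude-sonnet-4"}   # example mapping
CATALOG = {"@anthropic/claude-sonnet-4": {"kind": "chat"}}  # example catalog

def resolve_and_plan(requested_model: str, endpoint_kind: str) -> dict:
    # Steps 1-3: resolve an alias or direct ID into a canonical target
    canonical = ALIASES.get(requested_model, requested_model)
    entry = CATALOG.get(canonical)
    if entry is None:
        raise ValueError("invalid_model: unknown model")
    # Step 4: validate that the model supports the requested endpoint
    if entry["kind"] != endpoint_kind:
        raise ValueError("invalid_model: wrong endpoint kind")
    # Step 5: build a minimal route plan (single candidate, one attempt)
    return {
        "requested": requested_model,
        "kind": endpoint_kind,
        "candidates": [{"model": canonical, "retries": 0}],
    }

plan = resolve_and_plan("smart-default", "chat")
```

The application only ever supplies `"smart-default"`; everything after that point is gateway-side behavior.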

Model resolution

Model resolution is the first routing step. A request can specify either:
  • an Agumbe alias such as smart-default, cheap-fast, reasoning, or embed-default
  • a canonical model ID exposed by the gateway catalog
Examples:
{ "model": "smart-default" }
{ "model": "@anthropic/claude-sonnet-4" }
When the gateway receives a request, it resolves the value into a canonical model target with information such as:
  • requested model
  • canonical model
  • provider
  • upstream model
  • request kind
  • alias status
This resolution step makes the rest of execution deterministic.
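As a rough illustration, a resolved model target can be pictured as a record like the one below. The field names follow the list above, but the exact internal representation is an assumption, not a published schema.

```python
# Hypothetical shape of a resolved model target after alias resolution.
# Field names mirror the list above; values are examples only.
resolved = {
    "requested_model": "smart-default",
    "canonical_model": "@anthropic/claude-sonnet-4",
    "provider": "anthropic",
    "upstream_model": "claude-sonnet-4",
    "request_kind": "chat",
    "is_alias": True,
}
```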

Endpoint compatibility

Agumbe distinguishes between two request kinds:
  • chat
  • embeddings
This matters because the gateway does not allow a model to be used with the wrong endpoint. For example:
  • a chat-capable model can be used with POST /api/v1/llm/chat/completions
  • an embeddings-capable model can be used with POST /api/v1/llm/embeddings
If a request tries to use a model that resolves to the wrong kind, the gateway returns an invalid_model error. This validation protects applications from sending invalid traffic and makes the API behavior easier to reason about.
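The endpoint check can be pictured as a simple guard. The endpoint paths below come from this page; the function names are illustrative.

```python
# Illustrative endpoint-compatibility guard. Paths are taken from the
# docs above; the function names are hypothetical.
ENDPOINTS = {
    "chat": "/api/v1/llm/chat/completions",
    "embeddings": "/api/v1/llm/embeddings",
}

def check_endpoint(model_kind: str, endpoint_kind: str) -> None:
    # Mirrors the invalid_model error shape described on this page.
    if model_kind != endpoint_kind:
        raise ValueError({"code": "invalid_model", "param": "model"})
```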

Route plans

Once the gateway resolves the requested model, it builds a route plan. A route plan describes:
  • the original requested model
  • the request kind
  • the ordered list of route candidates
  • how many retry attempts are allowed for each candidate
If there is no custom routing rule for the model, the route plan is simple. The gateway uses the resolved model as the single candidate and executes it once. If there is a custom routing rule, the route plan may include:
  • multiple candidates
  • retry counts per candidate
  • a maximum number of route attempts
  • weighted selection behavior
This gives platform teams a way to shape runtime execution without forcing application teams to change request code.
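A route plan with a custom rule might look like the sketch below. The dataclass fields mirror the description above, but this is an assumed in-memory shape, not a published Agumbe schema, and the model IDs are examples.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory shape of a route plan; field names mirror the
# description above, not a published Agumbe schema.

@dataclass
class RouteCandidate:
    model: str
    retries: int = 0      # extra attempts for this candidate
    weight: int = 1       # used for weighted selection

@dataclass
class RoutePlan:
    requested_model: str
    kind: str
    candidates: list[RouteCandidate] = field(default_factory=list)
    max_attempts: int = 1

# Example: one preferred model with a retry, plus a lower-weight backup.
plan = RoutePlan(
    requested_model="smart-default",
    kind="chat",
    candidates=[
        RouteCandidate("@anthropic/claude-sonnet-4", retries=1, weight=80),
        RouteCandidate("@openai/gpt-4o", retries=0, weight=20),
    ],
    max_attempts=3,
)
```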

Default routing behavior

If no custom route configuration exists, the gateway uses a straightforward behavior:
  • resolve the requested model
  • use that resolved model as the primary candidate
  • execute one attempt
This keeps the default behavior simple and predictable. For many customers, this is enough in early integration phases.

Custom routing behavior

When a routing rule is configured, the gateway can define multiple candidates for a requested model. That means a single application-facing model name can map to a richer runtime strategy. A routing rule can include:
  • a list of candidate models
  • a retry count for each candidate
  • a weight for candidate selection
  • a maximum number of total attempts
This is useful when you want one logical model entry to support more resilient or more flexible runtime behavior.

Weighted candidate selection

When multiple routing candidates exist, the gateway can use weighted selection to determine candidate ordering. In practical terms, this means:
  • some candidates can be preferred more often than others
  • lower-priority candidates can still remain available as fallback options
  • the gateway does not have to use a rigid fixed order every time
This is useful when you want a preferred model most of the time, but still want traffic distribution or fallback coverage across additional models. Weighted routing is especially helpful for:
  • gradual rollout strategies
  • balancing between quality and cost
  • reliability tuning
  • introducing backup models without making them the primary default
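One way to implement weighted ordering is repeated weighted sampling without replacement, sketched below. The gateway's actual selection algorithm is not specified on this page, so treat this purely as an illustration of the idea.

```python
import random

# One possible weighted-ordering scheme: repeatedly pick a candidate
# with probability proportional to its weight, without replacement.
# This is an illustration; the gateway's real algorithm may differ.
def weighted_order(candidates: list[tuple[str, int]],
                   rng: random.Random) -> list[str]:
    remaining = list(candidates)
    ordered = []
    while remaining:
        models, weights = zip(*remaining)
        pick = rng.choices(models, weights=weights, k=1)[0]
        ordered.append(pick)
        remaining = [c for c in remaining if c[0] != pick]
    return ordered

rng = random.Random(0)  # seeded for reproducibility
order = weighted_order([("primary", 80), ("backup", 20)], rng)
```

With weights 80/20, "primary" usually comes first, but "backup" still leads occasionally, which is what gives you gradual rollout and fallback coverage at the same time.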

Retries

Retries are applied at the route-candidate level. If a candidate is configured with multiple retry attempts, the gateway can retry that same candidate before moving on to the next one. This helps when failures are temporary, such as:
  • transient upstream errors
  • temporary rate limiting
  • short-lived provider instability
Retries improve resilience without immediately switching to a fallback model. The gateway only retries failures that are treated as retryable. In general, retryable failures are upstream execution failures, not local validation or policy errors.
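Per-candidate retry behavior can be sketched as a loop that re-invokes the same candidate only for retryable errors. The exception class and call shape are illustrative, not Agumbe internals.

```python
# Sketch of per-candidate retries: only retryable (upstream) failures
# trigger another attempt. Names are illustrative.

class RetryableError(Exception):
    """Transient upstream failure (e.g. a 429 or 5xx from a provider)."""

def call_with_retries(call, retries: int):
    attempts = 0
    while True:
        attempts += 1
        try:
            return call(), attempts
        except RetryableError:
            if attempts > retries:
                raise  # retry budget for this candidate is exhausted

# Simulated upstream: fails once, then succeeds.
flaky = iter([RetryableError(), "ok"])
def fake_call():
    item = next(flaky)
    if isinstance(item, Exception):
        raise item
    return item

result, attempts = call_with_retries(fake_call, retries=1)
```

A non-retryable error (such as a validation failure) would simply propagate without consuming the retry budget.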

Fallbacks

Fallbacks happen when one candidate fails and the route plan includes another candidate. For example, a route plan may define:
  • one preferred model
  • one or more backup models
If the preferred candidate fails with a retryable error, the gateway can move to the next candidate in the route plan. This gives customers a cleaner production story. Your application still asks for one stable model name, but the gateway can continue working through alternate execution paths when necessary. Fallback behavior is one of the strongest reasons to use aliases in production. It lets you separate the application contract from the runtime execution strategy.
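The fallback walk across candidates can be sketched as follows; the candidate names and error class are illustrative.

```python
# Sketch of fallback across route candidates: try each in order,
# moving to the next only on a retryable upstream failure.

class UpstreamError(Exception):
    pass

def execute_plan(candidates, call):
    last_error = None
    for model in candidates:
        try:
            return call(model)
        except UpstreamError as exc:
            last_error = exc  # retryable: fall through to next candidate
    raise last_error  # every candidate failed

# Simulated upstream: the preferred model is down, the backup works.
def fake_call(model):
    if model == "primary":
        raise UpstreamError("primary unavailable")
    return f"answer from {model}"

result = execute_plan(["primary", "backup"], fake_call)
```

The caller asked for one logical model; the fallback to "backup" is invisible to application code.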

Timeouts

Every request to an upstream provider runs with a timeout. Timeouts are important because production systems cannot wait indefinitely for a response. Even a correct answer becomes operationally expensive if it arrives too slowly for the workload. Agumbe supports:
  • a default request timeout
  • provider-level timeout overrides
  • model-level timeout overrides
This lets teams tune latency expectations more precisely. For example, you may want:
  • shorter timeouts for user-facing workloads
  • slightly longer timeouts for analytical or asynchronous workloads
  • special handling for a specific model that is known to be slower
Timeouts are part of the reliability layer because they prevent stuck or excessively slow upstream requests from degrading the whole system.
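One plausible precedence for the three timeout levels is model override, then provider override, then default, sketched below. The precedence order and all values here are assumptions for illustration.

```python
# Sketch of timeout resolution. The precedence (model override >
# provider override > default) and the values are assumptions.

DEFAULT_TIMEOUT_S = 60
PROVIDER_TIMEOUTS = {"anthropic": 45}                    # example override
MODEL_TIMEOUTS = {"@anthropic/claude-sonnet-4": 90}      # example override

def timeout_for(provider: str, model: str) -> int:
    if model in MODEL_TIMEOUTS:
        return MODEL_TIMEOUTS[model]
    return PROVIDER_TIMEOUTS.get(provider, DEFAULT_TIMEOUT_S)
```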

Circuit breakers

Circuit breakers protect the gateway from repeatedly sending traffic to an unstable upstream target. When enabled for a provider or model, the circuit breaker tracks consecutive failures. If the failure threshold is reached, the circuit opens for a cooldown period. While the circuit is open:
  • the gateway does not continue sending new requests to that route target
  • the request can fail fast or move through other available candidates, depending on the route plan
After the cooldown period, the gateway can try the target again. Circuit breakers help reduce repeated failure storms and improve overall resilience when an upstream dependency is unstable. They are especially useful when:
  • one provider is experiencing a partial outage
  • one model is consistently failing
  • repeated retries would only add latency and error volume
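A minimal consecutive-failure circuit breaker, as described above, can be sketched like this. The threshold, cooldown handling, and class shape are illustrative; the gateway's real implementation may differ.

```python
import time

# Minimal consecutive-failure circuit breaker sketch. Thresholds and
# cooldown semantics in the real gateway may differ.

class CircuitBreaker:
    def __init__(self, threshold: int, cooldown_s: float,
                 clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: close the circuit and try again.
            self.opened_at = None
            self.failures = 0
            return True
        return False  # circuit open: fail fast

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

    def record_success(self):
        self.failures = 0
        self.opened_at = None

# Fixed clock so the example is deterministic.
breaker = CircuitBreaker(threshold=2, cooldown_s=30, clock=lambda: 0.0)
breaker.record_failure()
breaker.record_failure()  # threshold reached: circuit opens
```

While `allow()` returns False, the route plan can either fail fast or move on to other candidates, as described above.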

What reliability protects against

Reliability controls are useful for several common production cases.

Temporary provider instability

If an upstream provider has a transient error, the gateway can retry or fall back instead of immediately failing the request.

Rate limit pressure

If a model or provider returns a retryable rate-limit condition, the gateway can use its configured execution plan rather than forcing application code to handle every case itself.

Slow responses

If an upstream model becomes too slow, timeout rules keep your application from hanging indefinitely.

Repeated failures

If one target continues failing, circuit breakers help stop the gateway from repeatedly sending traffic into a broken path.

What reliability does not replace

Reliability controls improve execution behavior, but they do not remove the need for good application design. You should still:
  • keep your own service timeouts sensible
  • handle gateway errors cleanly
  • monitor latency and failure trends
  • use appropriate app-level guardrails
  • choose stable model aliases for production traffic
The gateway helps centralize execution strategy, but your application should still be built with normal production discipline.

Request flow with routing and reliability

A production request typically follows this order:
  1. authenticate the caller
  2. parse and validate the request
  3. resolve the requested model
  4. select the app policy
  5. load guardrail policy
  6. apply usage controls
  7. build the route plan
  8. apply request-side guardrails
  9. execute the route plan
  10. apply response-side guardrails
  11. log the request
  12. emit usage events
  13. return the response with timing and cost metadata
This order is important because routing and reliability do not exist in isolation. They work together with authentication, guardrails, and observability.

Routing and guardrails

Routing and guardrails are closely connected. For example:
  • a request may resolve to a model that is blocked by an app’s allowed model list
  • a request may be capped by a token policy before it reaches the provider
  • an app-level rate limit may block the request before any routing attempt occurs
This means that a request is not routed only by model logic. It is routed within the boundaries of the app’s policy. That is one reason Agumbe uses app-level guardrails and request context alongside routing logic.

Routing and aliases

Aliases are especially valuable when routing behavior evolves over time. A stable alias such as smart-default gives your application a durable contract. Behind that contract, the gateway can:
  • change the primary model
  • introduce retries
  • add fallback candidates
  • apply weighted candidate selection
  • tune timeouts
  • tune circuit breakers
This is the cleanest way to improve runtime behavior without changing application-facing model names. For most teams, this is the right production pattern: applications use aliases, and the gateway owns route behavior centrally.

Observing routing behavior

Agumbe exposes timing and request metadata that help you understand how the request was processed. Successful responses may include headers such as:
  • x-agumbe-timing-total-ms
  • x-agumbe-timing-model-resolve-ms
  • x-agumbe-timing-guardrail-config-ms
  • x-agumbe-timing-guardrail-input-ms
  • x-agumbe-timing-provider-ms
  • x-agumbe-timing-guardrail-output-ms
  • x-agumbe-timing-request-log-ms
  • x-agumbe-timing-usage-emit-ms
  • x-agumbe-timing-side-effects-ms
  • x-agumbe-timing-gateway-overhead-ms
  • x-agumbe-estimated-cost-usd
These fields make it easier to answer questions such as:
  • how long model resolution took
  • how much time was spent with the upstream provider
  • how much latency came from gateway overhead
  • whether side effects such as logging and usage emission were significant
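Reading those headers is straightforward. The header names below come from the list above; the values are made up for illustration.

```python
# Sketch of interpreting the timing headers listed above. Header names
# come from this page; the values here are invented examples.
headers = {
    "x-agumbe-timing-total-ms": "812",
    "x-agumbe-timing-provider-ms": "640",
    "x-agumbe-timing-gateway-overhead-ms": "172",
    "x-agumbe-estimated-cost-usd": "0.0042",
}

provider_ms = int(headers["x-agumbe-timing-provider-ms"])
overhead_ms = int(headers["x-agumbe-timing-gateway-overhead-ms"])
# Fraction of total latency spent waiting on the upstream provider.
provider_share = provider_ms / int(headers["x-agumbe-timing-total-ms"])
```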
Request logs also record useful fields such as:
  • requested model
  • provider
  • upstream model
  • request status
  • latency
  • token usage
  • estimated cost
  • error code
This gives teams practical visibility into how routing decisions play out in production.

Common routing and reliability errors

A few error patterns appear frequently in this part of the system.

Invalid model

Returned when a model cannot be resolved or does not match the endpoint kind. Example:
{
  "error": {
    "message": "Model embed-default resolves to embeddings and cannot be used with the chat endpoint",
    "type": "invalid_request_error",
    "param": "model",
    "code": "invalid_model"
  }
}

Route unavailable

Returned when no usable route candidate is configured for the requested model. Example:
{
  "error": {
    "message": "No routing candidates are configured for this model",
    "type": "api_error",
    "param": null,
    "code": "route_unavailable"
  }
}

Unsupported provider

Returned when the gateway cannot execute the required capability for the selected provider target.

Request timeout

Returned when the upstream request exceeds the configured timeout window.

Provider error

Returned when the upstream execution fails and the gateway cannot successfully complete the route plan.

Circuit open

Returned when a circuit breaker is currently open for the selected route target.

Best practices

Prefer aliases for production traffic

Aliases give the gateway more room to improve routing and resilience over time without forcing application changes.

Keep routing strategy centralized

Do not push model selection and fallback logic into every application service unless you have a very specific reason to do so.

Tune for workload type

Different workloads need different reliability behavior. User-facing requests may need tighter timeouts. Background workflows may tolerate more retries.

Use observability data

Timing headers and request logs are not just diagnostics. They help you tune route behavior based on real traffic.

Keep fallback chains intentional

More fallback candidates are not always better. A smaller, well-understood route plan is easier to operate than a large, opaque one.

Pair reliability with guardrails and usage controls

A resilient route is only one part of production readiness. It should work together with app-level policies, model controls, and request monitoring. If you are adopting Agumbe AI Gateway for the first time, a strong starting pattern is:
  • use aliases such as smart-default and embed-default
  • begin with simple default routing
  • add retries only where they provide clear value
  • introduce fallbacks deliberately, not automatically everywhere
  • set reasonable timeout expectations for your workload
  • observe real latency and failure behavior before making routing more complex
This gives you a stable foundation without over-engineering your route strategy too early.

Next steps

Once you understand routing and reliability, the next page to read is Request Logging and Observability, where you can see how request execution, timing, usage, and errors show up operationally.