Routing and reliability are core parts of how Agumbe AI Gateway turns raw model access into production-ready infrastructure. When your application sends a request to the gateway, the system does more than forward that request to a provider. It first resolves the model, checks the request type, determines which route candidates are available, applies policy and usage controls, and then executes the request with reliability rules such as retries, fallbacks, timeout handling, and circuit breakers. For customers integrating Agumbe into real applications, this matters because production AI systems need more than model connectivity. They need predictable behavior when providers are slow, rate-limited, or temporarily unavailable, and when platform teams want to change routing strategy without rewriting application code.
Why this matters
Direct model integrations often begin simply. A service chooses one model, makes one request, and returns one response. That works for early development, but it becomes harder to operate as traffic grows. Teams eventually need answers to practical questions such as:
- What happens if the selected model is unavailable?
- How do we switch providers without changing application code?
- How do we prefer one model but fall back to another?
- How do we limit latency for one workload but allow longer execution for another?
- How do we stop repeatedly sending traffic to an unstable upstream?
Two important ideas
There are two related but different concepts to understand.
Routing
Routing answers the question: which model should this request go to? This includes:
- resolving aliases into concrete model targets
- validating whether the model supports the requested endpoint
- selecting one or more route candidates
- choosing ordering when multiple candidates exist
Reliability
Reliability answers the question: how should the gateway behave when execution is slow or fails? This includes:
- request timeout handling
- retries
- fallbacks
- weighted candidate selection
- circuit breakers
How routing works
When a request reaches the gateway, the model field is resolved before the provider call is made. At a high level, the flow looks like this:
- the gateway reads the requested model value
- it checks whether that value is an alias or a direct model ID
- it resolves the request into a canonical model target
- it validates that the model supports the requested endpoint
- it builds a route plan
- it executes that plan against one or more route candidates
Model resolution
Model resolution is the first routing step. A request can specify either:
- an Agumbe alias such as smart-default, cheap-fast, reasoning, or embed-default
- a canonical model ID exposed by the gateway catalog
Resolution produces a model target with the following details:
- requested model
- canonical model
- provider
- upstream model
- request kind
- alias status
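As a rough illustration of the resolution step, the sketch below maps a requested model value to a resolved target that carries the details listed above. The alias names come from this page, but the catalog entries, provider names, and the `resolve_model` function are hypothetical placeholders, not the gateway's actual catalog or API.

```python
from dataclasses import dataclass

# Hypothetical catalog: canonical model ID -> (provider, upstream model, request kind).
CATALOG = {
    "example/chat-large": ("example-provider", "chat-large-v1", "chat"),
    "example/embed-small": ("example-provider", "embed-small-v1", "embeddings"),
}

# Alias names are taken from the text; the targets they point to are invented.
ALIASES = {
    "smart-default": "example/chat-large",
    "embed-default": "example/embed-small",
}


@dataclass(frozen=True)
class ResolvedModel:
    requested_model: str
    canonical_model: str
    provider: str
    upstream_model: str
    request_kind: str
    is_alias: bool


def resolve_model(requested: str) -> ResolvedModel:
    """Resolve an alias or canonical ID into a concrete model target."""
    is_alias = requested in ALIASES
    canonical = ALIASES.get(requested, requested)
    if canonical not in CATALOG:
        raise ValueError(f"unknown model: {requested}")
    provider, upstream, kind = CATALOG[canonical]
    return ResolvedModel(requested, canonical, provider, upstream, kind, is_alias)
```

Calling `resolve_model("smart-default")` in this sketch returns a target whose `is_alias` flag is set, while passing a canonical ID returns the same shape with the flag cleared.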
Endpoint compatibility
Agumbe distinguishes between two request kinds:
- chat
- embeddings
Each kind maps to its own endpoint:
- a chat-capable model can be used with POST /api/v1/llm/chat/completions
- an embeddings-capable model can be used with POST /api/v1/llm/embeddings
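The compatibility check can be pictured as a lookup from endpoint path to request kind. This is a minimal sketch: the two paths are the ones named above, while the `ENDPOINT_KIND` table and `check_endpoint` function are illustrative names, and the error type is an assumption rather than the gateway's real error format.

```python
# Endpoint paths are taken from the text; the mapping structure is illustrative.
ENDPOINT_KIND = {
    "/api/v1/llm/chat/completions": "chat",
    "/api/v1/llm/embeddings": "embeddings",
}


def check_endpoint(path: str, model_kind: str) -> None:
    """Raise if a model of `model_kind` is used on an incompatible endpoint."""
    expected = ENDPOINT_KIND.get(path)
    if expected is None:
        raise ValueError(f"unknown endpoint: {path}")
    if expected != model_kind:
        raise ValueError(f"model kind {model_kind!r} cannot be used with {path}")
```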
Route plans
Once the gateway resolves the requested model, it builds a route plan. A route plan describes:
- the original requested model
- the request kind
- the ordered list of route candidates
- how many retry attempts are allowed for each candidate
Depending on configuration, a route plan can also involve:
- multiple candidates
- retry counts per candidate
- a maximum number of route attempts
- weighted selection behavior
Default routing behavior
If no custom route configuration exists, the gateway uses a straightforward behavior:
- resolve the requested model
- use that resolved model as the primary candidate
- execute one attempt
Custom routing behavior
When a routing rule is configured, the gateway can define multiple candidates for a requested model. That means a single application-facing model name can map to a richer runtime strategy. A routing rule can include:
- a list of candidate models
- a retry count for each candidate
- a weight for candidate selection
- a maximum number of total attempts
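To make the difference between default and custom routing concrete, here is a small sketch that builds a route plan either from a rule or from the single-candidate default. The `Candidate`, `RoutePlan`, and `build_route_plan` names and the rule's dictionary shape are assumptions for illustration; the gateway's real configuration format is not shown on this page. The sketch also assumes that a retry count means extra attempts after the first.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    model: str
    retries: int = 0     # assumed meaning: extra attempts after the first
    weight: float = 1.0  # relative preference for weighted selection


@dataclass(frozen=True)
class RoutePlan:
    requested_model: str
    request_kind: str
    candidates: Tuple[Candidate, ...]
    max_attempts: int


def build_route_plan(requested: str, resolved: str, kind: str,
                     rule: Optional[dict] = None) -> RoutePlan:
    """Build a plan from a rule, or use the single-attempt default."""
    if rule is None:
        # Default behavior: the resolved model is the only candidate, one attempt.
        return RoutePlan(requested, kind, (Candidate(resolved),), max_attempts=1)
    candidates = tuple(Candidate(**c) for c in rule["candidates"])
    return RoutePlan(requested, kind, candidates, rule["max_attempts"])
```

A rule such as `{"candidates": [{"model": "a", "retries": 1, "weight": 3.0}, {"model": "b"}], "max_attempts": 3}` would yield a two-candidate plan, while omitting the rule yields the default plan.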
Weighted candidate selection
When multiple routing candidates exist, the gateway can use weighted selection to determine candidate ordering. In practical terms, this means:
- some candidates can be preferred more often than others
- lower-priority candidates can still remain available as fallback options
- the gateway does not have to use a rigid fixed order every time
Weighted selection is useful for:
- gradual rollout strategies
- balancing between quality and cost
- reliability tuning
- introducing backup models without making them the primary default
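One common way to turn weights into an ordering is weighted sampling without replacement: pick the first candidate with probability proportional to its weight, then repeat on the rest. The sketch below shows that technique; it is an assumption about how such ordering could work, not a description of the gateway's actual algorithm, and `weighted_order` is a hypothetical helper.

```python
import random


def weighted_order(candidates, rng=None):
    """Return all candidates in a random order biased by their weights.

    `candidates` is a list of (name, weight) pairs with positive weights.
    """
    if rng is None:
        rng = random.Random()
    remaining = list(candidates)
    ordered = []
    while remaining:
        weights = [weight for _, weight in remaining]
        # Pick one remaining candidate with probability proportional to its weight.
        index = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        ordered.append(remaining.pop(index)[0])
    return ordered
```

With weights of 9 and 1, the heavier candidate comes first in roughly nine out of ten orderings, while the lighter one always remains in the list as a fallback.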
Retries
Retries are applied at the route-candidate level. If a candidate is configured with multiple retry attempts, the gateway can retry that same candidate before moving on to the next one. This helps when failures are temporary, such as:
- transient upstream errors
- temporary rate limiting
- short-lived provider instability
Fallbacks
Fallbacks happen when one candidate fails and the route plan includes another candidate. For example, a route plan may define:
- one preferred model
- one or more backup models
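Retries and fallbacks combine into a simple nested loop: retry the current candidate, then move to the next, and stop once the total attempt budget is spent. The following sketch shows that control flow under stated assumptions: `execute_plan` and `call` are hypothetical names, a retry count is treated as extra attempts after the first, and every exception is treated as retryable, which a real gateway would likely restrict.

```python
def execute_plan(candidates, max_attempts, call):
    """Try candidates in order, retrying each before falling back.

    `candidates` is a list of (model, retries) pairs, where `retries` is the
    number of extra attempts allowed on that candidate (an assumption).
    `call(model)` performs one upstream attempt and raises on failure.
    Returns (model, result) from the first successful attempt.
    """
    attempts = 0
    last_error = None
    for model, retries in candidates:
        for _ in range(1 + retries):
            if attempts >= max_attempts:
                raise RuntimeError("route attempts exhausted") from last_error
            attempts += 1
            try:
                return model, call(model)
            except Exception as exc:  # simplification: all errors are retryable here
                last_error = exc
    raise RuntimeError("all route candidates failed") from last_error
```

For a plan of `[("primary", 1), ("backup", 0)]` where the primary always fails, this sketch attempts the primary twice and then returns the backup's result, provided the attempt budget allows three attempts.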
Timeouts
Every request to an upstream provider runs with a timeout. Timeouts are important because production systems cannot wait indefinitely for a response. Even a correct answer becomes operationally expensive if it arrives too slowly for the workload. Agumbe supports:
- a default request timeout
- provider-level timeout overrides
- model-level timeout overrides
These overrides make it possible to configure:
- shorter timeouts for user-facing workloads
- slightly longer timeouts for analytical or asynchronous workloads
- special handling for a specific model that is known to be slower
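With three levels of timeout configuration, something has to decide which one applies. The sketch below assumes the most specific setting wins (model override, then provider override, then the default). That precedence, the 30-second default, and the `resolve_timeout` helper are all assumptions for illustration and are not confirmed by this page.

```python
DEFAULT_TIMEOUT_S = 30.0  # hypothetical default, not a documented value


def resolve_timeout(provider, model, provider_overrides, model_overrides,
                    default=DEFAULT_TIMEOUT_S):
    """Pick the most specific timeout: model, then provider, then default."""
    if model in model_overrides:
        return model_overrides[model]
    if provider in provider_overrides:
        return provider_overrides[provider]
    return default
```

For example, a slow model could carry its own 60-second override while every other model from the same provider keeps a tighter provider-level value.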
Circuit breakers
Circuit breakers protect the gateway from repeatedly sending traffic to an unstable upstream target. When enabled for a provider or model, the circuit breaker tracks consecutive failures. If the failure threshold is reached, the circuit opens for a cooldown period. While the circuit is open:
- the gateway does not continue sending new requests to that route target
- the request can fail fast or move through other available candidates, depending on the route plan
This is useful when:
- one provider is experiencing a partial outage
- one model is consistently failing
- repeated retries would only add latency and error volume
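The consecutive-failure behavior described above can be sketched as a small state machine. This is a simplified illustration, not the gateway's implementation: the `CircuitBreaker` class and its method names are hypothetical, and it closes the circuit fully once the cooldown elapses rather than modeling a half-open probing state.

```python
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures, stays open for `cooldown_s`."""

    def __init__(self, threshold, cooldown_s, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so the cooldown can be tested
        self.failures = 0
        self.opened_at = None

    def allow(self):
        """Return True if a request may be sent to this target."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: close the circuit and start counting again.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self):
        # Any success resets the consecutive-failure count.
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

A caller would check `allow()` before each attempt on a target, skip to the next candidate when it returns False, and report the outcome with `record_success()` or `record_failure()`.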
What reliability protects against
Reliability controls are useful for several common production cases.
Temporary provider instability
If an upstream provider has a transient error, the gateway can retry or fall back instead of immediately failing the request.
Rate limit pressure
If a model or provider returns a retryable rate-limit condition, the gateway can use its configured execution plan rather than forcing application code to handle every case itself.
Slow responses
If an upstream model becomes too slow, timeout rules keep your application from hanging indefinitely.
Repeated failures
If one target continues failing, circuit breakers help stop the gateway from repeatedly sending traffic into a broken path.
What reliability does not replace
Reliability controls improve execution behavior, but they do not remove the need for good application design. You should still:
- keep your own service timeouts sensible
- handle gateway errors cleanly
- monitor latency and failure trends
- use appropriate app-level guardrails
- choose stable model aliases for production traffic
Request flow with routing and reliability
A production request typically follows this order:
- authenticate the caller
- parse and validate the request
- resolve the requested model
- select the app policy
- load guardrail policy
- apply usage controls
- build the route plan
- apply request-side guardrails
- execute the route plan
- apply response-side guardrails
- log the request
- emit usage events
- return the response with timing and cost metadata
Routing and guardrails
Routing and guardrails are closely connected. For example:
- a request may resolve to a model that is blocked by an app’s allowed model list
- a request may be capped by a token policy before it reaches the provider
- an app-level rate limit may block the request before any routing attempt occurs
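The first two interactions above can be pictured as a policy check that runs before any route attempt. The sketch below is illustrative only: the policy dictionary shape (`allowed_models`, `max_tokens`) and the `check_app_policy` function are hypothetical, and the real policy model may differ.

```python
def check_app_policy(model, requested_max_tokens, policy):
    """Apply app policy before routing; return the token cap to forward."""
    allowed = policy.get("allowed_models")
    if allowed is not None and model not in allowed:
        # The resolved model is blocked, so no route attempt should occur.
        raise PermissionError(f"model {model!r} is not allowed for this app")
    cap = policy.get("max_tokens")
    if cap is not None:
        # Cap the request rather than reject it.
        return min(requested_max_tokens, cap)
    return requested_max_tokens
```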
Routing and aliases
Aliases are especially valuable when routing behavior evolves over time. A stable alias such as smart-default gives your application a durable contract. Behind that contract, the gateway can:
- change the primary model
- introduce retries
- add fallback candidates
- apply weighted candidate selection
- tune timeouts
- tune circuit breakers
Observing routing behavior
Agumbe exposes timing and request metadata that help you understand how the request was processed. Successful responses may include headers such as:
- x-agumbe-timing-total-ms
- x-agumbe-timing-model-resolve-ms
- x-agumbe-timing-guardrail-config-ms
- x-agumbe-timing-guardrail-input-ms
- x-agumbe-timing-provider-ms
- x-agumbe-timing-guardrail-output-ms
- x-agumbe-timing-request-log-ms
- x-agumbe-timing-usage-emit-ms
- x-agumbe-timing-side-effects-ms
- x-agumbe-timing-gateway-overhead-ms
- x-agumbe-estimated-cost-usd
These headers help you see:
- how long model resolution took
- how much time was spent with the upstream provider
- how much latency came from gateway overhead
- whether side effects such as logging and usage emission were significant
Request logs capture details such as:
- requested model
- provider
- upstream model
- request status
- latency
- token usage
- estimated cost
- error code
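On the client side, the timing headers listed above can be collected into a dictionary for logging or dashboards. This sketch assumes each `x-agumbe-timing-*-ms` value is a plain number of milliseconds, which the header names suggest but this page does not state. `parse_timing_headers` is a hypothetical helper.

```python
def parse_timing_headers(headers):
    """Collect x-agumbe-timing-*-ms headers into a {stage: milliseconds} dict."""
    prefix, suffix = "x-agumbe-timing-", "-ms"
    timings = {}
    for name, value in headers.items():
        key = name.lower()  # header names are case-insensitive
        if key.startswith(prefix) and key.endswith(suffix):
            # Assumption: the value is a numeric millisecond count.
            timings[key[len(prefix):-len(suffix)]] = float(value)
    return timings
```

Comparing the `provider` entry with `total` gives a quick view of how much latency came from the upstream call versus the gateway itself.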
Common routing and reliability errors
A few error patterns appear frequently in this part of the system.
Invalid model
Returned when a model cannot be resolved or does not match the endpoint kind.
Route unavailable
Returned when no usable route candidate is configured for the requested model.
Unsupported provider
Returned when the gateway cannot execute the required capability for the selected provider target.
Request timeout
Returned when the upstream request exceeds the configured timeout window.
Provider error
Returned when the upstream execution fails and the gateway cannot successfully complete the route plan.
Circuit open
Returned when a circuit breaker is currently open for the selected route target.
Best practices
Prefer aliases for production traffic
Aliases give the gateway more room to improve routing and resilience over time without forcing application changes.
Keep routing strategy centralized
Do not push model selection and fallback logic into every application service unless you have a very specific reason to do so.
Tune for workload type
Different workloads need different reliability behavior. User-facing requests may need tighter timeouts. Background workflows may tolerate more retries.
Use observability data
Timing headers and request logs are not just diagnostics. They help you tune route behavior based on real traffic.
Keep fallback chains intentional
More fallback candidates are not always better. A smaller, well-understood route plan is easier to operate than a large, opaque one.
Pair reliability with guardrails and usage controls
A resilient route is only one part of production readiness. It should work together with app-level policies, model controls, and request monitoring.
Recommended starting point
If you are adopting Agumbe AI Gateway for the first time, a strong starting pattern is:
- use aliases such as smart-default and embed-default
- begin with simple default routing
- add retries only where they provide clear value
- introduce fallbacks deliberately, not automatically everywhere
- set reasonable timeout expectations for your workload
- observe real latency and failure behavior before making routing more complex