Routing and reliability are core parts of how Agumbe AI Gateway turns raw model access into production-ready infrastructure. When your application sends a request to the gateway, the system does more than forward that request to a provider. It first resolves the model, checks the request type, determines which route candidates are available, applies policy and usage controls, and then executes the request with reliability rules such as retries, fallbacks, timeout handling, and circuit breakers. For customers integrating Agumbe into real applications, this matters because production AI systems need more than model connectivity. They need predictable behavior when providers are slow, rate-limited, temporarily unavailable, or when platform teams want to change routing strategy without rewriting application code.

Why this matters

Direct model integrations often begin simply. A service chooses one model, makes one request, and returns one response. That works for early development, but it becomes harder to operate as traffic grows. Teams eventually need answers to practical questions such as:
  • What happens if the selected model is unavailable?
  • How do we switch providers without changing application code?
  • How do we prefer one model but fall back to another?
  • How do we limit latency for one workload but allow longer execution for another?
  • How do we stop repeatedly sending traffic to an unstable upstream?
Agumbe AI Gateway addresses these concerns through a routing layer and a reliability layer. Together, these layers let you keep your application contract stable while improving execution behavior centrally.

Two important ideas

There are two related but distinct concepts to understand.

Routing

Routing answers the question: which model should this request go to? This includes:
  • resolving aliases into concrete model targets
  • validating whether the model supports the requested endpoint
  • selecting one or more route candidates
  • choosing ordering when multiple candidates exist

Reliability

Reliability answers the question: how should the gateway behave when execution is slow or fails? This includes:
  • request timeout handling
  • retries
  • fallbacks
  • weighted candidate selection
  • circuit breakers
Routing decides where a request should go. Reliability decides how the gateway behaves while trying to fulfill it.

How routing works

When a request reaches the gateway, the model field is resolved before the provider call is made. At a high level, the flow looks like this:
  1. the gateway reads the requested model value
  2. it checks whether that value is an alias or a direct model ID
  3. it resolves the request into a canonical model target
  4. it validates that the model supports the requested endpoint
  5. it builds a route plan
  6. it executes that plan against one or more route candidates
This means your application does not need to know how model routing is implemented internally. It sends a stable request, and the gateway handles the rest.
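The flow above can be sketched as a small Python function. This is an illustration only: the function names, alias table, and catalog shape are hypothetical and not part of the Agumbe API.

```python
# Illustrative sketch of model resolution and route-plan building.
# ALIASES and CATALOG are hypothetical stand-ins for the gateway's
# internal alias mapping and model catalog.

ALIASES = {"smart-default": "@anthropic/claude-sonnet-4"}   # example mapping
CATALOG = {"@anthropic/claude-sonnet-4": {"kind": "chat"}}  # example catalog

def resolve_and_plan(requested_model: str, endpoint_kind: str) -> dict:
    # Steps 1-3: resolve an alias or direct ID into a canonical target
    canonical = ALIASES.get(requested_model, requested_model)
    entry = CATALOG.get(canonical)
    if entry is None:
        raise ValueError("invalid_model: unknown model")
    # Step 4: validate that the model supports the requested endpoint
    if entry["kind"] != endpoint_kind:
        raise ValueError("invalid_model: wrong endpoint kind")
    # Step 5: build a minimal route plan (single candidate, one attempt)
    return {
        "requested": requested_model,
        "kind": endpoint_kind,
        "candidates": [{"model": canonical, "retries": 0}],
    }

plan = resolve_and_plan("smart-default", "chat")
```

The application only ever supplies `"smart-default"`; everything after that point is gateway-side behavior.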

Model resolution

Model resolution is the first routing step. A request can specify either:
  • an Agumbe alias such as smart-default, cheap-fast, reasoning, or embed-default
  • a canonical model ID exposed by the gateway catalog
Examples:
{ "model": "smart-default" }
{ "model": "@anthropic/claude-sonnet-4" }
When the gateway receives a request, it resolves the value into a canonical model target with information such as:
  • requested model
  • canonical model
  • provider
  • upstream model
  • request kind
  • alias status
This resolution step makes the rest of execution deterministic.
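As a rough illustration, a resolved model target can be pictured as a record like the one below. The field names follow the list above, but the exact internal representation is an assumption, not a published schema.

```python
# Hypothetical shape of a resolved model target after alias resolution.
# Field names mirror the list above; values are examples only.
resolved = {
    "requested_model": "smart-default",
    "canonical_model": "@anthropic/claude-sonnet-4",
    "provider": "anthropic",
    "upstream_model": "claude-sonnet-4",
    "request_kind": "chat",
    "is_alias": True,
}
```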

Endpoint compatibility

Agumbe distinguishes between two request kinds:
  • chat
  • embeddings
This matters because the gateway does not allow a model to be used with the wrong endpoint. For example:
  • a chat-capable model can be used with POST /api/v1/llm/chat/completions
  • an embeddings-capable model can be used with POST /api/v1/llm/embeddings
If a request tries to use a model that resolves to the wrong kind, the gateway returns an invalid_model error. This validation protects applications from sending invalid traffic and makes the API behavior easier to reason about.
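The endpoint check can be pictured as a simple guard. The endpoint paths below come from this page; the function names are illustrative.

```python
# Illustrative endpoint-compatibility guard. Paths are taken from the
# docs above; the function names are hypothetical.
ENDPOINTS = {
    "chat": "/api/v1/llm/chat/completions",
    "embeddings": "/api/v1/llm/embeddings",
}

def check_endpoint(model_kind: str, endpoint_kind: str) -> None:
    # Mirrors the invalid_model error shape described on this page.
    if model_kind != endpoint_kind:
        raise ValueError({"code": "invalid_model", "param": "model"})
```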

Route plans

Once the gateway resolves the requested model, it builds a route plan. A route plan describes:
  • the original requested model
  • the request kind
  • the ordered list of route candidates
  • how many retry attempts are allowed for each candidate
If there is no custom routing rule for the model, the route plan is simple. The gateway uses the resolved model as the single candidate and executes it once. If there is a custom routing rule, the route plan may include:
  • multiple candidates
  • retry counts per candidate
  • a maximum number of route attempts
  • weighted selection behavior
This gives platform teams a way to shape runtime execution without forcing application teams to change request code.
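A route plan with a custom rule might look like the sketch below. The dataclass fields mirror the description above, but this is an assumed in-memory shape, not a published Agumbe schema, and the model IDs are examples.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory shape of a route plan; field names mirror the
# description above, not a published Agumbe schema.

@dataclass
class RouteCandidate:
    model: str
    retries: int = 0      # extra attempts for this candidate
    weight: int = 1       # used for weighted selection

@dataclass
class RoutePlan:
    requested_model: str
    kind: str
    candidates: list[RouteCandidate] = field(default_factory=list)
    max_attempts: int = 1

# Example: one preferred model with a retry, plus a lower-weight backup.
plan = RoutePlan(
    requested_model="smart-default",
    kind="chat",
    candidates=[
        RouteCandidate("@anthropic/claude-sonnet-4", retries=1, weight=80),
        RouteCandidate("@openai/gpt-4o", retries=0, weight=20),
    ],
    max_attempts=3,
)
```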

Default routing behavior

If no custom route configuration exists, the gateway uses a straightforward behavior:
  • resolve the requested model
  • use that resolved model as the primary candidate
  • execute one attempt
This keeps the default behavior simple and predictable. For many customers, this is enough in early integration phases.

Custom routing behavior

When a routing rule is configured, the gateway can define multiple candidates for a requested model. That means a single application-facing model name can map to a richer runtime strategy. A routing rule can include:
  • a list of candidate models
  • a retry count for each candidate
  • a weight for candidate selection
  • a maximum number of total attempts
This is useful when you want one logical model entry to support more resilient or more flexible runtime behavior.

Weighted candidate selection

When multiple routing candidates exist, the gateway can use weighted selection to determine candidate ordering. In practical terms, this means:
  • some candidates can be preferred more often than others
  • lower-priority candidates can still remain available as fallback options
  • the gateway does not have to use a rigid fixed order every time
This is useful when you want a preferred model most of the time, but still want traffic distribution or fallback coverage across additional models. Weighted routing is especially helpful for:
  • gradual rollout strategies
  • balancing between quality and cost
  • reliability tuning
  • introducing backup models without making them the primary default
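One way to implement weighted ordering is repeated weighted sampling without replacement, sketched below. The gateway's actual selection algorithm is not specified on this page, so treat this purely as an illustration of the idea.

```python
import random

# One possible weighted-ordering scheme: repeatedly pick a candidate
# with probability proportional to its weight, without replacement.
# This is an illustration; the gateway's real algorithm may differ.
def weighted_order(candidates: list[tuple[str, int]],
                   rng: random.Random) -> list[str]:
    remaining = list(candidates)
    ordered = []
    while remaining:
        models, weights = zip(*remaining)
        pick = rng.choices(models, weights=weights, k=1)[0]
        ordered.append(pick)
        remaining = [c for c in remaining if c[0] != pick]
    return ordered

rng = random.Random(0)  # seeded for reproducibility
order = weighted_order([("primary", 80), ("backup", 20)], rng)
```

With weights 80/20, "primary" usually comes first, but "backup" still leads occasionally, which is what gives you gradual rollout and fallback coverage at the same time.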

Retries

Retries are applied at the route-candidate level. If a candidate is configured with multiple retry attempts, the gateway can retry that same candidate before moving on to the next one. This helps when failures are temporary, such as:
  • transient upstream errors
  • temporary rate limiting
  • short-lived provider instability
Retries improve resilience without immediately switching to a fallback model. The gateway only retries failures that are treated as retryable. In general, retryable failures are upstream execution failures, not local validation or policy errors.
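Per-candidate retry behavior can be sketched as a loop that re-invokes the same candidate only for retryable errors. The exception class and call shape are illustrative, not Agumbe internals.

```python
# Sketch of per-candidate retries: only retryable (upstream) failures
# trigger another attempt. Names are illustrative.

class RetryableError(Exception):
    """Transient upstream failure (e.g. a 429 or 5xx from a provider)."""

def call_with_retries(call, retries: int):
    attempts = 0
    while True:
        attempts += 1
        try:
            return call(), attempts
        except RetryableError:
            if attempts > retries:
                raise  # retry budget for this candidate is exhausted

# Simulated upstream: fails once, then succeeds.
flaky = iter([RetryableError(), "ok"])
def fake_call():
    item = next(flaky)
    if isinstance(item, Exception):
        raise item
    return item

result, attempts = call_with_retries(fake_call, retries=1)
```

A non-retryable error (such as a validation failure) would simply propagate without consuming the retry budget.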

Fallbacks

Fallbacks happen when one candidate fails and the route plan includes another candidate. For example, a route plan may define:
  • one preferred model
  • one or more backup models
If the preferred candidate fails with a retryable error, the gateway can move to the next candidate in the route plan. This gives customers a cleaner production story. Your application still asks for one stable model name, but the gateway can continue working through alternate execution paths when necessary. Fallback behavior is one of the strongest reasons to use aliases in production. It lets you separate the application contract from the runtime execution strategy.
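The fallback walk across candidates can be sketched as follows; the candidate names and error class are illustrative.

```python
# Sketch of fallback across route candidates: try each in order,
# moving to the next only on a retryable upstream failure.

class UpstreamError(Exception):
    pass

def execute_plan(candidates, call):
    last_error = None
    for model in candidates:
        try:
            return call(model)
        except UpstreamError as exc:
            last_error = exc  # retryable: fall through to next candidate
    raise last_error  # every candidate failed

# Simulated upstream: the preferred model is down, the backup works.
def fake_call(model):
    if model == "primary":
        raise UpstreamError("primary unavailable")
    return f"answer from {model}"

result = execute_plan(["primary", "backup"], fake_call)
```

The caller asked for one logical model; the fallback to "backup" is invisible to application code.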

Timeouts

Every request to an upstream provider runs with a timeout. Timeouts are important because production systems cannot wait indefinitely for a response. Even a correct answer becomes operationally expensive if it arrives too slowly for the workload. Agumbe supports:
  • a default request timeout
  • provider-level timeout overrides
  • model-level timeout overrides
This lets teams tune latency expectations more precisely. For example, you may want:
  • shorter timeouts for user-facing workloads
  • slightly longer timeouts for analytical or asynchronous workloads
  • special handling for a specific model that is known to be slower
Timeouts are part of the reliability layer because they prevent stuck or excessively slow upstream requests from degrading the whole system.
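One plausible precedence for the three timeout levels is model override, then provider override, then default, sketched below. The precedence order and all values here are assumptions for illustration.

```python
# Sketch of timeout resolution. The precedence (model override >
# provider override > default) and the values are assumptions.

DEFAULT_TIMEOUT_S = 60
PROVIDER_TIMEOUTS = {"anthropic": 45}                    # example override
MODEL_TIMEOUTS = {"@anthropic/claude-sonnet-4": 90}      # example override

def timeout_for(provider: str, model: str) -> int:
    if model in MODEL_TIMEOUTS:
        return MODEL_TIMEOUTS[model]
    return PROVIDER_TIMEOUTS.get(provider, DEFAULT_TIMEOUT_S)
```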

Circuit breakers

Circuit breakers protect the gateway from repeatedly sending traffic to an unstable upstream target. When enabled for a provider or model, the circuit breaker tracks consecutive failures. If the failure threshold is reached, the circuit opens for a cooldown period. While the circuit is open:
  • the gateway does not continue sending new requests to that route target
  • the request can fail fast or move through other available candidates, depending on the route plan
After the cooldown period, the gateway can try the target again. Circuit breakers help reduce repeated failure storms and improve overall resilience when an upstream dependency is unstable. They are especially useful when:
  • one provider is experiencing a partial outage
  • one model is consistently failing
  • repeated retries would only add latency and error volume
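A minimal consecutive-failure circuit breaker, as described above, can be sketched like this. The threshold, cooldown handling, and class shape are illustrative; the gateway's real implementation may differ.

```python
import time

# Minimal consecutive-failure circuit breaker sketch. Thresholds and
# cooldown semantics in the real gateway may differ.

class CircuitBreaker:
    def __init__(self, threshold: int, cooldown_s: float,
                 clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: close the circuit and try again.
            self.opened_at = None
            self.failures = 0
            return True
        return False  # circuit open: fail fast

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

    def record_success(self):
        self.failures = 0
        self.opened_at = None

# Fixed clock so the example is deterministic.
breaker = CircuitBreaker(threshold=2, cooldown_s=30, clock=lambda: 0.0)
breaker.record_failure()
breaker.record_failure()  # threshold reached: circuit opens
```

While `allow()` returns False, the route plan can either fail fast or move on to other candidates, as described above.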

What reliability protects against

Reliability controls are useful for several common production cases.

Temporary provider instability

If an upstream provider has a transient error, the gateway can retry or fall back instead of immediately failing the request.

Rate limit pressure

If a model or provider returns a retryable rate-limit condition, the gateway can use its configured execution plan rather than forcing application code to handle every case itself.

Slow responses

If an upstream model becomes too slow, timeout rules keep your application from hanging indefinitely.

Repeated failures

If one target continues failing, circuit breakers help stop the gateway from repeatedly sending traffic into a broken path.

What reliability does not replace

Reliability controls improve execution behavior, but they do not remove the need for good application design. You should still:
  • keep your own service timeouts sensible
  • handle gateway errors cleanly
  • monitor latency and failure trends
  • use appropriate app-level guardrails
  • choose stable model aliases for production traffic
The gateway helps centralize execution strategy, but your application should still be built with normal production discipline.

Request flow with routing and reliability

A production request typically follows this order:
  1. authenticate the caller
  2. parse and validate the request
  3. resolve the requested model
  4. select the app policy
  5. load guardrail policy
  6. apply usage controls
  7. build the route plan
  8. apply request-side guardrails
  9. execute the route plan
  10. apply response-side guardrails
  11. log the request
  12. emit usage events
  13. return the response with timing and cost metadata
This order is important because routing and reliability do not exist in isolation. They work together with authentication, guardrails, and observability.

Routing and guardrails

Routing and guardrails are closely connected. For example:
  • a request may resolve to a model that is blocked by an app’s allowed model list
  • a request may be capped by a token policy before it reaches the provider
  • an app-level rate limit may block the request before any routing attempt occurs
This means that a request is not routed only by model logic. It is routed within the boundaries of the app’s policy. That is one reason Agumbe uses app-level guardrails and request context alongside routing logic.

Routing and aliases

Aliases are especially valuable when routing behavior evolves over time. A stable alias such as smart-default gives your application a durable contract. Behind that contract, the gateway can:
  • change the primary model
  • introduce retries
  • add fallback candidates
  • apply weighted candidate selection
  • tune timeouts
  • tune circuit breakers
This is the cleanest way to improve runtime behavior without changing application-facing model names. For most teams, this is the right production pattern: applications use aliases, and the gateway owns route behavior centrally.

Observing routing behavior

Agumbe exposes timing and request metadata that help you understand how the request was processed. Successful responses may include headers such as:
  • x-agumbe-timing-total-ms
  • x-agumbe-timing-model-resolve-ms
  • x-agumbe-timing-guardrail-config-ms
  • x-agumbe-timing-guardrail-input-ms
  • x-agumbe-timing-provider-ms
  • x-agumbe-timing-guardrail-output-ms
  • x-agumbe-timing-request-log-ms
  • x-agumbe-timing-usage-emit-ms
  • x-agumbe-timing-side-effects-ms
  • x-agumbe-timing-gateway-overhead-ms
  • x-agumbe-estimated-cost-usd
These fields make it easier to answer questions such as:
  • how long model resolution took
  • how much time was spent with the upstream provider
  • how much latency came from gateway overhead
  • whether side effects such as logging and usage emission were significant
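Reading those headers is straightforward. The header names below come from the list above; the values are made up for illustration.

```python
# Sketch of interpreting the timing headers listed above. Header names
# come from this page; the values here are invented examples.
headers = {
    "x-agumbe-timing-total-ms": "812",
    "x-agumbe-timing-provider-ms": "640",
    "x-agumbe-timing-gateway-overhead-ms": "172",
    "x-agumbe-estimated-cost-usd": "0.0042",
}

provider_ms = int(headers["x-agumbe-timing-provider-ms"])
overhead_ms = int(headers["x-agumbe-timing-gateway-overhead-ms"])
# Fraction of total latency spent waiting on the upstream provider.
provider_share = provider_ms / int(headers["x-agumbe-timing-total-ms"])
```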
Request logs also record useful fields such as:
  • requested model
  • provider
  • upstream model
  • request status
  • latency
  • token usage
  • estimated cost
  • error code
This gives teams practical visibility into how routing decisions play out in production.

Common routing and reliability errors

A few error patterns appear frequently in this part of the system.

Invalid model

Returned when a model cannot be resolved or does not match the endpoint kind. Example:
{
  "error": {
    "message": "Model embed-default resolves to embeddings and cannot be used with the chat endpoint",
    "type": "invalid_request_error",
    "param": "model",
    "code": "invalid_model"
  }
}

Route unavailable

Returned when no usable route candidate is configured for the requested model. Example:
{
  "error": {
    "message": "No routing candidates are configured for this model",
    "type": "api_error",
    "param": null,
    "code": "route_unavailable"
  }
}

Unsupported provider

Returned when the gateway cannot execute the required capability for the selected provider target.

Request timeout

Returned when the upstream request exceeds the configured timeout window.

Provider error

Returned when the upstream execution fails and the gateway cannot successfully complete the route plan.

Circuit open

Returned when a circuit breaker is currently open for the selected route target.

Best practices

Prefer aliases for production traffic

Aliases give the gateway more room to improve routing and resilience over time without forcing application changes.

Keep routing strategy centralized

Do not push model selection and fallback logic into every application service unless you have a very specific reason to do so.

Tune for workload type

Different workloads need different reliability behavior. User-facing requests may need tighter timeouts. Background workflows may tolerate more retries.

Use observability data

Timing headers and request logs are not just diagnostics. They help you tune route behavior based on real traffic.

Keep fallback chains intentional

More fallback candidates are not always better. A smaller, well-understood route plan is easier to operate than a large, opaque one.

Pair reliability with guardrails and usage controls

A resilient route is only one part of production readiness. It should work together with app-level policies, model controls, and request monitoring. If you are adopting Agumbe AI Gateway for the first time, a strong starting pattern is:
  • use aliases such as smart-default and embed-default
  • begin with simple default routing
  • add retries only where they provide clear value
  • introduce fallbacks deliberately, not automatically everywhere
  • set reasonable timeout expectations for your workload
  • observe real latency and failure behavior before making routing more complex
This gives you a stable foundation without over-engineering your route strategy too early.

Next steps

Once you understand routing and reliability, the next page to read is Request Logging and Observability, where you can see how request execution, timing, usage, and errors show up operationally.