Reliability patterns for production
Leo Fischer, Site Reliability Lead, Synor

AI systems often fail in production for ordinary reasons: timeouts, retries, schema drift, connector throttling, and missing guardrails. When that happens, teams blame the model, but reliability is usually determined by workflow engineering. The goal is predictable behavior when dependencies are slow, inputs are messy, and volume changes.
Treat the model as a dependency with latency, error rate, and capacity limits. Use timeouts per step, not just overall. Add circuit breakers so one slow integration does not cascade into a full outage. Retry only when it is safe, and make sure repeated runs cannot create duplicate outcomes.
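As a sketch of per-step timeouts plus a circuit breaker, the snippet below wraps a single workflow step; the `CircuitBreaker` class and `call_with_timeout` helper are illustrative names, not part of any specific library:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; half-opens after `cooldown` seconds."""
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: after the cooldown, let one probe call through.
        return time.monotonic() - self.opened_at >= self.cooldown

    def record(self, ok):
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

def call_with_timeout(fn, timeout_s, breaker, *args):
    """Run one workflow step with its own timeout, guarded by a breaker."""
    if not breaker.allow():
        raise RuntimeError("circuit open: skipping call")
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        result = future.result(timeout=timeout_s)
        breaker.record(ok=True)
        return result
    except FutureTimeout:
        breaker.record(ok=False)
        raise TimeoutError(f"step exceeded {timeout_s}s")
    finally:
        # Return control promptly; note the worker thread may keep running,
        # so real systems should prefer cancellable I/O with native deadlines.
        pool.shutdown(wait=False, cancel_futures=True)
```

The point is that each step carries its own deadline, and repeated failures stop traffic to the dependency instead of queuing more work behind it.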
Idempotency matters at scale. A workflow will run twice eventually, due to retries or duplicate events. If a second run creates a second outcome, you will spend weeks cleaning up data and trust.
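A minimal sketch of the deduplication-key idea: derive a deterministic key from the event and the step, and record outcomes under it so a replay returns the stored result instead of acting twice. `OutcomeStore` here is an in-memory stand-in; a real system needs a durable store with a unique constraint on the key so concurrent runs cannot both pass the check:

```python
import hashlib

class OutcomeStore:
    """In-memory stand-in for a durable store keyed by deduplication key."""
    def __init__(self):
        self._done = {}

    def run_once(self, key, action):
        # If this key was already processed, return the recorded outcome
        # instead of executing the side effect again.
        if key in self._done:
            return self._done[key]
        result = action()
        self._done[key] = result
        return result

def dedup_key(event_id, step_name):
    """Deterministic key: the same event and step always map to the same key."""
    return hashlib.sha256(f"{event_id}:{step_name}".encode()).hexdigest()
```

With this shape, a retried event or duplicate delivery converges on the first run's outcome rather than producing a second ticket, email, or record.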
Fast wins that reduce incidents
• Step-level timeouts and circuit breakers
• Durable state transitions and deduplication keys
• Review queue for uncertain or high-risk outputs
• Canary rollout for model and prompt changes
• Drift monitoring tied to alerting
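For the last item, one simple way to tie drift to alerting is a population stability index (PSI) over binned output statistics (lengths, scores, label frequencies). This is a sketch with illustrative function names and a commonly cited alert threshold of 0.2; the bins and threshold are assumptions to tune per workload:

```python
import math
from collections import Counter

def psi(baseline, current, bins):
    """Population stability index between two samples over fixed bins.
    Values above ~0.2 are commonly treated as significant drift."""
    def dist(sample):
        counts = Counter()
        for x in sample:
            for i, (lo, hi) in enumerate(bins):
                if lo <= x < hi:
                    counts[i] += 1
                    break
        total = max(len(sample), 1)
        # Floor each bin at a tiny probability so the log stays defined.
        return [max(counts[i] / total, 1e-6) for i in range(len(bins))]
    b, c = dist(baseline), dist(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))

def should_alert(baseline, current, bins, threshold=0.2):
    """Fire an alert when the current window has drifted past the threshold."""
    return psi(baseline, current, bins) > threshold
```

In practice the baseline is a frozen sample from a known-good period, and `current` is a rolling window, so the alert fires on shifts rather than noise.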
Finally, ship AI changes like real releases. Use a small cohort canary, compare output distributions, and keep rollback paths ready. Trust is earned when the system behaves calmly under pressure.
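The small-cohort canary above needs stable assignment: the same user should always land in the same cohort so comparisons are apples to apples. A hash-based sketch, with an assumed `salt` naming the rollout so cohorts reshuffle between experiments:

```python
import hashlib

def in_canary(user_id, percent, salt="model-v2-rollout"):
    """Stable cohort assignment: hash the salted user id into [0, 100)."""
    h = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(h[:8], 16) % 100 < percent
```

Because assignment is a pure function of id and salt, no cohort table is needed, and rollback is just routing 0% to the new path.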


