Serverless Platform Ingress outage

Incident Report for Authzed

Resolved

A configuration change was applied to the SpiceDB Serverless platform's primary network ingress.
The configuration was immediately noticed and rolled back as soon as possible.

There was approximately 7 minutes of outage between 9:10PM EDT and 9:17PM EDT.
The root cause was a slight version difference between Contour in our production and staging environments.
Resolution was delayed by a few minor but vital workflows that collectively consumed time:
- External ingress health-checking that runs on a larger interval than our internal health-checks
- Lack of log retention of crash-looped pods Kubernetes making `kubectl logs` workflow less ideal

We will be improving our process for vetting configuration to make sure that this cannot possibly happen again.

Posted Sep 13, 2022 - 01:10 UTC