A configuration change was applied to the SpiceDB Serverless platform's primary network ingress. The resulting outage was noticed immediately, and the change was rolled back as soon as possible.
The outage lasted approximately 7 minutes, from 9:10PM EDT to 9:17PM EDT. The root cause was a slight version difference between the Contour deployments in our production and staging environments. Resolution was delayed by a few minor but vital workflows that collectively consumed time:

- External ingress health-checking that runs on a longer interval than our internal health-checks
- Lack of log retention for crash-looped pods in Kubernetes, which makes the `kubectl logs` workflow less than ideal
We will be improving our process for vetting configuration changes so that this kind of outage does not happen again.
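
As an illustration only, and not a description of our actual tooling, the kind of version drift behind this incident could be caught by comparing component image tags across environments before a change is promoted. The sketch below assumes the image references for each environment have already been collected into dictionaries; the function names, registry, components, and tags are all hypothetical.

```python
from __future__ import annotations


def image_tag(image_ref: str) -> str:
    """Return the tag portion of an image reference, or '' if it has no tag."""
    # Drop any digest suffix ("@sha256:..."), then inspect only the final
    # path segment so that a registry port ("host:5000/...") is not
    # mistaken for a tag.
    name = image_ref.split("@", 1)[0]
    last_segment = name.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return ""
    return last_segment.rsplit(":", 1)[1]


def find_version_drift(
    staging: dict[str, str], production: dict[str, str]
) -> dict[str, tuple[str, str]]:
    """Map each component whose tag differs to (staging_tag, production_tag).

    A component missing from one environment is reported with '' as its tag.
    """
    drift = {}
    for component in sorted(set(staging) | set(production)):
        staging_tag = image_tag(staging.get(component, ""))
        production_tag = image_tag(production.get(component, ""))
        if staging_tag != production_tag:
            drift[component] = (staging_tag, production_tag)
    return drift


if __name__ == "__main__":
    # Hypothetical image references, for illustration only.
    staging = {"contour": "registry.example/contour:v1.0.1"}
    production = {"contour": "registry.example/contour:v1.0.0"}
    for component, (s_tag, p_tag) in find_version_drift(staging, production).items():
        print(f"{component}: staging={s_tag!r} production={p_tag!r}")
```

A check like this only compares tags as strings; it does not decide which differences are safe, so any reported drift would still need a human to review it.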