On the 15th of December 2022, the SpiceDB Serverless API experienced a full outage. The on-call engineer was paged immediately by an alert based on API error rates. Our API responses indicated that these errors were caused by connection failures. Our investigation began with the discovery that our cloud provider had just rolled out an update to the underlying nodes in our Kubernetes cluster. Our workloads are all resilient to this type of infrastructure update, which made it clear to us that this change was not the problem itself, but the forcing function that revealed it. After further investigation, we discovered that a certificate used to encrypt traffic between our Ingress and our API service had expired. Our extensive use of HTTP/2 and long-lived connections delayed our discovery of this expiration: a certificate is validated only when a connection is established, so connections opened before the expiry kept working until the node update forced them to be re-established. Once we generated a new certificate, traffic began to flow properly again.
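To illustrate the kind of check that can surface an expiring certificate before connections need to be re-established, here is a minimal sketch using only the Python standard library. It is not the tooling we run in production; the function names and the 14-day threshold are assumptions chosen for illustration. The `not_after` string is assumed to be in the format that Python's `ssl` module reports for a certificate's `notAfter` field.

```python
import ssl
from datetime import datetime, timedelta, timezone


def time_until_expiry(not_after: str, now: datetime) -> timedelta:
    """Return how long remains before the certificate expires.

    `not_after` uses the format reported by the ssl module,
    e.g. "Dec 15 00:00:00 2022 GMT". The result is negative
    if the certificate has already expired.
    """
    expires_at = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return expires_at - now


def needs_renewal(
    not_after: str,
    now: datetime,
    threshold: timedelta = timedelta(days=14),  # hypothetical alerting window
) -> bool:
    """True if the certificate expires within `threshold` or has already expired."""
    return time_until_expiry(not_after, now) <= threshold


if __name__ == "__main__":
    # Hypothetical expiry date, used purely as example input.
    example_not_after = "Dec 15 00:00:00 2022 GMT"
    now = datetime.now(timezone.utc)
    print("needs renewal:", needs_renewal(example_not_after, now))
```

A check like this could run on a schedule and feed an alert, so that expiry is noticed independently of whether any client happens to reconnect.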
We’re sharing these technical details to give our users an understanding of the root cause of the problem, how we addressed it, and what we are doing to prevent similar issues in the future. We would like to reiterate that no user data was lost and no information was accessed by unauthorized parties during the incident.
We'd like to thank our users for their patience, understanding, and support during this time. We'd also like to extend a huge thanks to all of the Authzed employees who worked both in and out of working hours to resolve this incident. While conducting a post-mortem is a blameless process, it does not excuse those involved from taking responsibility. We've reflected on this event and would like to highlight where our process went well and what next steps should be taken to improve it and avoid future issues, some of which have already been implemented.