We're investigating API failures

Incident Report for Authzed

Postmortem

Overview

On the 15th of December 2022, the SpiceDB Serverless API experienced a full outage. The on-call engineer was paged immediately by an alert based on API error rates. Our API responses indicated that these errors were caused by connection failures. Our investigation began with the discovery that our cloud provider had just rolled out an update to the underlying nodes in our Kubernetes cluster. Our workloads are all resilient to this type of infrastructure update, which made it clear to us that this change was not the problem itself, but the forcing function that revealed it. After further investigation, we discovered that a certificate used to encrypt traffic between our Ingress and our API service had expired. Our extensive use of HTTP/2 and long-lived connections delayed our discovery of this expiration, since connections that were already established continued to work until the node update forced new ones to be made. After we generated a new certificate, traffic began to flow properly again.
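To illustrate why long-lived connections can hide an expired certificate, here is a minimal toy model in Python. It is a sketch under stated assumptions, not our actual ingress or service code: it assumes the certificate's validity is checked only when a connection is established, and every name and timestamp in it (handshake, send_request, simulate, the dates) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


class CertificateExpired(Exception):
    """Raised when a handshake is attempted with an expired certificate."""


@dataclass
class Connection:
    established_at: datetime


def handshake(now: datetime, cert_not_after: datetime) -> Connection:
    # Assumption: the validity window is checked once, at handshake time.
    if now >= cert_not_after:
        raise CertificateExpired(
            f"certificate expired at {cert_not_after.isoformat()}"
        )
    return Connection(established_at=now)


def send_request(conn: Connection) -> str:
    # An established connection is reused as-is; nothing here re-checks
    # the certificate.
    return "ok"


def simulate() -> list[str]:
    # Hypothetical timestamps; the real expiry date is not part of this report.
    not_after = datetime(2022, 12, 1, tzinfo=timezone.utc)
    before_expiry = datetime(2022, 11, 1, tzinfo=timezone.utc)
    after_expiry = datetime(2022, 12, 15, tzinfo=timezone.utc)

    events = []
    conn = handshake(before_expiry, not_after)
    # The certificate has since expired, but the old connection still works.
    events.append("request on existing connection: " + send_request(conn))
    try:
        # A forced reconnect (for example, after a node upgrade) needs a
        # fresh handshake, which is where the expiry finally surfaces.
        handshake(after_expiry, not_after)
        events.append("reconnect: ok")
    except CertificateExpired:
        events.append("reconnect: failed, certificate expired")
    return events


if __name__ == "__main__":
    for event in simulate():
        print(event)
```

In this model, nothing fails until something forces every client to reconnect at once, which matches the pattern described above.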

We’re sharing these technical details to give our users an understanding of the root cause of the problem, how we addressed it, and what we are doing to prevent similar issues from happening in the future. We would like to reiterate that no user data was lost and no information was accessed by unauthorized parties during the incident.

Timeline

12:23 ET - Initial API Availability alert fires and the on-call is paged
12:24 ET - Kubernetes Node Upgrade is discovered
12:32 ET - Full outage is identified, status page is updated
12:58 ET - Invalid certificate is identified in logs
13:05 ET - Certificate is rotated
13:07 ET - Metric-driven alerts are resolved
13:08 ET - Status page is updated

Closing Thoughts & Next Steps

We'd like to thank our users for their patience, understanding, and support during this time. We'd also like to extend a huge thanks to all of the Authzed employees who worked both in and out of working hours to resolve this incident. While conducting a post-mortem is a blameless process, it does not excuse those involved from taking responsibility. We've reflected on this event and would like to highlight where our process went well and what next steps should be taken to improve it and avoid future issues. Some of these steps have already been implemented.

Things that went well

SREs who were not on call stepped in to help the on-call engineer
We had a playbook for resolving the issue

Things that could be improved

This issue recently occurred on a staging cluster, but we didn’t proactively check other clusters afterwards
We didn’t identify the root cause as quickly as we’d like

Action Items

Set up metric-driven alerts for certificates expiring
Set up inhibitions for metric-driven alerts so that those on-call aren’t distracted by constant paging during an outage
Improve our playbook to include identification steps for this issue
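
As a rough sketch of what a certificate expiry check behind the first action item could look like, the Python below classifies a certificate by how much validity it has left. This is not our alerting implementation. The function names, the 30-day warning window, and the sample dates are all assumptions made for illustration. The only library call used is ssl.cert_time_to_seconds from the Python standard library, which parses the "notAfter" date format that Python's ssl module reports for certificates.

```python
import ssl
from datetime import datetime, timedelta, timezone


def parse_not_after(not_after: str) -> datetime:
    """Parse a certificate "notAfter" string such as "Jan  5 09:34:43 2018 GMT"."""
    seconds = ssl.cert_time_to_seconds(not_after)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def expiry_status(
    not_after: str,
    now: datetime,
    warn_within: timedelta = timedelta(days=30),  # hypothetical threshold
) -> str:
    """Return "expired", "expiring", or "ok" for a certificate at time `now`."""
    remaining = parse_not_after(not_after) - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= warn_within:
        return "expiring"
    return "ok"


if __name__ == "__main__":
    # Hypothetical expiry dates, evaluated at the time the first alert fired.
    now = datetime(2022, 12, 15, 17, 23, tzinfo=timezone.utc)
    for not_after in (
        "Dec 14 00:00:00 2022 GMT",
        "Dec 30 00:00:00 2022 GMT",
        "Dec 15 00:00:00 2023 GMT",
    ):
        print(not_after, "->", expiry_status(not_after, now))
```

A check like this is most useful when it fires on the "expiring" state, well before any handshake starts to fail.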

Posted Dec 15, 2022 - 18:54 UTC

Resolved

We've monitored the service and are confident that this incident has concluded.

Posted Dec 15, 2022 - 18:54 UTC

Monitoring

We've identified the issue and rolled out a fix.
We'll continue to monitor the situation for a few minutes before we close the incident and release a post-mortem.

Posted Dec 15, 2022 - 18:09 UTC

Investigating

We are currently investigating this issue.

Posted Dec 15, 2022 - 17:32 UTC

This incident affected: Serverless Dependencies (SpiceDB Serverless gRPC API).