We're investigating API failures
Incident Report for Authzed
Postmortem

Overview

On the 15th of December 2022, the SpiceDB Serverless API experienced a full outage. The on-call engineer was paged immediately due to an alert based on API error rates. Our API responses indicated that these errors were caused by connection failures. Our investigation began with the discovery that our cloud provider had just rolled out an update to the underlying nodes in our Kubernetes cluster. Our workloads are all resilient to this type of infrastructure update, which made it clear that this change was not the problem itself, but instead the forcing function that revealed it. After further investigation, we discovered that a certificate used to encrypt traffic between our Ingress and our API service had expired. Our extensive use of the HTTP/2 protocol and long-lived connections delayed our discovery of this expiration. After we generated a new certificate, traffic began to flow properly again.
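
For readers unfamiliar with how this kind of expiry surfaces, below is a minimal sketch in Go of how one might inspect the certificate a service presents and how long it has left before expiring. This is illustrative only, not the tooling we used during the incident, and the endpoint address is a placeholder.

```go
// Minimal sketch: connect to a TLS endpoint and report when the certificate
// it presents expires. The address below is a placeholder, not a real endpoint.
package main

import (
	"crypto/tls"
	"fmt"
	"log"
	"time"
)

func main() {
	const addr = "ingress.example.internal:443" // hypothetical internal endpoint

	conn, err := tls.Dial("tcp", addr, &tls.Config{
		// InsecureSkipVerify lets us inspect an already-expired certificate
		// instead of failing during the handshake.
		InsecureSkipVerify: true,
	})
	if err != nil {
		log.Fatalf("TLS dial failed: %v", err)
	}
	defer conn.Close()

	// The leaf certificate is the first one in the presented chain.
	leaf := conn.ConnectionState().PeerCertificates[0]
	remaining := time.Until(leaf.NotAfter)
	fmt.Printf("subject=%s notAfter=%s remaining=%s\n",
		leaf.Subject.CommonName, leaf.NotAfter.Format(time.RFC3339), remaining)
	if remaining <= 0 {
		fmt.Println("certificate has EXPIRED")
	}
}
```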

We’re sharing these technical details to give our users an understanding of the root cause of the problem, how we addressed it, and what we are doing to prevent similar issues from happening in the future. We would like to reiterate that no user data was lost or accessed by unauthorized parties during the incident.

Timeline

  • 12:23 ET - Initial API Availability alert fires and the on-call is paged
  • 12:24 ET - Kubernetes Node Upgrade is discovered
  • 12:32 ET - Full outage is identified, status page is updated
  • 12:58 ET - Invalid certificate is identified in logs
  • 13:05 ET - Certificate is rotated
  • 13:07 ET - Metric-driven alerts are resolved
  • 13:08 ET - Status page is updated

Closing Thoughts & Next Steps

We'd like to thank our users for their patience, understanding, and support during this time. We'd also like to extend a huge thanks to all of the Authzed employees who worked both in and out of working hours to resolve this incident. While conducting a post-mortem is a blameless process, it does not excuse those involved from taking responsibility. We've reflected on this event and would like to highlight where our processes went well and what next steps should be taken to improve them and avoid future issues, some of which have already been implemented.

Things that went well

  • SREs who were not actively on-call rose to the occasion to help out the on-call engineer
  • We had a playbook for resolving the issue

Things that could be improved

  • This issue recently occurred on a staging cluster, but we didn’t proactively check our other clusters afterwards
  • We didn’t identify the root cause as quickly as we’d like

Action Items

  • Set up metric-driven alerts for expiring certificates (see the sketch after this list)
  • Set up inhibitions for metric-driven alerts so that those on-call aren’t distracted by constant paging during an outage
  • Improve our playbook to include identification steps for this issue
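
As a rough illustration of the first action item, the following Go sketch exposes a gauge with the seconds remaining until a certificate on disk expires, which a metric-driven alert could then threshold on well before expiry. This is not our production exporter; the certificate path and metric name are hypothetical, and off-the-shelf certificate exporters cover the same need.

```go
// Rough sketch: read a PEM certificate from disk and expose the seconds
// remaining until it expires as a Prometheus gauge. Path and metric name
// are placeholders.
package main

import (
	"crypto/x509"
	"encoding/pem"
	"errors"
	"log"
	"net/http"
	"os"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var certExpirySeconds = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "ingress_certificate_expiry_seconds", // hypothetical metric name
	Help: "Seconds until the ingress TLS certificate expires.",
})

// readNotAfter parses the certificate at path and returns its expiry time.
func readNotAfter(path string) (time.Time, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return time.Time{}, err
	}
	block, _ := pem.Decode(data)
	if block == nil {
		return time.Time{}, errors.New("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return time.Time{}, err
	}
	return cert.NotAfter, nil
}

func main() {
	prometheus.MustRegister(certExpirySeconds)

	const certPath = "/etc/tls/tls.crt" // placeholder path

	// Refresh the gauge periodically; an alert rule can fire when the value
	// drops below a comfortable threshold (e.g. two weeks).
	go func() {
		for {
			if notAfter, err := readNotAfter(certPath); err == nil {
				certExpirySeconds.Set(time.Until(notAfter).Seconds())
			} else {
				log.Printf("reading certificate: %v", err)
			}
			time.Sleep(time.Minute)
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```
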
Posted Dec 15, 2022 - 18:54 UTC

Resolved
We've monitored the service and are confident that this incident has concluded.
Posted Dec 15, 2022 - 18:54 UTC
Monitoring
We've identified the issue and rolled out a fix.
We'll continue to monitor the situation for a few minutes before we close the incident and release a post-mortem.
Posted Dec 15, 2022 - 18:09 UTC
Investigating
We are currently investigating this issue.
Posted Dec 15, 2022 - 17:32 UTC
This incident affected: Serverless Dependencies (SpiceDB Serverless gRPC API).