Write-up
Executive Summary

Between 2022-06-27 11:15 UTC and 2022-06-27 13:05 UTC, Aiven customers were unable to interact with the Aiven platform via the Aiven Console or the Aiven API, including the Aiven Terraform provider and the Aiven Operator for Kubernetes. Customers would have received HTTP 526 error responses when using the Aiven Console, and untrusted certificate errors when interacting with the API directly. We sincerely apologize for the impact caused to our customers.

The cause was an outdated certificate renewal configuration, which made the attempt to generate TLS certificates for the api.aiven.io domain using Let’s Encrypt fail. As a fallback, our internal certificate authority (CA) issued the certificates, which are not trusted by external clients, including our CDN. Once the configuration was updated and had propagated, the renewal was retried, and the correct certificates were generated and served, restoring service.

Who was affected?

Aiven customers and partners interacting with the Aiven Console and Aiven API via the api.aiven.io endpoint.

What happened?

At 2022-06-27 11:07 UTC, we deployed a new release of our API to production as part of our regular deployment process. During this process, old API nodes are replaced with new nodes. When the new nodes boot, they check whether new TLS certificates are needed for the configured domains, similar to the Custom Domains feature.

A new certificate is only generated if the old certificate expires in less than 14 days. At the time of the deployment, the certificate used for api.aiven.io was due to expire at 2022-07-09 03:07:11 UTC, approximately 12 days later, which triggered the renewal logic.
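The renewal window check can be illustrated with a minimal sketch. The 14-day threshold and both timestamps come from this write-up; the function and constant names are hypothetical and do not reflect Aiven's actual implementation.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the write-up; all names here are illustrative only.
RENEWAL_WINDOW = timedelta(days=14)


def needs_renewal(not_after: datetime, now: datetime) -> bool:
    """Return True when the certificate expires within the renewal window."""
    return not_after - now < RENEWAL_WINDOW


expiry = datetime(2022, 7, 9, 3, 7, 11, tzinfo=timezone.utc)
deployed = datetime(2022, 6, 27, 11, 7, tzinfo=timezone.utc)

remaining = expiry - deployed
print(remaining)                        # 11 days, 16:00:11
print(needs_renewal(expiry, deployed))  # True
```

With just under 12 days of validity left, the certificate falls inside the 14-day window, so a node booting at deployment time attempts a renewal.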

The renewal logic attempts to renew certificates for all domains listed in the configuration. One of the configured domains was console.aiven.io, which we had moved to a CDN earlier in June 2022; the CDN now handles certificate renewal and TLS termination for that domain. Because the HTTP-01 token path for console.aiven.io was invalid, the HTTP-01 challenge for this set of domains failed, and the certificate could not be renewed.
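A rough sketch of why one stale domain blocks renewal for the whole set is shown here. The challenge path prefix is the one defined for HTTP-01 by the ACME specification (RFC 8555); the helper names, the domain list, and the assumption that the API node can no longer serve the token for a CDN-fronted domain are illustrative, not taken from Aiven's code.

```python
from __future__ import annotations

from typing import Callable, Iterable

# Path prefix defined by the ACME specification (RFC 8555) for HTTP-01.
ACME_CHALLENGE_PREFIX = "/.well-known/acme-challenge/"


def challenge_url(domain: str, token: str) -> str:
    """URL the CA fetches over plain HTTP to validate control of `domain`."""
    return f"http://{domain}{ACME_CHALLENGE_PREFIX}{token}"


def failing_domains(
    domains: Iterable[str], serves_token: Callable[[str], bool]
) -> list[str]:
    """Return the domains whose challenge this node cannot answer."""
    return [d for d in domains if not serves_token(d)]


# Hypothetical configuration at the time of the incident.
configured = ["api.aiven.io", "console.aiven.io"]

# Assumption: requests for console.aiven.io terminate at the CDN,
# so the API node cannot serve the challenge token for that domain.
behind_cdn = {"console.aiven.io"}

failed = failing_domains(configured, lambda d: d not in behind_cdn)
renewal_possible = not failed  # every domain in the set must validate

print(failed)            # ['console.aiven.io']
print(renewal_possible)  # False
```

Because the certificate covers the whole set of configured domains, a single domain that cannot be validated is enough to fail the renewal for api.aiven.io as well.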

The fallback behavior when Let’s Encrypt issuance fails is to use our internal CA to issue the certificate. Because this certificate is not trusted by external clients, customers received HTTP 526 errors when using the Aiven Console and untrusted certificate errors when interacting with the API directly.
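The fallback described above might look roughly like the following sketch. The issuer callables, the `RenewalError` exception, and the returned labels are hypothetical stand-ins for illustration, not Aiven's real interfaces.

```python
from typing import Callable, Sequence, Tuple


class RenewalError(Exception):
    """Raised by the hypothetical public-CA client when issuance fails."""


def obtain_certificate(
    domains: Sequence[str],
    issue_public: Callable[[Sequence[str]], str],
    issue_internal: Callable[[Sequence[str]], str],
) -> Tuple[str, str]:
    """Try the public CA first; on failure, fall back to the internal CA."""
    try:
        return "lets-encrypt", issue_public(domains)
    except RenewalError:
        # The internal CA always succeeds, but external clients
        # (browsers, API clients, the CDN) do not trust its certificates.
        return "internal-ca", issue_internal(domains)


def failing_public(domains: Sequence[str]) -> str:
    raise RenewalError("HTTP-01 challenge failed for console.aiven.io")


def internal(domains: Sequence[str]) -> str:
    return "certificate signed by internal CA"


issuer, cert = obtain_certificate(
    ["api.aiven.io", "console.aiven.io"], failing_public, internal
)
print(issuer)  # internal-ca
```

The sketch shows the weakness of this design: the fallback silently replaces a still-valid, publicly trusted certificate with one that external clients reject.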

At 2022-06-27 12:47 UTC, the issue was identified. At 2022-06-27 13:05 UTC, our engineers finished rolling out an updated configuration that removed the console.aiven.io domain, and service was fully restored shortly after. After a period of monitoring, the incident was closed at 2022-06-27 13:32 UTC.

What will Aiven do to mitigate problems like this in the future?

We understand that the availability of the Aiven platform is important to our customers. To mitigate problems like this in the future, we will:

  • Add more thorough test cases for our certificate renewal logic.

  • Change the renewal fallback logic to use an existing valid certificate if one is found, instead of immediately using the internal CA.

  • Improve monitoring to alert on-call engineers to failures in certificate renewal.
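One possible shape for the revised fallback and alerting is sketched below, purely as an illustration. The `Certificate` type, the callables, and the `alert` hook are assumptions introduced for this example; the actual changes may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Optional, Sequence


@dataclass
class Certificate:
    issuer: str
    not_after: datetime


class RenewalError(Exception):
    """Raised by the hypothetical public-CA client when issuance fails."""


def choose_certificate(
    domains: Sequence[str],
    now: datetime,
    existing: Optional[Certificate],
    issue_public: Callable[[Sequence[str]], Certificate],
    issue_internal: Callable[[Sequence[str]], Certificate],
    alert: Callable[[str], None],
) -> Certificate:
    """Prefer a fresh public certificate, then a still-valid existing one."""
    try:
        return issue_public(domains)
    except RenewalError as exc:
        # Surface the failure to on-call engineers instead of failing silently.
        alert(f"certificate renewal failed for {list(domains)}: {exc}")
        if existing is not None and existing.not_after > now:
            return existing  # keep serving the trusted certificate
        return issue_internal(domains)  # last resort only


def failing_public(domains: Sequence[str]) -> Certificate:
    raise RenewalError("HTTP-01 challenge failed")


def internal(domains: Sequence[str]) -> Certificate:
    return Certificate("internal-ca", datetime(2023, 1, 1, tzinfo=timezone.utc))


alerts: List[str] = []
current = Certificate(
    "lets-encrypt", datetime(2022, 7, 9, 3, 7, 11, tzinfo=timezone.utc)
)
now = datetime(2022, 6, 27, 11, 7, tzinfo=timezone.utc)

chosen = choose_certificate(
    ["api.aiven.io"], now, current, failing_public, internal, alerts.append
)
print(chosen.issuer)  # lets-encrypt
print(len(alerts))    # 1
```

Under these assumptions, the failed renewal on 2022-06-27 would have left the existing certificate in place for its remaining validity while an alert gave engineers time to fix the configuration.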