Kernel bug on GCP causing CPU lockups
Incident Report for Aiven
Resolved
We have finished scheduling mandatory maintenance updates with a 7-day deadline for services affected by the kernel bug. Updates will happen within the configured maintenance windows.

We would like to thank you for your patience with us during this incident. If you have any questions related to this incident, please reach out to our support at support@aiven.io.
Posted Oct 31, 2022 - 10:35 UTC
Update
We are scheduling mandatory maintenance updates for services affected by the kernel bug. Updates will happen within the configured maintenance windows.
Posted Oct 30, 2022 - 23:48 UTC
Update
Node replacements have now been pushed to most of the affected services.

We are now scheduling a mandatory maintenance update for primary database nodes so that the node replacements will be done during the services' configured maintenance windows.
Posted Oct 28, 2022 - 21:41 UTC
Update
We have performed most of the needed node replacements for the affected services.

We are now scheduling a mandatory maintenance update for primary database nodes so that the node replacements will be done during the services' configured maintenance windows.
Posted Oct 28, 2022 - 13:41 UTC
Update
We have recycled standby nodes; some services are still pending.
Posted Oct 28, 2022 - 05:42 UTC
Monitoring
Node replacements are ongoing across affected services through mandatory maintenance updates. Fleet-wide, we continue to see stability improving from this work.
Posted Oct 28, 2022 - 01:34 UTC
Update
We are still in the process of replacing the nodes of some affected services.

For the remainder of the affected services, we will schedule a mandatory maintenance update. This means that during the next maintenance window, the nodes of these services will be replaced.
Posted Oct 27, 2022 - 12:56 UTC
Update
We have begun to revert the kernel version of some affected services by replacing the nodes of these services.

For the remainder of the affected services, we will schedule a mandatory maintenance update. This means that during the next maintenance window, the nodes of these services will be replaced.
Posted Oct 27, 2022 - 11:42 UTC
Update
We are making final preparations before reverting the kernel version of some affected services. This revert operation will replace the nodes of these affected services. Please be aware that the node replacements may happen outside of your configured maintenance windows.
Posted Oct 27, 2022 - 10:35 UTC
Update
We are still in the process of reverting the impacted services to a known stable kernel version.

New services and/or nodes will not be impacted by this kernel bug as they are already running on the stable kernel version.
Posted Oct 27, 2022 - 09:29 UTC
Update
We are still seeing existing services being impacted by this kernel defect. We plan to revert impacted services to a kernel that is known to be stable.

New services and/or nodes will not be impacted by this kernel bug as they are already running on the stable kernel version.
Posted Oct 27, 2022 - 05:19 UTC
Update
We have now applied the patch to all impacted services to mitigate the issues caused by this incident.

We are now in the process of implementing a permanent fix to resolve this incident.
Posted Oct 27, 2022 - 00:24 UTC
Identified
We have recently identified a kernel defect that, with specific swap configurations, may cause nodes to fail over. A patch is currently being applied to all affected services.
Posted Oct 26, 2022 - 19:53 UTC