Kernel bug on GCP causing CPU lockups
Resolved

We have finished scheduling mandatory maintenance updates with a 7-day deadline for services affected by the kernel bug. Updates will happen within the configured maintenance windows.

We would like to thank you for your patience with us during this incident. If you have any questions related to this incident, please reach out to our support at support@aiven.io.

Mon, Oct 31, 2022, 10:35 AM
Affected components

No components marked as affected

Updates

Monitoring

We are scheduling mandatory maintenance updates for services affected by the kernel bug. Updates will happen within the configured maintenance windows.

Sun, Oct 30, 2022, 11:48 PM (10 hours earlier)

Monitoring

Node replacements have now been pushed to most of the affected services.

We are now scheduling a mandatory maintenance update for primary database nodes so that the node replacements will be done during the services' configured maintenance windows.

Fri, Oct 28, 2022, 09:41 PM (2 days earlier)

Monitoring

We have performed most of the needed node replacements for the affected services.

We are now scheduling a mandatory maintenance update for primary database nodes so that the node replacements will be done during the services' configured maintenance windows.

Fri, Oct 28, 2022, 01:41 PM (7 hours earlier)

Monitoring

We have recycled the standby nodes; some services are still pending.

Fri, Oct 28, 2022, 05:42 AM (7 hours earlier)

Monitoring

Node replacements are ongoing across affected services through mandatory maintenance updates. Fleet-wide, we continue to see stability improving from this work.

Fri, Oct 28, 2022, 01:34 AM (4 hours earlier)

Identified

We are still in the process of replacing the nodes of some affected services.

For the remaining affected services, we will schedule a mandatory maintenance update. This means that the nodes of these services will be replaced during their next maintenance window.

Thu, Oct 27, 2022, 12:56 PM (12 hours earlier)

Identified

We have begun to revert the kernel version of some affected services by replacing the nodes of these services.

For the remaining affected services, we will schedule a mandatory maintenance update. This means that the nodes of these services will be replaced during their next maintenance window.

Thu, Oct 27, 2022, 11:42 AM (1 hour earlier)

Identified

We are making final preparations before reverting the kernel version of some affected services. This revert operation will replace the nodes of these affected services. Please be aware that the node replacements may happen outside of your configured maintenance windows.

Thu, Oct 27, 2022, 10:35 AM (1 hour earlier)

Identified

We are still in the process of reverting the impacted services to a known stable kernel version.

New services and/or nodes will not be impacted by this kernel bug as they are already running on the stable kernel version.

Thu, Oct 27, 2022, 09:29 AM (1 hour earlier)

Identified

We are still seeing existing services being impacted by this kernel defect. We plan to revert impacted services to a kernel that is known to be stable.

New services and/or nodes will not be impacted by this kernel bug as they are already running on the stable kernel version.

Thu, Oct 27, 2022, 05:19 AM (4 hours earlier)

Identified

We have now applied the patch to all impacted services to mitigate the issues caused by this incident.

We are now in the process of implementing a permanent fix to resolve this incident.

Thu, Oct 27, 2022, 12:24 AM (4 hours earlier)

Identified

We have identified a kernel defect that, with specific swap configurations, may cause nodes to fail over. A patch is currently being applied to all affected services.

Wed, Oct 26, 2022, 07:53 PM (4 hours earlier)