Aiven incident: Some Aiven Kafka clusters stopped reporting Prometheus metrics
Resolved · Partial outage

Affected components
Aiven
Updates

Resolved

We can confirm that the incident preventing some Kafka clusters from emitting their metrics via Prometheus is now resolved.

Should any further issues be encountered, please reach out to support via the Aiven console.

We apologise for the inconvenience caused by this issue, and commit to analysing and improving our platform to avoid future incidents similar to this one.

Thu, Dec 5, 2024, 04:09 PM

Monitoring

A fix has been deployed, and preventative maintenance will be scheduled for impacted and potentially impacted services.

Thu, Dec 5, 2024, 05:44 AM (10 hours earlier)

Monitoring

We are still closely monitoring the Aiven fleet of Kafka clusters. So far, no new services have been affected. We will continue to monitor and to work on a permanent fix.

Wed, Dec 4, 2024, 03:41 PM (14 hours earlier)

Monitoring

We have applied a patch to all Aiven Kafka services which previously stopped emitting metrics to their Prometheus endpoints.

We are continuing to closely monitor the affected services, and we are confident that the situation will now gradually return to normal. If in doubt, feel free to reach out to Aiven Support.

Wed, Dec 4, 2024, 02:47 PM (53 minutes earlier)

Identified

We have identified the root cause of the partial outage affecting Prometheus metric reporting. We have also identified the 7 impacted services.

We have prepared an emergency procedure to patch those services. As we apply this patch, we will also proactively reach out to impacted customers.

We will provide further regular updates about this issue.

Wed, Dec 4, 2024, 01:53 PM (53 minutes earlier)

Investigating

We are currently investigating a partial outage affecting Prometheus metric reporting of some Aiven Kafka clusters. As far as we can tell, Datadog users are not impacted.

We apologise for the inconvenience caused by this issue. We will provide regular updates on our progress in resolving it.

Wed, Dec 4, 2024, 01:27 PM (26 minutes earlier)