Services with new nodes stuck in rebuilding phase

Resolved · Partial outage

We have identified the root causes of the incident and implemented a fix that allows nodes to start up normally. As a follow-up action, we will put additional measures in place to prevent a recurrence.

Wed, Dec 11, 2024, 11:43 AM


Affected components

Dec 10, 2024, 06:57 PM

Dec 11, 2024, 11:43 AM

Aiven

Updates

Resolved

Wed, Dec 11, 2024, 11:43 AM

Identified

We have identified the root cause of the issue and are currently working on a permanent solution. In the meantime, we have implemented a workaround to mitigate the problem, and new nodes should now be able to launch successfully.

Wed, Dec 11, 2024, 08:52 AM (2 hours earlier)

Investigating

Unfortunately, we are still seeing DNS update failures and are investigating further.

Our engineers are currently working on a fix.

We apologise for the inconvenience caused by this issue.

Wed, Dec 11, 2024, 05:14 AM (3 hours earlier)

Monitoring

We have identified and fixed the DNS resolution issues that were affecting new nodes.

We are continuing to monitor this closely.

We apologise for the inconvenience caused by this issue.

Wed, Dec 11, 2024, 04:24 AM (49 minutes earlier)

Identified

We are still mitigating the DNS resolution issues, so nodes are stuck in rebuilding longer than usual. You should start to see some improvements soon.

We still recommend against performing any unnecessary actions that could trigger a node replacement until this incident is fully resolved.

Wed, Dec 11, 2024, 03:08 AM (1 hour earlier)

Identified

We have been able to mitigate the DNS resolution issues, so nodes stuck in syncing should now begin to come online.

We still recommend against performing any unnecessary actions that could trigger a node replacement until this incident is fully resolved.

Tue, Dec 10, 2024, 11:41 PM (3 hours earlier)

Investigating

Our incident team has initiated mitigation steps. DNS resolution is being restored progressively across services. You may start seeing improvements, though full restoration will occur gradually across all affected services.

The guidance remains unchanged: please refrain from any actions that could trigger a node replacement. The next update will be provided in 30 minutes.

Tue, Dec 10, 2024, 10:15 PM (1 hour earlier)

Investigating

We continue to make progress in identifying the root cause. Our investigation remains focused on DNS-related issues. The guidance remains unchanged: please refrain from any actions that could trigger a node replacement. The next update will be provided in 30 minutes.

Thank you for your continued patience.

Tue, Dec 10, 2024, 09:21 PM (53 minutes earlier)

Investigating

Our investigation continues to point to DNS, and we are making progress on determining the cause. Our ask remains the same: do not take any actions that may cause a node to be replaced. Please expect an update in 30 minutes.

Tue, Dec 10, 2024, 08:46 PM (35 minutes earlier)

Investigating

We have confirmed that this relates to DNS and are digging deeper into identifying the cause. Our ask remains the same: do not take any actions that may cause a node to be replaced. Please expect another update from us in 30 minutes.

Tue, Dec 10, 2024, 08:15 PM (30 minutes earlier)

Investigating

Our investigation so far indicates that this is related to DNS and that the impact is intermittent. We continue to ask that no action be taken that may cause a node to be replaced. Please expect another update in 30 minutes.

Tue, Dec 10, 2024, 07:38 PM (37 minutes earlier)

Investigating

As our investigation continues, we are finding that any new nodes are failing to start. Please do not issue plan upgrades at this time. Any currently running service will continue to work as expected so long as its nodes are not replaced.

Tue, Dec 10, 2024, 07:10 PM (28 minutes earlier)

Investigating

We are investigating delayed startup of new services and are currently assessing the impact and cause.

We apologise for the inconvenience caused by this issue. We will provide a further update in approximately 30 minutes.

Tue, Dec 10, 2024, 06:57 PM (12 minutes earlier)