No components marked as affected
Resolved
We concluded that Cassandra instances with a Replication Factor (RF) of 1 were prone to data loss: with RF = 1 there is only a single copy of the data, so it can be lost if the node goes away suddenly.
We recommend that customers set RF > 1 on their Cassandra instances, as this is the best practice.
It is also generally advised to use the NetworkTopologyStrategy replication strategy when creating keyspaces, so that replicas of your data are distributed across nodes in different Availability Zones (AZs) within the service's selected cloud.
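For reference, a keyspace following this advice declares NetworkTopologyStrategy with a per-datacenter RF greater than 1. The sketch below builds such a CREATE KEYSPACE statement; the keyspace name `my_keyspace` and datacenter name `aws-eu-west-1` are illustrative placeholders, not names from this incident.

```python
# Sketch: build a CQL CREATE KEYSPACE statement using
# NetworkTopologyStrategy with RF > 1 per datacenter.
# Keyspace and datacenter names below are placeholders.

def create_keyspace_cql(keyspace: str, rf_per_dc: dict) -> str:
    """Return a CQL CREATE KEYSPACE statement for the given per-DC RFs."""
    replication = ", ".join(
        f"'{dc}': {rf}" for dc, rf in sorted(rf_per_dc.items())
    )
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {replication}}};"
    )

# RF = 3 in a single datacenter keeps the data available even if
# one node goes away suddenly.
print(create_keyspace_cql("my_keyspace", {"aws-eu-west-1": 3}))
```

With RF = 3, losing a single node leaves two replicas of every row, which is why clusters with RF > 1 were not exposed to this issue.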
Identified
We have been able to reproduce the issue internally and are continuing to work on a permanent fix. Having a replication factor of 1 may contribute to the issue. Automatic maintenance updates have been disabled as an initial mitigation.
We still recommend that customers NOT perform any plan changes, region changes, or manual maintenance until the incident is over, as these actions might cause the issue to affect the cluster.
Identified
We have isolated the issue: it affects Cassandra clusters with a replication factor of 1 during plan, region, or maintenance updates. Clusters with a higher replication factor are not affected. We are continuing to work on a fix for any clusters still affected.
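To check whether a cluster falls into the affected group, the per-keyspace replication settings can be read from `system_schema.keyspaces`. The helper below is a sketch that flags keyspaces whose replication map contains a factor of 1; the sample keyspace names and replication maps are illustrative, not taken from this incident.

```python
# Sketch: flag keyspaces at risk because a replication factor is 1.
# On a live cluster the replication maps would come from:
#   SELECT keyspace_name, replication FROM system_schema.keyspaces;

def at_risk_keyspaces(keyspaces: dict) -> list:
    """Return names of keyspaces where any datacenter (or the
    SimpleStrategy total) has a replication factor of 1."""
    risky = []
    for name, replication in keyspaces.items():
        factors = [
            int(v) for k, v in replication.items()
            if k != "class"  # skip the strategy-class entry
        ]
        if any(rf == 1 for rf in factors):
            risky.append(name)
    return sorted(risky)

# Illustrative replication maps for two keyspaces:
sample = {
    "orders": {"class": "NetworkTopologyStrategy", "aws-eu-west-1": "3"},
    "legacy": {"class": "SimpleStrategy", "replication_factor": "1"},
}
print(at_risk_keyspaces(sample))  # only 'legacy' has RF = 1
```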
Identified
We have been able to reproduce the issue internally and are continuing to work on a permanent fix. Automatic maintenance updates have been disabled as an initial mitigation. We also recommend that customers NOT perform any plan changes, region changes, or manual maintenance until the incident is over, as these actions might cause the issue to affect the cluster.
Identified
We have identified an issue that prevents our Cassandra services from receiving data after a maintenance update or version upgrade. As a mitigation, we have disabled automatic updates on all Cassandra services while we investigate. We also encourage customers not to perform manual actions such as a plan change or region change on their Cassandra services, as these may also cause the issue to affect the cluster.