Replication
Overview
This document describes how replicas of Replicated PV Mayastor volumes are managed, covering provisioning, fault tolerance, recovery, and scaling. It explains the replication process, how the control plane behaves when replicas fail, and the reconciliation loop that keeps data consistent and available. It also walks through scenarios involving replica placement, failure handling, and volume scaling, in which the control plane applies its replica management rules to maintain the desired volume state.
Basics
When provisioning a Replicated PV Mayastor volume with a replication factor greater than one (specified by the repl parameter in the StorageClass), the control plane maintains the required number of identical data replicas (also known as "children") using a reconciliation loop similar to Kubernetes behavior. Upon volume creation, the control plane attempts to create the necessary replicas and position them within the cluster following internal heuristics. If successful, the volume becomes available and binds with the associated PVC.
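For example, a StorageClass along the following lines requests two replicas per volume through the repl parameter. This is a minimal sketch: the StorageClass name is illustrative, and the provisioner and protocol values assume a default Replicated PV Mayastor installation.

```bash
# Minimal example: a StorageClass requesting two replicas per volume (repl: "2").
# The provisioner and protocol values assume a default Replicated PV Mayastor
# install; the StorageClass name is a placeholder.
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-2-replicas
parameters:
  protocol: nvmf
  repl: "2"
provisioner: io.openebs.csi-mayastor
EOF
```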
If the control plane cannot find sufficient Replicated PV Mayastor Storage pools to create the required replicas, the operation will fail, and the PVC will remain unbound. In this case, Kubernetes will periodically retry the creation of the volume. If enough suitable pools are found later, the provisioning process will succeed.
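When this happens, the PVC stays in the Pending state and its events usually indicate why provisioning has not succeeded. A quick way to check (the claim name below is a placeholder):

```bash
# The PVC name "my-pvc" is a placeholder for your actual claim.
kubectl get pvc my-pvc        # STATUS stays Pending while provisioning fails
kubectl describe pvc my-pvc   # the Events section explains why
```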
Once a volume begins processing I/O, the load is distributed across its replicas: read requests are served in a round-robin fashion, while write operations are propagated to all replicas. In real-world scenarios, replicas may experience I/O failures due to transient network issues, node reboots, or hardware failures. If a replica encounters too many failed I/Os (the precise threshold is determined internally by the data plane), its status changes to "Faulted," and it no longer accepts I/O requests. The volume is then marked as "Degraded," indicating that the desired replica count is not being met.
The control plane will retire any faulted replica, making it available for garbage collection (deletion), provided the underlying storage pool remains viable. Subsequently, the control plane will attempt to restore the desired replica count by creating a new replica, following the replica placement rules. This new replica will be rebuilt by copying data from a healthy replica (the "source"). This rebuild process occurs while the volume continues to handle I/O, although it may affect disk throughput.
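The volume state and the replica topology, including rebuild progress, can be inspected with the kubectl-mayastor plugin. The commands below are a sketch; the volume UUID is a placeholder, and subcommand names can differ between plugin versions.

```bash
# List volumes and their overall status (Online, Degraded, ...).
kubectl mayastor get volumes

# Show where each replica of a volume lives and whether it is rebuilding.
# The UUID is a placeholder; use the UUID reported by the previous command.
kubectl mayastor get volume-replica-topology ec4e66fd-3b33-4439-b504-d49aba53da26
```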
In the case of a clean restart of the nexus (that is, of the I/O engine pod hosting it), the control plane reattaches the healthy replicas to the nexus. If any faulted replicas are available for reconnection, the control plane attempts to reuse them rather than creating new replicas. If a faulted replica cannot be reused, it is retired and a new one is created. In the case of an unclean restart (for example, a crash or forceful deletion of the I/O engine pod), only one healthy replica is reattached, and the remaining replicas are rebuilt.
The number of replicas for a volume can be adjusted using the kubectl plugin's scale subcommand. The num_replicas value can be increased or decreased by one, and the control plane will adjust the number of replicas accordingly, adhering to the same placement rules. If the replica count is reduced, faulted replicas are prioritized for removal over healthy ones.
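With the plugin, scaling is performed per volume. The following is a sketch; the UUID is a placeholder, and the exact syntax for your plugin version is shown by kubectl mayastor scale --help.

```bash
# Scale the volume identified by the (placeholder) UUID to 3 replicas.
# The control plane then creates or retires replicas to reach the new count.
kubectl mayastor scale volume ec4e66fd-3b33-4439-b504-d49aba53da26 3
```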
Replica Placement Heuristics
The following rules govern the placement of replicas and the handling of faulted replicas:
- Rule 1: A volume can only be provisioned if the replica count and capacity requirements of its StorageClass can be satisfied at the time of creation.
- Rule 2: Each replica of a volume must be placed on a different I/O engine node.
- Rule 3: Faulted replicas are always prioritized for retirement over online replicas.
As a corollary of Rule 2, replicas of the same volume cannot be placed in different pools on the same I/O engine node.
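To see which nodes and pools are available for placement, and thus where a replacement replica could land, the plugin's listing commands can be used. Output columns vary by plugin version.

```bash
# List I/O engine nodes and the pools they host. Replicas of one volume
# must be placed on pools belonging to different nodes (Rule 2).
kubectl mayastor get nodes
kubectl mayastor get pools
```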
Example Scenarios
Scenario One
In a cluster with two Replicated PV Mayastor nodes, "Node-1" and "Node-2," each hosting two pools, the control plane successfully provisions a volume with two replicas. If one of the replicas experiences a hardware failure, the volume enters the "Degraded" state, and a new replica is created in a healthy pool.
Expected Behavior
- The volume maintains read/write access via the healthy replica.
- The faulted replica is retired and a new replica is created and rebuilt.
- The volume eventually returns to the "Online" state.
Scenario Two
In a cluster with three Replicated PV Mayastor nodes, "Node-1," "Node-2," and "Node-3," you create a volume with two replicas. If one of the replicas experiences I/O failures due to a SAN misconfiguration, the control plane creates a new replica on a different node (following Rule 2).
Expected Behavior
- The faulted replica is retired, and a new replica is created on Node-3.
- The volume returns to the "Online" state after the rebuild.
Scenario Three
After correcting a misconfiguration, you increase the volume's replica count from 2 to 3. The control plane attempts to reconcile the difference between the current and desired replica count.
Expected Behavior
- A new replica is created on Node-2, and the volume is rebuilt.
- The volume returns to the "Online" state after the rebuild.
Scenario Four
In a cluster with three nodes, a volume with three replicas exists. If one node goes down, the associated replica enters the "Faulted" state, and the volume remains "Degraded."
Expected Behavior
- The volume remains in the "Degraded" state: the failed node's replica cannot be recovered, and Rule 2 prevents placing a replacement on either of the surviving nodes, which already host replicas of this volume.
- The volume cannot return to the "Online" state until the failed node recovers or another node with a suitable pool becomes available.
Scenario Five
After scaling down the volume's replica count from 3 to 2, the control plane reconciles the volume's state to match the desired replica count.
Expected Behavior
- The volume returns to the "Online" state after the scale-down operation.
Scenario Six
After scaling down the volume, you scale the replica count back to 3. The control plane selects an available pool on Node-3 to create the new replica.
Expected Behavior
- The volume enters the "Degraded" state as the replica count is reconciled.
- A new replica is created on Node-3, and the volume eventually returns to the "Online" state after the rebuild.
Benefits of Replication
- Enhanced Data Availability: Ensures continuous access to data even if a replica fails, maintaining uptime.
- Improved Fault Tolerance: Faulted replicas are automatically retired and replaced without disrupting service, ensuring resilience.
- Efficient Data Recovery: Failed replicas are rebuilt seamlessly in the background, minimizing data loss and downtime.