Lifeline Controller
Overview
The Lifeline Controller in DataCore Puls8 enhances Kubernetes reliability and storage resilience by automatically handling node failures in a controlled manner.
When a node becomes unresponsive, Lifeline Controller ensures that affected stateful pods are marked as Failed and their attached Persistent Volume Claims (PVCs) are cleanly detached after a configurable grace period. This prevents volume lockups, enables faster recovery, and maintains application and business continuity without manual intervention. Lifeline Controller is Kubernetes-aware and evaluates nodes and pods based on a combination of annotations and Helm configuration defaults, allowing you to control which workloads participate in the failure-handling workflow.
This document explains how the Lifeline Controller works, how to enable and configure it, and the safeguards it employs to ensure safe and reliable failover operations.
Purpose of Lifeline Controller
In standard Kubernetes behavior, when a node becomes unreachable, stateful applications using ReadWriteOnce (RWO) volumes are not automatically rescheduled, and their pods may remain stuck indefinitely. This can:
- Prevent pods from being rescheduled on healthy nodes.
- Leave VolumeAttachments dangling, blocking detach and reuse.
- Delay recovery or even cause application outages.
Lifeline Controller addresses this by introducing a controlled, opt-in failure-handling workflow that:
- Monitors node readiness and reacts only after a defined grace period.
- Cleans up VolumeAttachments safely, allowing storage systems to detach and reattach as needed.
The result is faster recovery, reduced operational risk, and improved cluster reliability during node failures.
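The grace-period behavior in the workflow above can be illustrated with a small Python sketch. It is not Lifeline Controller's implementation: the function `should_act` is hypothetical, and the sketch assumes the controller knows when a node first reported NotReady.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_act(not_ready_since: Optional[datetime],
               now: datetime,
               grace_period: timedelta) -> bool:
    """Hypothetical gate: act only after a node has been NotReady for the full grace period.

    not_ready_since is None while the node is Ready (an assumption of this sketch).
    """
    if not_ready_since is None:
        # Node is Ready: nothing to mark as Failed and no VolumeAttachments to clean up.
        return False
    # Outages shorter than the grace period are treated as transient and ignored.
    return now - not_ready_since >= grace_period


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    grace = timedelta(minutes=5)  # matches the 5m default gracePeriod
    print(should_act(t0, t0 + timedelta(minutes=2), grace))  # still within the grace period
    print(should_act(t0, t0 + timedelta(minutes=5), grace))  # grace period has elapsed
```

A node that returns to Ready before the grace period elapses never reaches the cleanup step, which is what keeps brief network blips from triggering failover.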
Opt-In via Annotation
Lifeline Controller is explicitly opt-in, providing fine-grained control over which nodes and pods participate in its failure-handling workflow.
Use the annotation key `puls8.datacore.com/lifeline-target` on a Node, or on a Pod's controller (for example, a Deployment or StatefulSet):

```yaml
metadata:
  annotations:
    puls8.datacore.com/lifeline-target: "true"
```
- Valid values: `"true"` or `"false"`.
- Invalid or missing values: Ignored. The default is then taken from the Helm install-time parameters `nodeCandidateDefault` (for nodes) and `podCandidateDefault` (for pods).
- Kubernetes out-of-service taint: Nodes marked with the Kubernetes `out-of-service` taint are excluded from Lifeline Controller processing. This ensures that Lifeline Controller does not interfere with Kubernetes' own handling of pods that are explicitly placed into an out-of-service maintenance workflow.
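The candidacy rules above can be summarized in a short Python sketch. It is not the controller's code: the helper names are hypothetical, annotation values are assumed to be matched case-sensitively, and the taint key `node.kubernetes.io/out-of-service` is the one Kubernetes uses for its non-graceful node shutdown workflow.

```python
from typing import Mapping, Sequence

TARGET_ANNOTATION = "puls8.datacore.com/lifeline-target"
# Taint key used by the Kubernetes non-graceful node shutdown (out-of-service) workflow.
OUT_OF_SERVICE_TAINT = "node.kubernetes.io/out-of-service"


def is_candidate(annotations: Mapping[str, str], default: bool) -> bool:
    """Resolve the lifeline-target annotation, falling back to the Helm default.

    Assumption: only the exact strings "true" and "false" are honored; anything
    else (missing, empty, "yes", "TRUE", ...) is treated as invalid.
    """
    value = annotations.get(TARGET_ANNOTATION)
    if value == "true":
        return True
    if value == "false":
        return False
    return default  # nodeCandidateDefault or podCandidateDefault


def node_is_candidate(annotations: Mapping[str, str],
                      taint_keys: Sequence[str],
                      node_candidate_default: bool = True) -> bool:
    """Hypothetical node check: the out-of-service taint always excludes the node."""
    if OUT_OF_SERVICE_TAINT in taint_keys:
        return False
    return is_candidate(annotations, node_candidate_default)


if __name__ == "__main__":
    print(node_is_candidate({}, []))  # unannotated node uses nodeCandidateDefault
    print(node_is_candidate({TARGET_ANNOTATION: "true"}, [OUT_OF_SERVICE_TAINT]))  # excluded
    print(is_candidate({TARGET_ANNOTATION: "maybe"}, default=False))  # invalid value, default used
```

Checking the taint before the annotation mirrors the documented precedence: an explicit `"true"` does not override Kubernetes' out-of-service handling.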
Install-Time Settings
Lifeline Controller behavior is controlled by Helm chart configuration parameters specified during DataCore Puls8 installation or upgrade.
| Parameter | Description | Default |
|---|---|---|
| gracePeriod | Duration (for example, 5m) that Lifeline Controller waits before acting on a NotReady node, to avoid reacting to transient failures. | 5m |
| podCandidateDefault | Candidacy applied to a pod when the lifeline-target annotation on the Pod's controller is missing or invalid. | false |
| nodeCandidateDefault | Candidacy applied to a node when the lifeline-target annotation on the Node is missing or invalid. | true |
For example, to override all three settings at install time:

```shell
helm install puls8 -n puls8 --create-namespace oci://docker.io/datacoresoftware/puls8 \
  --set lifeline.gracePeriod=10m \
  --set lifeline.podCandidateDefault=true \
  --set lifeline.nodeCandidateDefault=false
```
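To see how `--set` overrides combine with the defaults in the table, the following Python sketch overlays `lifeline.<key>=<value>` strings on the chart defaults. It only models the three documented parameters; the function name is hypothetical, and Helm's own value merging is more general than this.

```python
from typing import Dict, Iterable

# Chart defaults, copied from the Install-Time Settings table.
LIFELINE_DEFAULTS: Dict[str, str] = {
    "gracePeriod": "5m",
    "podCandidateDefault": "false",
    "nodeCandidateDefault": "true",
}


def effective_settings(set_flags: Iterable[str]) -> Dict[str, str]:
    """Hypothetical helper: overlay `lifeline.<key>=<value>` strings (as passed to --set)."""
    settings = dict(LIFELINE_DEFAULTS)  # copy so the defaults are never mutated
    for flag in set_flags:
        path, sep, value = flag.partition("=")
        if not sep or not path.startswith("lifeline."):
            raise ValueError(f"not a lifeline setting: {flag!r}")
        key = path[len("lifeline."):]
        if key not in settings:
            raise KeyError(f"unknown lifeline setting: {key!r}")
        settings[key] = value
    return settings


if __name__ == "__main__":
    print(effective_settings([
        "lifeline.gracePeriod=10m",
        "lifeline.podCandidateDefault=true",
        "lifeline.nodeCandidateDefault=false",
    ]))
```

With the three overrides from the example command, the effective values are gracePeriod 10m, podCandidateDefault true, and nodeCandidateDefault false. Unannotated pods then become candidates while unannotated nodes do not, so only nodes explicitly annotated with `puls8.datacore.com/lifeline-target: "true"` participate.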
Granular Grace Period Configuration
In addition to the global grace period configured at install time, Lifeline Controller supports per-pod grace period customization using annotations. This allows you to define different failure-handling timings for specific workloads based on application requirements.
For example, some applications may require additional time to recover gracefully before being rescheduled, while others may benefit from faster failover than the cluster-wide default.
Annotation Key
Use the following annotation on the Pod’s controller (for example, Deployment, StatefulSet, or ReplicaSet) to specify a custom grace period:
```yaml
metadata:
  annotations:
    puls8.datacore.com/lifeline-grace-period: "50s"
```
Supported Values
- Accepts any valid human-readable duration, such as `10s`, `30s`, or `5m`.
- If the annotation is missing or invalid, Lifeline Controller falls back to the global grace period (`gracePeriod`) configured via Helm.
This capability enables fine-grained control over failover behavior without affecting cluster-wide defaults, allowing operators to tune recovery timing on a per-application basis.
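The fallback from the per-workload annotation to the global `gracePeriod` can be sketched in Python. This sketch assumes Go-style duration strings built from hour, minute, and second parts (for example `50s`, `5m`, or `1h30m`); the exact set of formats Lifeline Controller accepts may differ, and the function names are hypothetical.

```python
import re
from datetime import timedelta
from typing import Mapping

GRACE_ANNOTATION = "puls8.datacore.com/lifeline-grace-period"
# Assumption: Go-style durations made of optional h, m, and s parts, in that order.
_DURATION = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?")


def parse_duration(text: str) -> timedelta:
    """Parse strings such as "10s", "5m", or "1h30m"; raise ValueError otherwise."""
    match = _DURATION.fullmatch(text.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"invalid duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def effective_grace_period(controller_annotations: Mapping[str, str],
                           global_grace_period: str = "5m") -> timedelta:
    """Per-workload annotation wins; a missing or invalid value falls back to Helm's gracePeriod."""
    raw = controller_annotations.get(GRACE_ANNOTATION)
    if raw is not None:
        try:
            return parse_duration(raw)
        except ValueError:
            pass  # invalid annotation: use the global setting instead
    return parse_duration(global_grace_period)


if __name__ == "__main__":
    print(effective_grace_period({GRACE_ANNOTATION: "50s"}))   # annotation value is used
    print(effective_grace_period({GRACE_ANNOTATION: "soon"}))  # invalid, falls back to 5m
```

Because an invalid annotation silently falls back rather than failing, a typo in the annotation value results in the cluster-wide timing, not in an error on the workload.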
Operational Boundaries
Lifeline Controller avoids interfering with native Kubernetes or DataCore Puls8 recovery workflows. Specifically, it does not:
- Interact with nodes using the `out-of-service` taint flow managed by Kubernetes.
- Replace or modify Kubernetes’ existing eviction or rescheduling logic.
This ensures compatibility with Kubernetes-native failover mechanisms and maintains controlled operational boundaries.
Best Practices
- Start with the default 5-minute `gracePeriod` and adjust gradually based on cluster behavior and workload criticality.
- Use annotations only on nodes and pods where automatic failover handling is safe and desired.
- Verify that Lifeline Controller and your CSI driver configurations are aligned before enabling in production.
- Ensure that license validation is in place to allow Lifeline Controller actions.
Benefits of Lifeline Controller
- Faster Recovery: Reduces pod and volume recovery times after node failures.
- Automated Cleanup: Prevents dangling VolumeAttachments and stalled rescheduling.
- Controlled Opt-In: Limits the workflow to explicitly approved nodes and pods.
- Operational Safety: Enforces grace periods to prevent data loss.
- Seamless Integration: Works alongside Kubernetes-native failover and DataCore Puls8 storage management components.