Lifeline Controller

Overview

The Lifeline Controller in DataCore Puls8 enhances Kubernetes reliability and storage resilience by automatically handling node failures in a controlled manner.

When a node becomes unresponsive, Lifeline Controller waits for a configurable grace period, then marks the affected stateful pods as Failed and cleanly detaches the volumes backing their Persistent Volume Claims (PVCs). This prevents volume lockups, speeds up recovery, and maintains application continuity without manual intervention. Lifeline Controller decides which nodes and pods to act on from a combination of annotations and Helm configuration defaults, so you control which workloads participate in the failure-handling workflow.

This document explains how the Lifeline Controller works, how to enable and configure it, and the safeguards it employs to ensure safe and reliable failover operations.

Purpose of Lifeline Controller

In standard Kubernetes behavior, when a node becomes unreachable, stateful applications using ReadWriteOnce (RWO) volumes are not automatically rescheduled, and their pods may remain stuck indefinitely. This can:

- Prevent pods from being rescheduled on healthy nodes.
- Leave VolumeAttachments dangling, blocking detach and reuse.
- Delay recovery or even cause application outages.

Lifeline Controller addresses this by introducing a controlled, opt-in failure-handling workflow that:

- Monitors node readiness and reacts only after a defined grace period.
- Cleans up VolumeAttachments safely, allowing storage systems to detach and reattach as needed.

The result is faster recovery, reduced operational risk, and improved cluster reliability during node failures.

Opt-In via Annotation

Lifeline Controller is explicitly opt-in, providing fine-grained control over which nodes and pods participate in its failure-handling workflow.

Use the annotation key:

Annotation to Opt in a Node or Pod's Controller for Lifeline Controller Monitoring

puls8.datacore.com/lifeline-target: "true"

Example: Annotating a Pod's Controller or Node

metadata:
  annotations:
    puls8.datacore.com/lifeline-target: "true"

- Valid values: "true" or "false".
- Invalid or missing values: The annotation is ignored and the Helm install-time default applies (nodeCandidateDefault for nodes, podCandidateDefault for pods).
- Kubernetes out-of-service taint: Nodes marked with the Kubernetes out-of-service taint are excluded from Lifeline Controller processing. This ensures Lifeline Controller does not interfere with Kubernetes' own handling of pods that are explicitly placed into an out-of-service maintenance workflow.
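
As a fuller illustration of the opt-in annotation, the manifest below sketches a StatefulSet that opts in to Lifeline Controller handling. It assumes the annotation is read from the controller object's own top-level metadata, as the example above suggests. The workload name, image, mount path, and storage class are hypothetical placeholders.

```yaml
# Hypothetical StatefulSet that opts in to Lifeline Controller handling.
# Names, image, and storage class are placeholders, not values shipped with Puls8.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: orders-db
  annotations:
    # Opt-in annotation on the Pod's controller; the value is a quoted string.
    puls8.datacore.com/lifeline-target: "true"
spec:
  serviceName: orders-db
  replicas: 1
  selector:
    matchLabels:
      app: orders-db
  template:
    metadata:
      labels:
        app: orders-db
    spec:
      containers:
        - name: db
          image: registry.example.com/orders-db:1.0   # placeholder image
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: puls8-storage   # assumed storage class name
        resources:
          requests:
            storage: 10Gi
```

With nodeCandidateDefault left at its default of true, nodes need no annotation of their own for this workload to be covered. Setting the same annotation to "false" on a node or controller should opt it out explicitly.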

Install-Time Settings

Lifeline Controller behavior is controlled by Helm chart configuration parameters specified during DataCore Puls8 installation or upgrade.

Parameter	Description	Default
gracePeriod	How long Lifeline Controller waits before acting on a NotReady node, so that it does not react to transient failures. Specified as a duration, for example 5m or 10m.	5m
podCandidateDefault	Default applied when the lifeline-target annotation on the Pod's controller is missing or invalid.	false
nodeCandidateDefault	Default applied when the lifeline-target annotation on the Node is missing or invalid.	true

Example: Setting Install-Time Parameters

helm install puls8 -n puls8 --create-namespace oci://docker.io/datacoresoftware/puls8 \
  --set lifeline.gracePeriod=10m \
  --set lifeline.podCandidateDefault=true \
  --set lifeline.nodeCandidateDefault=false
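
The same settings can also be kept in a values file rather than passed as individual --set flags. The sketch below mirrors the key paths from the command above, since Helm maps a dotted path such as lifeline.gracePeriod to a nested lifeline block. The file name lifeline-values.yaml is only an example.

```yaml
# lifeline-values.yaml (hypothetical file name)
# Mirrors the --set flags in the helm install example.
lifeline:
  gracePeriod: 10m             # wait 10 minutes before acting on a NotReady node
  podCandidateDefault: true    # controllers without a valid annotation are candidates
  nodeCandidateDefault: false  # nodes without a valid annotation are not candidates
```

Pass the file to helm install or helm upgrade with the -f (or --values) flag in place of the three --set options.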

Granular Grace Period Configuration

In addition to the global grace period configured at install time, Lifeline Controller supports per-pod grace period customization using annotations. This allows you to define different failure-handling timings for specific workloads based on application requirements.

For example, some applications may require additional time to recover gracefully before being rescheduled, while others may benefit from faster failover than the cluster-wide default.

Annotation Key

Use the following annotation on the Pod’s controller (for example, Deployment, StatefulSet, or ReplicaSet) to specify a custom grace period:

Annotation to Configure a Pod-Level Grace Period

puls8.datacore.com/lifeline-grace-period: "50s"

Example: Annotating a Pod’s Controller

metadata:
  annotations:
    puls8.datacore.com/lifeline-grace-period: "50s"

Supported Values

Accepts any valid human-readable duration format, such as:
- 10s
- 30s
- 5m

If the annotation is missing or invalid, Lifeline Controller falls back to the global grace period (gracePeriod) configured via Helm.

This capability enables fine-grained control over failover behavior without affecting cluster-wide defaults, allowing operators to tune recovery timing on a per-application basis.
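
Putting the two annotations together, the following sketch shows a Deployment that opts in and requests a shorter grace period than the 5m default. The workload name, image, and PVC name are hypothetical, and the manifest assumes both annotations sit on the controller's top-level metadata.

```yaml
# Hypothetical Deployment with an explicit opt-in and a per-workload grace period.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: session-cache
  annotations:
    # Needed when podCandidateDefault is false (the default).
    puls8.datacore.com/lifeline-target: "true"
    # Overrides the global gracePeriod for this workload only.
    puls8.datacore.com/lifeline-grace-period: "30s"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: session-cache
  template:
    metadata:
      labels:
        app: session-cache
    spec:
      containers:
        - name: cache
          image: registry.example.com/session-cache:1.0   # placeholder image
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: session-cache-data   # placeholder PVC name
```

Workloads that omit the grace-period annotation continue to use the global value, so an override like this one stays local to the application that needs it.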

Operational Boundaries

Lifeline Controller avoids interfering with native Kubernetes or DataCore Puls8 recovery workflows. Specifically, it does not:

- Act on nodes that are being handled through the Kubernetes out-of-service taint flow.
- Replace or modify Kubernetes’ existing eviction or rescheduling logic.

This ensures compatibility with Kubernetes-native failover mechanisms and maintains controlled operational boundaries.
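
For reference, the out-of-service taint mentioned above is presumably the one Kubernetes defines for non-graceful node shutdown handling. On a tainted node it appears in the Node object roughly as follows. The node name is hypothetical, and the value nodeshutdown is the one used in the Kubernetes documentation.

```yaml
# Fragment of a Node object carrying the Kubernetes out-of-service taint.
apiVersion: v1
kind: Node
metadata:
  name: worker-1   # hypothetical node name
spec:
  taints:
    - key: node.kubernetes.io/out-of-service
      value: nodeshutdown
      effect: NoExecute
```

A node in this state is left to Kubernetes' own out-of-service handling, and Lifeline Controller does not process it.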

Best Practices

- Start with the default 5-minute gracePeriod and adjust gradually based on cluster behavior and workload criticality.
- Opt in only those nodes and pods where automatic failover handling is safe and desired.
- Verify that Lifeline Controller and your CSI driver configurations are aligned before enabling Lifeline Controller in production.
- Ensure that license validation is in place; Lifeline Controller actions require it.

Benefits of Lifeline Controller

- Faster Recovery: Reduces pod and volume recovery times after node failures.
- Automated Cleanup: Prevents dangling VolumeAttachments and stalled rescheduling.
- Controlled Opt-In: Limits the workflow to explicitly approved nodes and pods.
- Operational Safety: Enforces grace periods so that transient failures do not trigger premature failover, helping to prevent data loss.
- Seamless Integration: Works alongside Kubernetes-native failover and DataCore Puls8 storage management components.
