Alerting

Overview

Alerting in DataCore Puls8 provides real-time insights into the health and performance of storage resources by integrating with the Prometheus and Alertmanager components of the Kubernetes ecosystem. It enables automated detection of critical issues such as latency spikes, resource saturation, and abnormal behavior in both OpenEBS Local Storage and OpenEBS Replicated Storage.

The alerting system helps you take timely action by triggering customized notifications based on pre-defined thresholds and rules. DataCore Puls8’s alerting capabilities are designed for flexibility, allowing you to tailor alerts to operational standards and integrate with existing monitoring infrastructure.

Requirements

You must have valid credentials for the datacoresoftware Docker Hub registry.

Installation

The monitoring chart is included as a dependency in the DataCore Puls8 umbrella Helm chart.

Monitoring is enabled by default, and the stack installs:

  • kube-prometheus-stack v70.10.0
  • DataCore Puls8 specific add-ons and configurations for:
    • Prometheus
    • Grafana
    • Alertmanager

If you want to install DataCore Puls8 without the monitoring stack (for example, if your environment already includes kube-prometheus-stack), use the following Helm command:

Install DataCore Puls8 without Monitoring Stack
helm install puls8 oci://registry-1.docker.io/datacoresoftware/puls8 \
  -n puls8 --create-namespace --version 4.3.0-develop \
  --set monitoring.kube-prometheus-stack.install=false

This installs only DataCore Puls8-specific monitoring custom resources (CRs). If these CRs are installed in a namespace that your existing Prometheus installation does not watch, some additional configuration is required.
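In that case, the namespace selectors of the existing Prometheus installation can be pointed at the namespace holding the CRs. The following is a minimal sketch that assumes the CRs live in a namespace named puls8 and that the existing stack is kube-prometheus-stack; adjust the namespace name to match your deployment:

Select the Namespace Containing DataCore Puls8 Monitoring CRs
prometheus:
  prometheusSpec:
    # Discover PrometheusRules and ServiceMonitors outside Prometheus's own
    # namespace. "puls8" is an assumed namespace name; replace it with the
    # namespace where the DataCore Puls8 monitoring CRs are installed.
    ruleNamespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: puls8
    serviceMonitorNamespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: puls8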

If you already have kube-prometheus-stack installed:

Prometheus Rule Selector Adjustment

To prevent the DataCore Puls8-specific rules from being ignored due to mismatched release labels, adjust the rule selector in the existing Prometheus installation:

Adjust Prometheus Rule Selector to Accept DataCore Puls8 Rules
helm upgrade <release_name> prometheus-community/kube-prometheus-stack -n monitoring \
  --set prometheus.prometheusSpec.serviceMonitorSelector.matchLabels=null \
  --set prometheus.prometheusSpec.podMonitorSelector.matchLabels=null \
  --set prometheus.prometheusSpec.ruleSelector.matchLabels=null

This allows Prometheus to discover and process DataCore Puls8 ServiceMonitors, PodMonitors, and rules regardless of release label mismatches.
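If the existing stack is managed through a values file rather than --set flags, a commonly used alternative is to disable the chart's selector "NilUsesHelmValues" switches, which has the same effect. This is a sketch assuming a standard kube-prometheus-stack values layout:

Equivalent values.yaml Settings
prometheus:
  prometheusSpec:
    # When set to false, Prometheus selects ServiceMonitors, PodMonitors,
    # and PrometheusRules from all releases instead of only those carrying
    # this release's label.
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false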

Handling Alertmanager from External Stack

If Alertmanager is installed separately (i.e., not managed by DataCore Puls8), you must manually integrate DataCore Puls8 alerting by adding child routes and receivers specific to DataCore Puls8 alerts in the existing Alertmanager configuration.
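For example, a child route that forwards DataCore Puls8 alerts (which carry the product="puls8" label) to a dedicated receiver could look like the following sketch; the receiver names and webhook URL are placeholders for your own configuration:

Sample Child Route in an External Alertmanager Configuration
route:
  receiver: default-receiver          # existing top-level receiver
  routes:
    - matchers:
        - product="puls8"             # label set on DataCore Puls8 alert rules
      receiver: puls8-receiver
receivers:
  - name: default-receiver
    # ... existing receiver configuration ...
  - name: puls8-receiver
    webhook_configs:
      - url: 'http://example.org/puls8-alerts'   # placeholder endpoint
        send_resolved: true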

Alertmanager Configuration

Prometheus handles the evaluation of rules and creation of alerts, but not their delivery. Alertmanager acts as the notification system managing alert grouping, deduplication, silencing, routing, and dispatching to receivers.

The Alertmanager configuration is defined in the values.yaml file.

By default, no receivers are defined. You are expected to configure receivers based on your requirements.

Sample Setup with Email Receivers
monitoring:
  kube-prometheus-stack:
    alertmanager:
      config:
        global:
          smtp_smarthost: 'smtp.org.com:587'
          smtp_from: 'sender@org.com'
          smtp_auth_username: 'sender@org.com'
          smtp_auth_password: 'hAOS357*XZpqsse'
        route:
          receiver: team-X-mails
          group_by: [alertname, engine]
          routes:
            - matchers:
                - product="puls8"
              receiver: puls8-receiver
        receivers:
          - name: 'team-X-mails'
            email_configs:
              - to: 'team-X+alerts@example.org'
                send_resolved: true
          - name: 'puls8-receiver'
            email_configs:
              - to: 'receiver@org.com'
                send_resolved: true

Refer to the Prometheus Alertmanager Configuration Documentation for more details and other receiver types.
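For instance, a Slack receiver can be added alongside (or substituted for) the email receivers above. The following is a sketch only; the webhook URL and channel are placeholders:

Sample Slack Receiver
receivers:
  - name: 'puls8-receiver'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'   # placeholder webhook
        channel: '#storage-alerts'
        send_resolved: true
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'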

Alerting Rules

DataCore Puls8 includes Prometheus alert rules focused on OpenEBS Replicated PV Mayastor performance and capacity metrics. These rules can be modified or extended based on the needs of each organization.

Performance Rules

Performance rules monitor latency across:

  • Volume Targets
  • Replicas
  • DiskPools

Latency metrics (read/write) are collected as time series using counters exposed by OpenEBS Replicated PV Mayastor. These counters are stored in memory and reset when the service restarts. Refer to the Monitoring Documentation for more information on how latency is calculated.

Sample Rule
- alert: MayastorDiskPoolWriteLatencyAvgHigh
  expr: irate(diskpool_write_latency_us[1m]) / irate(diskpool_num_write_ops[1m]) > 500
  for: 5m
  labels:
    severity: warning
    product: puls8
    engine: mayastor
  annotations:
    summary: "High write latency on disk pool"
    description: "The write latency on disk pool {{ $labels.name }} on node {{ $labels.node }} is higher than 0.5ms."

  • alert: Name of the rule.
  • expr: Calculates the average write latency per operation using the Prometheus irate function.
  • for: The condition must hold for 5 minutes before the alert fires.
  • labels: Used to categorize and filter alerts in Alertmanager.
  • annotations: Provide a summary and description for better visibility.

Performance thresholds vary with application type, workload density, and infrastructure. Benchmark your environment before customizing thresholds.
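For example, after benchmarking you might add a stricter, critical-level counterpart to the sample rule above. The sketch below reuses the same diskpool metrics; the 1000 µs threshold is purely illustrative, not a recommendation:

Sample Customized Rule
- alert: MayastorDiskPoolWriteLatencyAvgCritical
  # Average write latency per operation over the last minute, in microseconds.
  expr: irate(diskpool_write_latency_us[1m]) / irate(diskpool_num_write_ops[1m]) > 1000
  for: 5m
  labels:
    severity: critical
    product: puls8
    engine: mayastor
  annotations:
    summary: "Very high write latency on disk pool"
    description: "The write latency on disk pool {{ $labels.name }} on node {{ $labels.node }} is higher than 1ms."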

Capacity Rules

Capacity alerts monitor DiskPool usage. The default behavior is:

  • Warning alert when > 75% of capacity is consumed
  • Critical alert when > 90% of capacity is consumed

Sample Rule
- alert: MayastorDiskPoolUsage
  expr: diskpool_used_size_bytes / diskpool_total_size_bytes > 0.9
  for: 1m
  labels:
    engine: mayastor
    product: puls8
    severity: critical
  annotations:
    summary: "Critical Alert of Disk Pool Usage"
    description: "Mayastor diskpool {{ $labels.name }} on node {{ $labels.node }} has exceeded 90% of total capacity."

Alert Evaluation and Triggering

Prometheus evaluates each alerting rule at a 30-second interval (default) within a rule group. If the rule expression holds true continuously for the duration specified in the for field, the alert transitions from Pending to Firing and is sent to Alertmanager.
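If the default interval does not suit your environment, it can be tuned through the monitoring values of the umbrella chart. The snippet below is a sketch assuming the bundled kube-prometheus-stack is used:

Adjust the Rule Evaluation Interval
monitoring:
  kube-prometheus-stack:
    prometheus:
      prometheusSpec:
        # How often Prometheus evaluates alerting rule groups (default: 30s).
        evaluationInterval: 30s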

Labels in the alert help group similar alerts, and annotations provide context such as summary and description.

Benefits of Alerting

  • Faster Issue Detection and Resolution: Reduce mean time to resolution (MTTR) by acting on alerts in real time.
  • Improved Reliability: Proactively manage performance and capacity issues before they impact workloads.
  • Customizable: Tailor alert rules and thresholds to suit application-specific needs.
  • Seamless Integration: Compatible with existing kube-prometheus-stack deployments.

Learn More