Alerting
Explore this Page
- Overview
- Installation
- Alertmanager Configuration
- Alerting Rules
- Alert Evaluation and Triggering
- Best Practices
- Benefits of Alerting
Overview
Alerting in DataCore Puls8 provides real-time insights into the health and performance of storage resources by integrating with the Prometheus and Alertmanager components of the Kubernetes ecosystem. It enables automated detection of critical issues such as latency spikes, resource saturation, and abnormal behavior in both OpenEBS Local Storage and OpenEBS Replicated Storage.
The alerting system helps you take timely action by triggering customized notifications based on pre-defined thresholds and rules. DataCore Puls8’s alerting capabilities are designed for flexibility, allowing you to tailor alerts to operational standards and integrate with existing monitoring infrastructure.
Installation
The monitoring chart is included as a dependency in the DataCore Puls8 umbrella Helm chart.
Monitoring is enabled by default, and the stack installs:
- kube-prometheus-stack v70.10.0
- DataCore Puls8-specific add-ons and configurations for:
  - Prometheus
  - Grafana
  - Alertmanager
To install DataCore Puls8 without the monitoring stack (for example, if your environment already includes kube-prometheus-stack), use the following Helm command:
helm install puls8 oci://registry-1.docker.io/datacoresoftware/puls8 \
-n puls8 --create-namespace --version 4.3.0-develop \
--set monitoring.kube-prometheus-stack.install=false
This installs only DataCore Puls8-specific monitoring custom resources (CRs). If these CRs are installed in a different namespace, some additional configuration is required.
If you already have kube-prometheus-stack installed:
Prometheus Rule Selector Adjustment
To prevent the DataCore Puls8-specific rules and monitors from being ignored due to mismatched release labels, clear the selectors in the existing Prometheus installation:
helm upgrade <release_name> prometheus-community/kube-prometheus-stack -n monitoring \
--set prometheus.prometheusSpec.serviceMonitorSelector.matchLabels=null \
--set prometheus.prometheusSpec.podMonitorSelector.matchLabels=null \
--set prometheus.prometheusSpec.ruleSelector.matchLabels=null
This allows Prometheus to detect and process rules regardless of label mismatches.
Handling Alertmanager from External Stack
If Alertmanager is installed separately (i.e., not managed by DataCore Puls8), you must manually integrate DataCore Puls8 alerting by adding child routes and receivers specific to DataCore Puls8 alerts in the existing Alertmanager configuration.
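A minimal sketch of such an addition, assuming email delivery and reusing the product="puls8" matcher and receiver name from the default configuration shown in the next section (the destination address is a placeholder):
route:
  routes:
    # Child route that captures DataCore Puls8 alerts.
    - matchers:
        - product="puls8"
      receiver: puls8-receiver
receivers:
  - name: 'puls8-receiver'
    email_configs:
      - to: 'receiver@org.com'   # placeholder address
        send_resolved: true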
Alertmanager Configuration
Prometheus handles the evaluation of rules and creation of alerts, but not their delivery. Alertmanager acts as the notification system managing alert grouping, deduplication, silencing, routing, and dispatching to receivers.
The Alertmanager configuration is defined in the values.yaml file.
By default, no receivers are defined. You are expected to configure receivers based on your requirements.
monitoring:
  kube-prometheus-stack:
    alertmanager:
      config:
        global:
          smtp_smarthost: 'smtp.org.com:587'
          smtp_from: 'sender@org.com'
          smtp_auth_username: 'sender@org.com'
          smtp_auth_password: 'hAOS357*XZpqsse'
        route:
          receiver: team-X-mails
          group_by: [alertname, engine]
          routes:
            - matchers:
                - product="puls8"
              receiver: puls8-receiver
        receivers:
          - name: 'team-X-mails'
            email_configs:
              - to: 'team-X+alerts@example.org'
                send_resolved: true
          - name: 'puls8-receiver'
            email_configs:
              - to: 'receiver@org.com'
                send_resolved: true
Refer to the Prometheus Alertmanager Configuration Documentation for more details and other receiver types.
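For example, a Slack receiver could be added alongside the email receivers above; this is only a sketch, and the webhook URL and channel are placeholders:
receivers:
  - name: 'puls8-slack'
    slack_configs:
      # Placeholder incoming-webhook URL; replace with your own.
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
        channel: '#storage-alerts'
        send_resolved: true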
Alerting Rules
DataCore Puls8 includes Prometheus alert rules focused on OpenEBS Replicated PV Mayastor performance and capacity metrics. These rules can be modified or extended based on the needs of each organization.
Performance Rules
Performance rules monitor latency across:
- Volume Targets
- Replicas
- DiskPools
Read and write latency metrics are collected as time series from counters exposed by OpenEBS Replicated PV Mayastor. These counters are stored in memory and reset when the service restarts. Refer to the Monitoring Documentation for more information on how latency is calculated.
- alert: MayastorDiskPoolWriteLatencyAvgHigh
  expr: irate(diskpool_write_latency_us[1m]) / irate(diskpool_num_write_ops[1m]) > 500
  for: 5m
  labels:
    severity: warning
    product: puls8
    engine: mayastor
  annotations:
    summary: "High write latency on disk pool"
    description: "The write latency on disk pool {{ $labels.name }} on node {{ $labels.node }} is higher than 0.5ms."
- alert: Name of the rule.
- expr: Calculates the average write latency per operation using the Prometheus irate function.
- for: The condition must hold for 5 minutes before the alert is triggered.
- labels: Categorize and filter alerts in Alertmanager.
- annotations: Provide a summary and description for better visibility.
Performance thresholds vary with application type, workload density, and infrastructure. Benchmark your environment before customizing thresholds.
Capacity Rules
Capacity alerts monitor DiskPool usage. The default behavior is:
- Warning alert when > 75% of capacity is consumed
- Critical alert when > 90% of capacity is consumed
- alert: MayastorDiskPoolUsage
  expr: diskpool_used_size_bytes / diskpool_total_size_bytes > 0.9
  for: 1m
  labels:
    engine: mayastor
    product: puls8
    severity: critical
  annotations:
    summary: "Critical Alert of Disk Pool Usage"
    description: "Mayastor diskpool {{ $labels.name }} on node {{ $labels.node }} has exceeded 90% of total capacity."
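A corresponding warning-level rule follows the same pattern. The sketch below assumes the 75% threshold described above and is not necessarily identical to the preconfigured rule:
- alert: MayastorDiskPoolUsage
  # Warn once the pool crosses 75% of its total capacity.
  expr: diskpool_used_size_bytes / diskpool_total_size_bytes > 0.75
  for: 1m
  labels:
    engine: mayastor
    product: puls8
    severity: warning
  annotations:
    summary: "Warning Alert of Disk Pool Usage"
    description: "Mayastor diskpool {{ $labels.name }} on node {{ $labels.node }} has exceeded 75% of total capacity."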
Preconfigured Prometheus Rules
The following Prometheus rules are automatically deployed in the puls8 namespace:
- puls8-monitoring-lvmlocalpv-rules
- puls8-monitoring-mayastor-rules
- puls8-monitoring-volume-rules
These rules define default alert conditions and thresholds for Replicated PV Mayastor, Local PV LVM, and Kubernetes Persistent Volumes (PV).
They cover key areas such as performance metrics, capacity utilization, and PV/Persistent Volume Claim (PVC) state monitoring.
Inspecting Existing Rules
To list all PrometheusRule objects in the DataCore Puls8 environment:
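For example:
kubectl get prometheusrules -n puls8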
To view the full YAML definition for all rules:
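For example:
kubectl get prometheusrules -n puls8 -o yaml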
Example output includes:
- Replicated PV Mayastor performance alerts - Average read/write latency on disk pools, volumes, and replicas (warning at 240 ms, critical at 480 ms).
- Replicated PV Mayastor capacity alerts - Disk pool usage exceeding 75% triggers a warning, exceeding 90% triggers a critical alert.
- Local PV LVM alerts - Missing physical volumes, or volume groups and thin pools exceeding 90% capacity.
- Volume/PVC alerts - Stale, pending, or lost PVCs.
Customizing Alerts
You can modify or extend these rules to align with operational requirements, such as adjusting thresholds or adding new alert definitions.
Only users with administrator privileges can modify PrometheusRule objects.
To edit an existing rule:
kubectl edit prometheusrules -n puls8 puls8-monitoring-mayastor-rules -o yaml
After saving changes, Prometheus automatically reloads the updated rule configurations. Thresholds for latency, capacity, or volume performance can be adjusted as required.
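For instance, to relax the disk pool write latency warning from 0.5 ms to 1 ms, you could change the expression of the rule shown earlier; the fragment below is a sketch of the relevant fields only:
- alert: MayastorDiskPoolWriteLatencyAvgHigh
  # Threshold raised from 500 us (0.5 ms) to 1000 us (1 ms).
  expr: irate(diskpool_write_latency_us[1m]) / irate(diskpool_num_write_ops[1m]) > 1000
  for: 5m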
Example Alerts
Replicated PV Mayastor Performance Alerts
| Alert | Metric | Threshold | Severity |
|---|---|---|---|
| MayastorDiskPoolReadLatencyAvgHigh | diskpool_read_latency_us / diskpool_num_read_ops | > 240 ms (warning), > 480 ms (critical) | Warning / Critical |
| MayastorDiskPoolWriteLatencyAvgHigh | diskpool_write_latency_us / diskpool_num_write_ops | > 240 ms (warning), > 480 ms (critical) | Warning / Critical |
| MayastorVolumeReadLatencyAvgHigh | volume_read_latency_us / volume_num_read_ops | > 240 ms (warning), > 480 ms (critical) | Warning / Critical |
| MayastorVolumeWriteLatencyAvgHigh | volume_write_latency_us / volume_num_write_ops | > 240 ms (warning), > 480 ms (critical) | Warning / Critical |
| MayastorReplicaReadLatencyAvgHigh | replica_read_latency_us / replica_num_read_ops | > 240 ms (warning), > 480 ms (critical) | Warning / Critical |
| MayastorReplicaWriteLatencyAvgHigh | replica_write_latency_us / replica_num_write_ops | > 240 ms (warning), > 480 ms (critical) | Warning / Critical |
Replicated PV Mayastor Capacity Alerts
| Alert | Metric | Threshold | Severity |
|---|---|---|---|
| MayastorDiskPoolUsage | diskpool_used_size_bytes / diskpool_total_size_bytes | > 75% (warning), > 90% (critical) | Warning / Critical |
Local PV LVM Alerts
| Alert | Metric | Threshold | Severity |
|---|---|---|---|
| LVMVolumeGroupCapacityLow | (total - free) / total * 100 | > 90% | Critical |
| LVMThinPoolCapacityLow | lvm_lv_used_percent{segtype="thin-pool"} | > 90% | Critical |
| LVMVolumeGroupMissingPhysicalVolume | lvm_vg_missing_pv_count | > 0 | Critical |
PVC Alerts
| Alert | Description | Severity |
|---|---|---|
| StalePersistentVolumeClaim | PVC not bound to any active Pod. | Info |
| PendingPersistentVolumeClaim | PVC pending for more than 5 minutes. | Warning |
| LostPersistentVolumeClaim | PVC lost its corresponding PV. | Warning |
Alert Evaluation and Triggering
Prometheus evaluates each alerting rule within its rule group at a 30-second interval by default. If the rule expression holds true continuously for the defined for duration, the alert transitions from Pending to Firing and is sent to Alertmanager.
Labels in the alert help group similar alerts, and annotations provide context such as summary and description.
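As a sketch, the evaluation interval and hold duration appear in the group and rule definitions of a PrometheusRule object (the group name below is illustrative):
spec:
  groups:
    - name: example-latency-rules   # illustrative group name
      interval: 30s                 # how often the expressions are evaluated
      rules:
        - alert: MayastorDiskPoolWriteLatencyAvgHigh
          expr: irate(diskpool_write_latency_us[1m]) / irate(diskpool_num_write_ops[1m]) > 500
          for: 5m                   # must remain true this long before the alert fires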
Best Practices
- Always back up existing PrometheusRule YAML files before making modifications (see the example command after this list).
- When creating custom alerts, duplicate and extend existing rules instead of overwriting preconfigured ones.
- Validate custom expressions using the Prometheus expression browser to ensure accuracy before deployment.
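For example, to back up the Mayastor rules before editing them (the output filename is arbitrary):
kubectl get prometheusrules puls8-monitoring-mayastor-rules -n puls8 -o yaml > puls8-monitoring-mayastor-rules-backup.yaml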
Benefits of Alerting
- Faster Issue Detection and Resolution: Reduce MTTR by acting on alerts in real time.
- Improved Reliability: Proactively manage performance and capacity issues before they impact workloads.
- Customizable: Tailor alert rules and thresholds to suit application-specific needs.
- Seamless Integration: Compatible with existing kube-prometheus-stack deployments.
Learn More