Monitoring

Overview

Effective monitoring is critical for maintaining the health and performance of storage infrastructure. This document outlines the metrics exposed by various exporters within the Replicated PV Mayastor ecosystem. These metrics enable observability of resource usage, performance trends, and operational statuses, thereby facilitating data-driven troubleshooting and capacity planning.

Depending on the situation, it performs either a full rebuild (restores the entire replica) or a partial rebuild (restores only the changed data). This flexibility ensures fast recovery with minimal disruption to applications.

Enable Monitoring

The DataCore Puls8 monitoring stack is enabled by default when you install DataCore Puls8 using Helm.

To disable monitoring, use the following Helm flag:

Disable Monitoring Components
--set monitoring.enabled=false

This disables the installation of Prometheus, Grafana, and related monitoring components that are part of the DataCore Puls8 monitoring stack.

To collect metrics such as pool usage, volume statistics, and I/O performance, ensure that monitoring is not disabled.

Pool Metrics Exporter

The Pool Metrics Exporter runs as a sidecar container alongside each I/O Engine pod. It exposes Prometheus-compatible pool metrics via the metrics HTTP endpoint on port 9502. These metrics are refreshed every five minutes to reflect recent usage and state information.

Supported Pool Metrics

Name | Type | Unit | Description
disk_pool_total_size_bytes | Gauge | Integer | Total size of the pool in bytes
disk_pool_used_size_bytes | Gauge | Integer | Used size of the pool in bytes
disk_pool_status | Gauge | Integer | Pool status: 0 = Unknown, 1 = Online, 2 = Degraded, 3 = Faulted
disk_pool_committed_size_bytes | Gauge | Integer | Committed size of the pool in bytes
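
When the status value is used in scripts or alert messages, the numeric code is easier to read once it is translated back into the state names listed in the table. The following is a minimal sketch; the function name pool_status_name is hypothetical and not part of the product.

```shell
#!/bin/sh
# Translate a disk_pool_status value into the state name from the table above.
# pool_status_name is an illustrative helper, not a Puls8 command.
pool_status_name() {
  case "$1" in
    0) echo "Unknown" ;;
    1) echo "Online" ;;
    2) echo "Degraded" ;;
    3) echo "Faulted" ;;
    *) echo "Unrecognized($1)" ;;   # any value outside the documented range
  esac
}
```

For example, pool_status_name 2 prints Degraded.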

Sample Pool Metrics Output

Example output from Pool Metrics Exporter using Prometheus format
# HELP disk_pool_status Status of the disk pool (1 = healthy, 0 = unhealthy)
# TYPE disk_pool_status gauge
disk_pool_status{node="worker-0", name="mayastor-disk-pool"} 1

# HELP disk_pool_total_size_bytes Total size of the disk pool in bytes
# TYPE disk_pool_total_size_bytes gauge
disk_pool_total_size_bytes{node="worker-0", name="mayastor-disk-pool"} 5360320512

# HELP disk_pool_used_size_bytes Used size of the disk pool in bytes
# TYPE disk_pool_used_size_bytes gauge
disk_pool_used_size_bytes{node="worker-0", name="mayastor-disk-pool"} 2147483648

# HELP disk_pool_committed_size_bytes Committed size of the disk pool in bytes
# TYPE disk_pool_committed_size_bytes gauge
disk_pool_committed_size_bytes{node="worker-0", name="mayastor-disk-pool"} 9663676416
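
The raw byte values are easier to act on as ratios. The sketch below reads exporter output in the format shown above from standard input and prints, for each pool, the used and committed space as a percentage of the total size. The function name pool_usage is hypothetical, and the sketch assumes that the three size metrics of a pool carry identical labels, as they do in the sample.

```shell
#!/bin/sh
# Read pool metrics (Prometheus text format) on stdin and print, per pool,
# used and committed space as a percentage of the total pool size.
# pool_usage is an illustrative helper, not a Puls8 command.
pool_usage() {
  awk '
    /^#/ || NF == 0 { next }                 # skip HELP/TYPE comments and blank lines
    {
      metric = $1; sub(/[{].*/, "", metric)  # metric name without the label block
      labels = ""
      if (match($0, /[{].*[}]/)) labels = substr($0, RSTART, RLENGTH)
      value = $NF                            # the sample value is the last field
      if (metric == "disk_pool_total_size_bytes")     total[labels] = value
      if (metric == "disk_pool_used_size_bytes")      used[labels] = value
      if (metric == "disk_pool_committed_size_bytes") committed[labels] = value
    }
    END {
      for (l in total) {
        if (total[l] + 0 > 0)
          printf "%s used=%.2f%% committed=%.2f%%\n", l, 100 * used[l] / total[l], 100 * committed[l] / total[l]
      }
    }'
}
```

A committed percentage above 100 means that more capacity has been committed to volumes than the pool physically holds. With a port-forward to the exporter in place, and assuming the endpoint path is /metrics, the function could be fed with curl -s http://127.0.0.1:9502/metrics | pool_usage.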

Stats Exporter Metrics

When eventing is enabled, statistics are collected by the obs-callhome-stats container within the callhome pod. These metrics are exposed on port 9090 at the /stats endpoint.

Supported Statistics Metrics

Name | Type | Unit | Description
pools_created | Gauge | Integer | Count of successfully created pools
pools_deleted | Gauge | Integer | Count of successfully deleted pools
volumes_created | Gauge | Integer | Count of successfully created volumes
volumes_deleted | Gauge | Integer | Count of successfully deleted volumes
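
The created and deleted counts can be combined into a rough net figure. The sketch below assumes that the /stats endpoint returns Prometheus text format with one sample per metric; that format is an assumption, and the function name net_counts is hypothetical. The result only covers events counted since collection began, so treat it as an indication rather than an inventory.

```shell
#!/bin/sh
# Read stats metrics on stdin and print created-minus-deleted for pools and volumes.
# Assumes one sample per metric; if a metric appears more than once, the last value wins.
# net_counts is an illustrative helper, not a Puls8 command.
net_counts() {
  awk '
    /^#/ || NF == 0 { next }                 # skip comments and blank lines
    {
      metric = $1; sub(/[{].*/, "", metric)  # drop any label block from the name
      v[metric] = $NF                        # the sample value is the last field
    }
    END {
      printf "pools=%d volumes=%d\n", v["pools_created"] - v["pools_deleted"], v["volumes_created"] - v["volumes_deleted"]
    }'
}
```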

CSI Metrics Exporter

The CSI metrics exporter provides insights into volume-level statistics. These metrics are collected by kubelet and exported for Prometheus monitoring.

Supported Volume Metrics

Name | Type | Unit | Description
kubelet_volume_stats_available_bytes | Gauge | Integer | Usable size of the volume in bytes
kubelet_volume_stats_capacity_bytes | Gauge | Integer | Total capacity of the volume in bytes
kubelet_volume_stats_used_bytes | Gauge | Integer | Amount of used space in bytes
kubelet_volume_stats_inodes | Gauge | Integer | Total number of inodes
kubelet_volume_stats_inodes_free | Gauge | Integer | Count of available inodes
kubelet_volume_stats_inodes_used | Gauge | Integer | Number of inodes used for metadata
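
A common use of these metrics is to flag volumes that are close to full. The sketch below reads kubelet metrics in Prometheus text format from standard input and lists every volume whose used-to-capacity ratio is at or above a given percentage. The function name volumes_over_threshold is hypothetical, and the sketch assumes that the used and capacity samples of a volume carry identical labels.

```shell
#!/bin/sh
# Print every volume whose used/capacity ratio is at or above the given percentage.
# Usage: volumes_over_threshold 80 < kubelet_metrics.txt
# volumes_over_threshold is an illustrative helper, not a Puls8 or kubectl command.
volumes_over_threshold() {
  awk -v limit="$1" '
    /^#/ || NF == 0 { next }                 # skip comments and blank lines
    {
      metric = $1; sub(/[{].*/, "", metric)  # metric name without the label block
      labels = ""
      if (match($0, /[{].*[}]/)) labels = substr($0, RSTART, RLENGTH)
      if (metric == "kubelet_volume_stats_used_bytes")     used[labels] = $NF
      if (metric == "kubelet_volume_stats_capacity_bytes") cap[labels] = $NF
    }
    END {
      for (l in cap) {
        if (cap[l] + 0 > 0) {
          pct = 100 * used[l] / cap[l]
          if (pct >= limit + 0) printf "%s %.1f%%\n", l, pct
        }
      }
    }'
}
```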

Performance Monitoring Stack

Earlier versions of the metrics exporters served cached data, which could be out of date by the time Prometheus polled them. The exporters now query the I/O Engine directly, in sync with the Prometheus polling cycle.

It is recommended to set the Prometheus poll interval to at least 5 minutes.

Accessing Grafana

Grafana provides a visual interface to monitor metrics collected by Prometheus. To access Grafana in your environment, follow these steps:

  1. Verify the Grafana Pod is running.

    Verify Grafana Pod
    kubectl get pods -n [NAMESPACE] | grep -i grafana
  2. Check the Grafana service IP and port.

    Find External Port Exposed by Grafana Service
    kubectl get svc -n [NAMESPACE] | grep -i grafana
  3. Access Grafana via port-forwarding.

    • Use port-forwarding to connect to Grafana locally if external access is not available:
      Connect to Grafana Locally
      kubectl port-forward --namespace [NAMESPACE] pods/[grafana-pod-name] [grafana-forward-port]:[grafana-cluster-port]
      Example: Port Forward the Grafana Service from the Puls8 Namespace to Your Local Port 8080
      kubectl port-forward svc/puls8-grafana -n puls8 8080:80
    • Once port-forwarding is established, open a browser and visit http://127.0.0.1:[grafana-forward-port] (example: http://127.0.0.1:8080).
    • Use the default login credentials: Username: admin and Password: admin.
    • After logging in, the Home page is displayed.
    • To view the Puls8 dashboards, click Dashboards on the left-hand panel. For example, if you select Puls8/Replicated PV/Mayastor/Diskpool, you can view:
      • Diskpool Status
      • Diskpool Total Size
      • Diskpool Used Size
      • Diskpool Available Size
      • IOPS
      • Throughput
      • Latency
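
When the port-forward is started from a script, Grafana may not answer immediately. The sketch below polls a URL until it responds with a successful HTTP status or a retry limit is reached. The function name wait_for_http is hypothetical. /api/health is Grafana's health-check endpoint, and the port 8080 in the usage comment matches the example above.

```shell
#!/bin/sh
# Poll a URL once per second until it returns a successful HTTP status.
# Usage: wait_for_http URL [MAX_TRIES]   (MAX_TRIES defaults to 10)
# Returns 0 as soon as the URL answers, 1 if every attempt fails.
# wait_for_http is an illustrative helper, not a Puls8 or kubectl command.
#
# Example, after starting the port-forward from the steps above in the background:
#   wait_for_http http://127.0.0.1:8080/api/health 15
wait_for_http() {
  url=$1
  tries=${2:-10}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -f makes curl fail on HTTP errors; -sS hides progress but keeps error messages
    if curl -fsS -o /dev/null "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```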

I/O Performance Metrics

DiskPool I/O Statistics

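Name | Type | Labels | Unit | Description
diskpool_num_read_ops | Gauge | name=<pool_id>, node=<pool_node> | Integer | Number of read operations on the pool
diskpool_bytes_read | Gauge | name=<pool_id>, node=<pool_node> | Integer | Total bytes read
diskpool_num_write_ops | Gauge | name=<pool_id>, node=<pool_node> | Integer | Number of write operations
diskpool_bytes_written | Gauge | name=<pool_id>, node=<pool_node> | Integer | Total bytes written
diskpool_read_latency_us | Gauge | name=<pool_id>, node=<pool_node> | Integer | Aggregate read latency in microseconds
diskpool_write_latency_us | Gauge | name=<pool_id>, node=<pool_node> | Integer | Aggregate write latency in microseconds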
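
The operation counts and aggregate latencies become rates and per-operation averages once two samples are compared. The sketch below assumes that these values accumulate over time, so that the difference between two samples describes the interval between them; that reading is an assumption based on the "aggregate" wording, not something this page confirms. The function names ops_per_second and mean_latency_us are hypothetical.

```shell
#!/bin/sh
# Both helpers are illustrative and assume the metrics are cumulative totals.

# Average operations per second between two samples of an operation count.
# Usage: ops_per_second OPS_BEFORE OPS_AFTER INTERVAL_SECONDS
ops_per_second() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN {
    # a shrinking count suggests a restart; report no value rather than a negative rate
    if (s <= 0 || b < a) { print "n/a"; exit }
    printf "%.1f\n", (b - a) / s
  }'
}

# Mean latency per operation, in microseconds, between two samples.
# Usage: mean_latency_us OPS_BEFORE LATENCY_BEFORE OPS_AFTER LATENCY_AFTER
mean_latency_us() {
  awk -v o0="$1" -v l0="$2" -v o1="$3" -v l1="$4" 'BEGIN {
    # no new operations in the interval means there is nothing to average
    if (o1 <= o0) { print "n/a"; exit }
    printf "%.1f\n", (l1 - l0) / (o1 - o0)
  }'
}
```

For two samples of diskpool_num_read_ops and diskpool_read_latency_us taken 300 seconds apart, ops_per_second gives the read IOPS over that interval and mean_latency_us gives the average read latency per operation.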

Replica I/O Statistics

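Name | Type | Labels | Unit | Description
replica_num_read_ops | Gauge | name=<replica_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Number of read operations on replica
replica_bytes_read | Gauge | name=<replica_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Total bytes read on the replica
replica_num_write_ops | Gauge | name=<replica_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Number of write operations
replica_bytes_written | Gauge | name=<replica_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Total bytes written
replica_read_latency_us | Gauge | name=<replica_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Read latency in microseconds
replica_write_latency_us | Gauge | name=<replica_uuid>, pv_name=<pv_name>, node=<replica_node> | Integer | Write latency in microseconds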

Volume Target I/O Statistics

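Name | Type | Labels | Unit | Description
volume_num_read_ops | Gauge | pv_name=<pv_name> | Integer | Number of read operations via volume
volume_bytes_read | Gauge | pv_name=<pv_name> | Integer | Total bytes read via volume
volume_num_write_ops | Gauge | pv_name=<pv_name> | Integer | Number of write operations via volume
volume_bytes_written | Gauge | pv_name=<pv_name> | Integer | Total bytes written via volume
volume_read_latency_us | Gauge | pv_name=<pv_name> | Integer | Read latency in microseconds
volume_write_latency_us | Gauge | pv_name=<pv_name> | Integer | Write latency in microseconds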

Benefits of Monitoring

  • Enables real-time visibility into storage usage, performance, and system health.
  • Assists in proactive detection and resolution of issues before they impact workloads.
  • Provides historical data for capacity planning and trend analysis.
  • Facilitates compliance with SLAs and performance benchmarks.
