Performance Optimization
Overview
To ensure optimal performance of the Replicated PV Mayastor, you can optionally fine-tune three key configuration areas that influence efficiency, throughput, and operational consistency:
- Storage Performance Development Kit (SPDK) Blobstore Cluster Size: Adjusts the allocation unit for the SPDK blobstore. Choosing the appropriate cluster size (per pool or globally) balances storage efficiency with metadata overhead and directly impacts pool creation, import times, and large sequential I/O performance.
- Remote Direct Memory Access (RDMA) Enablement: Enables NVMe-over-Fabrics (NVMe-oF) with RDMA to deliver low-latency, high-throughput data paths. This section covers hardware prerequisites, interface validation, and TCP fallback behavior when RDMA is unavailable.
- Central Processing Unit (CPU) Isolation: Isolates CPU cores for Replicated PV Mayastor’s reactor threads to minimize scheduling interruptions and maximize I/O responsiveness. It explains kernel parameter configuration and Helm settings for dedicated core allocation.
This document explains how to configure the blobstore cluster size, enable and validate RDMA, and set up CPU isolation to achieve consistent performance and simplified management across your Kubernetes storage environment.
SPDK Blobstore Cluster Size
The SPDK Blobstore Cluster Size configuration feature helps you fine-tune storage performance and efficiency for Replicated PV Mayastor DiskPools in Kubernetes environments. By selecting an appropriate cluster size during pool creation, you can optimize on-disk layout, reduce metadata overhead, and accelerate pool import and rebuild operations, especially on large-capacity storage devices.
Blobstore Cluster Size Considerations
The blobstore cluster size determines the allocation unit size for data in the blobstore backend of a Replicated PV Mayastor DiskPool.
- Smaller cluster sizes (default: 4 MiB) provide higher storage efficiency but generate more metadata overhead.
- Larger cluster sizes (for example, 16 MiB or 32 MiB) reduce metadata overhead, accelerate pool creation and import operations, and facilitate better performance for large sequential I/O workloads.
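For illustration, a 20 TiB device divided into 4 MiB clusters yields roughly 5.2 million clusters, while 32 MiB clusters yield about 655,000, an eight-fold reduction in the number of per-cluster metadata entries the blobstore must write and maintain.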
Before modifying the default setting, evaluate application I/O patterns and device capacities. Maintaining a consistent cluster size across pools simplifies replica scheduling and ongoing management.
Configuring Blobstore Cluster Size
Per-Pool Configuration (DiskPool Custom Resource)
Specify the cluster_size field in the DiskPool custom resource manifest to configure the cluster size for an individual pool. This provides granular control for specific storage devices.
Example: DiskPool Custom Resource with Cluster Size
apiVersion: "openebs.io/v1beta3"
kind: DiskPool
metadata:
  name: <pool_name>
  namespace: <namespace>
spec:
  node: <node_name>
  disks: ["/disk/path"]
  cluster_size: 32MiB
Global Configuration (Helm Chart)
Set a global cluster size for all new pools that do not specify one in their custom resource. Provide the size in bytes.
Example: Helm Chart Variable for Global Cluster Size
--set openebs.engines.replicated.mayastor.agents.core.poolClusterSize=33554432
The value above sets the global cluster size to 32 MiB (33554432 bytes).
Volume Provisioning
A new StorageClass parameter, poolClusterSize, ensures that only pools matching the specified cluster size are used when scheduling replicas.
- If sufficient matching pools are unavailable, volume provisioning will fail.
- Replica rebuilds for existing volumes may also fail if matching pools cannot be located.
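As a rough sketch, the parameter would sit alongside the usual Replicated PV Mayastor StorageClass parameters. The provisioner name, the repl and protocol parameters, and the value format shown here are assumptions; confirm them against your release documentation.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: <storage_class_name>
parameters:
  protocol: nvmf
  repl: "3"
  poolClusterSize: "32MiB"   # assumed value format; must match the cluster size of the target pools
provisioner: io.openebs.csi-mayastor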
Best Practices
- Advanced Configuration: Changing the default cluster size (4 MiB) is intended for advanced configurations. Perform a thorough assessment of application I/O patterns and storage capacity before making adjustments.
- Validated Scale: Internal testing has verified a 32 MiB cluster size on devices up to 20 TiB, with pool import times averaging about three minutes on high-performance cloud disks.
- Operational Consistency: For simplified management and predictable replica scheduling, minimize the number of cluster sizes in your deployment. As a best practice for large-capacity environments, configure a global blobstore cluster size of 16 MiB or 32 MiB to achieve an optimal balance of performance and efficiency.
Benefits of Larger Blobstore Cluster Size
- Faster Pool Creation: When a pool is created, the device is formatted by writing metadata for every cluster. Fewer clusters mean less metadata to write, which significantly reduces the time it takes to create a pool on a large device.
- Quicker Pool Imports: During startup or recovery, Replicated PV Mayastor imports existing pools by reading their metadata from disk. A more compact metadata layout (due to larger clusters) requires fewer I/O operations, making the import process much quicker.
- Reduced Metadata Overhead: Larger clusters decrease the amount of metadata that SPDK must maintain.
RDMA Enablement
RDMA support in Replicated PV Mayastor enables significant improvements in storage performance by reducing latency and increasing throughput for workloads using NVMe-over-Fabrics (NVMe-oF). This feature utilizes RDMA-capable network interfaces (RNICs) to achieve high-speed, low-latency communication across nodes.
Requirements
Interface Validation
Ensure the interface specified by the io_engine.target.nvmf.iface Helm parameter exists on all io-engine nodes and is RDMA-capable. If not, those nodes will default to TCP communication.
Application Node Requirements
Application nodes must also have RDMA-capable devices to establish RDMA connections. This requirement is independent of the iface parameter and is specific to the nodes where applications are scheduled.
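One way to check for RDMA-capable devices on a node (io-engine or application node), assuming the iproute2 rdma utility is installed, is to list the RDMA links and confirm the relevant interface appears:
rdma link show
# RDMA-capable interfaces are listed with their associated netdev, for example: link <device>/1 state ACTIVE ... netdev <iface_name>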
Enabling RDMA via Helm
To enable the RDMA feature via Helm:
- Set openebs.mayastor.io_engine.target.nvmf.rdma.enabled to true.
- Set openebs.mayastor.io_engine.target.nvmf.iface to a valid network interface name that exists on an RNIC.
- Verify that all nodes are properly configured with RDMA-capable hardware and that network interfaces are correctly identified and accessible.
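As a sketch, both parameters can be passed in a single helm upgrade of an existing release; the release name, chart reference, namespace, and interface name below are placeholders:
helm upgrade <release_name> <chart> -n <namespace> --reuse-values \
  --set openebs.mayastor.io_engine.target.nvmf.rdma.enabled=true \
  --set openebs.mayastor.io_engine.target.nvmf.iface=<rdma_iface_name>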
- Once enabled, all Replicated PV Mayastor volumes will attempt RDMA connections.
- If an application runs on a non-RDMA-capable node, it falls back to TCP unless fallback has been disabled via Helm.
When fallback is disabled, pods on non-RDMA nodes will fail to connect to volumes. Either re-enable fallback or move the pods to RDMA-capable nodes.
- Software-emulated RDMA (Soft-RoCEv2) is supported on nodes without RNICs. A virtual RDMA device can be created on top of an ordinary network interface (see the sketch after this note).
GID assignment on Soft-RoCEv2 depends on the CNI and cluster networking; behavior can vary and has not been fully tested.
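A minimal sketch of creating a Soft-RoCEv2 device with the iproute2 rdma utility, assuming eth0 is the node's network interface and rxe0 is an arbitrary device name:
sudo rdma link add rxe0 type rxe netdev eth0
sudo rdma link show   # the new rxe0 device should now be listed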
Benefits of RDMA Enablement
- Lower Latency Data Path: RDMA bypasses the kernel network stack for storage traffic, enabling direct memory-to-memory transfers and significantly reducing I/O latency.
- Higher Throughput: By offloading network processing from CPUs to RDMA-capable NICs, Replicated PV Mayastor can sustain higher bandwidth and handle more concurrent operations.
CPU Isolation
The Replicated PV Mayastor fully utilizes each CPU core assigned to it by spawning a dedicated thread (reactor) on each. These reactor threads execute continuously, serving I/O operations without sleeping or blocking. Other threads within the I/O engine, which are not bound to specific CPUs, may block or sleep as needed.
For optimal performance, it is important that these bound reactor threads experience minimal interruptions. Ideally, they should only be interrupted by essential kernel-based time accounting processes. In practice, this is difficult to achieve, but improvements can be made using the isolcpus kernel parameter.
The isolcpus boot parameter does not prevent kernel threads or other Kubernetes pods from running on the isolated CPUs. However, it does prevent system services such as kubelet from interfering with the I/O engine's dedicated cores.
Configure Kernel Boot Parameters
Add the isolcpus kernel parameter to instruct the Linux scheduler to isolate specific CPU cores from general scheduling.
The location of the GRUB configuration file may vary depending on your Linux distribution. For example:
- Standard Linux:
/etc/default/grub - Ubuntu 20.04 on AWS EC2:
/etc/default/grub.d/50-cloudimg-settings.cfg
In this example, we isolate CPU cores 2 and 3 (on a 4-core system).
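As a sketch, the parameter is appended to the kernel command line variable in that file; the existing options are placeholders here and must be preserved:
GRUB_CMDLINE_LINUX_DEFAULT="<existing options> isolcpus=2,3"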
Update GRUB Configuration
After modifying the GRUB configuration file, update the bootloader to apply changes.
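On Debian and Ubuntu systems this is typically done with update-grub (other distributions may use grub2-mkconfig instead); the output resembles the following:
sudo update-grub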
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/40-force-partuuid.cfg'
Sourcing file `/etc/default/grub.d/50-cloudimg-settings.cfg'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.8.0-29-generic
Found initrd image: /boot/microcode.cpio /boot/initrd.img-5.8.0-29-generic
Found linux image: /boot/vmlinuz-5.4.0-1037-aws
Found initrd image: /boot/microcode.cpio /boot/initrd.img-5.4.0-1037-aws
Found Ubuntu 20.04.2 LTS (20.04) on /dev/xvda1
done
Reboot the System
Reboot the system to enable the new kernel parameters.
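For example:
sudo reboot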
Verify Isolated CPU Cores
Once the system is back online, confirm that the isolcpus parameter is active and functioning as expected.
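One way to verify is to print the active kernel command line and confirm that isolcpus=2,3 appears:
cat /proc/cmdline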
BOOT_IMAGE=/boot/vmlinuz-5.8.0-29-generic root=PARTUUID=7213a253-01 ro console=tty1 console=ttyS0 nvme_core.io_timeout=4294967295 isolcpus=2,3 panic=-1
Update Helm Configuration
To ensure Replicated PV Mayastor utilizes the isolated cores, update its configuration using the kubectl puls8 mayastor plugin.
Ensure that the kubectl puls8 mayastor plugin is installed and matches the Helm chart version of your deployment.
kubectl puls8 mayastor upgrade -n <namespace> --set 'openebs.mayastor.io_engine.coreList={2,3}'
CPU core indexing begins at 0. Therefore, coreList={2,3} corresponds to the third and fourth cores.
Benefits of CPU Isolation
- Consistent I/O Performance: Dedicating CPU cores to Replicated PV Mayastor’s reactor threads minimizes context switching and scheduling delays, reducing latency spikes.
- Predictable Resource Allocation: Explicitly reserving cores prevents unexpected contention from other workloads, simplifying capacity planning and performance tuning.
- Better Real-Time Responsiveness: Reactor threads can run uninterrupted, improving stability and predictability for latency-sensitive storage operations.