Troubleshooting

Overview

This document provides guidance for identifying and resolving common issues encountered when deploying and operating DataCore Puls8 storage solutions, including Local Storage and Replicated Storage. It covers scenarios ranging from PVC provisioning failures to system-level incompatibilities, kernel constraints, and known behavioral limitations. You are encouraged to follow the documented workarounds and resolutions to ensure a stable and consistent experience in production and development environments.

Ensure that all system and platform prerequisites are met before troubleshooting. Refer to the Product Installation and Configuration documentation for environment-specific instructions.

Storage Provisioning and Mounting Issues

PVC Stuck in Pending State

Problem

A Persistent Volume Claim (PVC) created using the localpv-hostpath StorageClass remains in the Pending state, and no corresponding Persistent Volume (PV) is created.

Cause

The default Local PV StorageClasses use volumeBindingMode: WaitForFirstConsumer, which delays PV provisioning until the application pod is scheduled. If the pod specification includes a nodeName, the Kubernetes scheduler is bypassed, preventing volume provisioning.

Resolution

  • Deploy the application that uses the PVC to trigger volume provisioning.
  • Avoid setting the nodeName in the pod spec. Use a node selector instead:
    YAML
    nodeSelector:
      kubernetes.io/hostname: <desired-node-name>

Once the pod is scheduled, the PVC will be bound, and the PV will be created automatically.
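
Once the workload is deployed, you can confirm that provisioning proceeds by inspecting the PVC and its events. The PVC and namespace names below are placeholders:

Check PVC Status
# The PVC should move from Pending to Bound after the pod is scheduled
kubectl get pvc <pvc-name> -n <namespace>

# Review events for scheduling or provisioning errors
kubectl describe pvc <pvc-name> -n <namespace>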

All SCSI Devices Claimed in OpenShift

Problem

All SCSI devices on the node are claimed by the multipathd service, potentially disrupting volume device access.

Cause

The /etc/multipath.conf file is missing either the find_multipaths directive or an appropriate blacklist, causing multipathd to claim all available SCSI devices.

Resolution

Add the following to /etc/multipath.conf:

conf
defaults {
    user_friendly_names yes
    find_multipaths yes
}

Then run the following command to refresh the multipath configuration:

Refresh Multipath Configuration
multipath -w /dev/sdc

Replace /dev/sdc with the appropriate device name.
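
After updating the configuration, you can verify that multipathd no longer claims the device. The commands below are a sketch, assuming a systemd-managed multipathd and the /dev/sdc device from the example above:

Verify Multipath Configuration
# Reload the updated multipath configuration
sudo systemctl reload multipathd

# List current multipath maps; the device should no longer appear
sudo multipath -ll

# Confirm the device is visible as a regular SCSI disk
lsblk /dev/sdc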

Unable to Mount XFS File System

Problem

A volume formatted with the XFS filesystem fails to mount when used by an application.

Cause

Nodes running Linux kernel versions earlier than 5.10 may not support certain options used by newer versions of xfsprogs, resulting in mount failures.

Resolution

Upgrade the kernel on affected nodes to version 5.10 or later to ensure compatibility with newer XFS filesystem features.
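
To check whether a node is affected, compare the node's kernel version with the xfsprogs version used to format the volume. For example:

Check Kernel and xfsprogs Versions
# Kernel version on the node (should be 5.10 or later)
uname -r

# xfsprogs version used for formatting
mkfs.xfs -V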

Backup Failures

DataUploader Pod or Node Fails Mid-Upload

Problem

A backup operation fails partially when the datauploader pod or its node becomes unavailable while uploading snapshot data to S3.

Cause

During a namespace backup, volume snapshots are created and restored to temporary volumes. These are mounted by the datauploader pod, which uploads them to S3 using Kopia. If the pod or node goes down during upload:

  • Velero does not recreate the datauploader pod.
  • The temporary volume is deleted.
  • The DataUpload custom resource transitions to Failed.
  • The overall backup is marked as PartiallyFailed.

Resolution

There is no automatic recovery for this scenario. To recover:

  • Manually re-trigger the backup, as shown in the example below.
  • If the backup was created as part of a scheduled backup, the next scheduled job will attempt the backup again.
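
A minimal sketch of inspecting the failed backup and re-triggering it with the Velero CLI; the backup and namespace names are placeholders:

Re-trigger a Failed Backup
# Inspect the failed backup and its DataUpload status
velero backup describe <backup-name> --details
kubectl get datauploads -A

# Re-trigger the backup manually
velero backup create <backup-name>-retry --include-namespaces <namespace>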

CSI / REST API / Core Agent Unavailable During Backup

Problem

Backup remains stuck in InProgress or fails after timeout due to snapshot creation failure. No datauploader pod is created.

Cause

If any of the CSI components, the REST API (app=api-rest), or the core agent are unavailable:

  • VolumeSnapshots may not be created.
  • The snapshot's status.readyToUse field remains false.
  • The datauploader pod is never scheduled.
  • After the default csi-snapshot-timeout of 10 minutes, the backup moves to PartiallyFailed.

Resolution

  • Verify availability of CSI controller, REST server, and core agent.
  • Restore CSI operations before the timeout (default 10 minutes) to allow the backup to proceed.
  • If timeout is exceeded, re-trigger the backup.
  • Optionally, increase the csi-snapshot-timeout when creating the backup to accommodate temporary delays, as shown in the example below.
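
For example, you can verify component availability and re-create the backup with a longer snapshot timeout. The namespace, backup name, and pod name patterns other than app=api-rest are placeholders, and the --csi-snapshot-timeout flag assumes a Velero version that supports it:

Verify Components and Increase Snapshot Timeout
# Verify that the REST API and CSI/core agent pods are running (pod names vary by installation)
kubectl get pods -n <puls8-namespace> -l app=api-rest
kubectl get pods -n <puls8-namespace> | grep -Ei 'csi|core'

# Re-trigger the backup with a longer CSI snapshot timeout
velero backup create <backup-name> --include-namespaces <namespace> --csi-snapshot-timeout 30m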

Backup Fails Due to Insufficient Pool Capacity

Problem

Backup operation fails for one or more volumes due to lack of available storage capacity in the underlying pool.

Cause

Thick-provisioned volumes and high replica counts increase space requirements. If there is not enough capacity in the pool, snapshot creation fails, causing the backup to partially fail.

Test scenario (3-node cluster with 10 GB pool and 3-replica volumes):

Volumes   Size (GB)   Result
2         4           Pass
1         7           Pass
1         8           Fail
1         9           Fail

Resolution

  • Increase pool capacity before triggering the backup again (see the capacity check below).
  • In some cases, freeing space by deleting volumes before the 10-minute timeout can allow snapshots to succeed and the backup to complete.
  • Analyze pool usage patterns and snapshot failure thresholds for better capacity planning.
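
A sketch of checking pool capacity before re-triggering the backup, assuming the Replicated Storage pools are exposed as DiskPool custom resources; adjust resource names and the namespace to match your installation:

Check Pool Capacity
# List pools with their capacity and usage
kubectl get diskpools -n <puls8-namespace>

# Inspect a specific pool for detailed capacity information
kubectl describe diskpool <pool-name> -n <puls8-namespace>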

Backup Fails When All Replicas Are Not Online

Problem

Backup operation fails for a volume when not all of its replicas are healthy or online.

Cause

Snapshots are only created if all replicas are available. If one or more child replicas are faulted, snapshot creation fails with the following error:

Error
The number of healthy replicas does not match the expected replica count of volume '<volume-uuid>'

After the csi-snapshot-timeout (default 10 minutes), the backup enters PartiallyFailed state.

Resolution

  • Ensure all replicas are online before initiating the backup (see the check below).
  • If replica rebuild fails due to node count or topology constraints, consider:
    • Scaling down the volume (Example: Reducing replica count).
    • Resolving topology issues to allow rebuild.
  • Increase snapshot timeout if you expect replicas to recover shortly.
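
For example, you can confirm that the snapshot becomes ready once all replicas are online, before the timeout expires. The snapshot name and namespace are placeholders:

Check Snapshot Readiness
# READYTOUSE should report true once all replicas are healthy
kubectl get volumesnapshot -n <namespace>

# Inspect the snapshot for error details
kubectl describe volumesnapshot <snapshot-name> -n <namespace>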

Restore Failures

DataDownloader Pod Fails During Restore

Problem

Restore operation fails when the datadownloader pod is deleted or interrupted during data download from S3.

Cause

During restore:

  • The PVC is recreated and mounted to a temporary volume.
  • The datadownloader pod retrieves volume data from the S3 backup.
  • If the pod fails (Example: Due to a node issue or eviction), Velero does not recreate it or resume the download.

Resolution

  • Clean up stale resources (Example: Partially restored volumes, PVCs, or custom resources).
  • Re-trigger the restore operation manually, as shown in the example below.
  • Ensure node stability and availability during restore to prevent interruptions.
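
A minimal sketch of cleaning up and re-triggering the restore with the Velero CLI; resource names are placeholders:

Re-trigger a Failed Restore
# Inspect the failed restore
velero restore describe <restore-name> --details

# Remove partially restored PVCs left behind by the failed attempt
kubectl delete pvc <partially-restored-pvc> -n <namespace>

# Re-trigger the restore from the original backup
velero restore create <restore-name>-retry --from-backup <backup-name>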

Infrastructure and Platform Issues

IO Engine Fails to Start Due to IOVA Limit Error

Problem

The io-engine fails to start with the following error message:

Error
Couldn't allocate memory due to IOVA exceeding limits of current DMA mask

Cause

The host node likely has IOMMU enabled, in which case the default DMA mask width can be insufficient for the supported IOVA address ranges.

Resolution

To resolve the issue, configure the IOVA mode to use physical addressing by setting the following Helm variable during the installation or upgrade:

Set IOVA mode to 'pa'
--set openebs.mayastor.io_engine.envcontext=iova-mode=pa

This setting ensures that the io-engine operates in a mode compatible with the system’s DMA mask constraints.
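
For example, the setting can be applied during a Helm upgrade. The release name, chart reference, and namespace below are placeholders for your installation:

Apply IOVA Mode via Helm
helm upgrade <release-name> <puls8-chart> -n <puls8-namespace> --reuse-values \
  --set openebs.mayastor.io_engine.envcontext=iova-mode=pa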
