Troubleshooting
Explore this Page
- Overview
- Storage Provisioning and Mounting Issues
- Backup Failures
- Restore Failures
- Infrastructure and Platform Issues
Overview
This document provides guidance for identifying and resolving common issues encountered when deploying and operating DataCore Puls8 storage solutions, including Local Storage and Replicated Storage. It covers scenarios ranging from PVC provisioning failures to system-level incompatibilities, kernel constraints, and known behavioral limitations. You are encouraged to follow the documented workarounds and resolutions to ensure a stable and consistent experience in production and development environments.
Ensure that all system and platform prerequisites are met before troubleshooting. Refer to the Product Installation and Configuration documentation for environment-specific instructions.
Storage Provisioning and Mounting Issues
PVC Stuck in Pending State
Problem
A Persistent Volume Claim (PVC) created using the `localpv-hostpath` StorageClass remains in the `Pending` state, and no corresponding Persistent Volume (PV) is created.
Cause
The default Local PV StorageClasses use `volumeBindingMode: WaitForFirstConsumer`, which delays PV provisioning until the application pod is scheduled. If the pod specification includes a `nodeName`, the Kubernetes scheduler is bypassed, preventing volume provisioning.
Resolution
- Deploy the application that uses the PVC to trigger volume provisioning.
- Avoid setting `nodeName` in the pod spec. Use a node selector instead, as shown in the example below.
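The following is a minimal sketch of a pod spec that targets a node with `nodeSelector` rather than `nodeName`; the pod name, image, and PVC name are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hostpath-app                   # placeholder pod name
spec:
  nodeSelector:
    kubernetes.io/hostname: <target-node-hostname>   # label of the node that should host the volume
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: local-storage
          mountPath: /data
  volumes:
    - name: local-storage
      persistentVolumeClaim:
        claimName: <pvc-name>          # placeholder PVC created with the localpv-hostpath StorageClass
```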
Once the pod is scheduled, the PVC will be bound, and the PV will be created automatically.
All SCSI Devices Claimed in OpenShift
Problem
All SCSI devices on the node are claimed by the `multipathd` service, potentially disrupting volume device access.
Cause
The `/etc/multipath.conf` file is missing either the `find_multipaths` directive or an appropriate blacklist, causing `multipathd` to claim all available SCSI devices.
Resolution
Add the following to `/etc/multipath.conf`:
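A minimal sketch of the directive, using standard multipath.conf syntax; add or adjust a blacklist section to match your environment.

```
defaults {
    find_multipaths "yes"
}
```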
Then run the following command to refresh the multipath configuration:
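As an illustration only, assuming the standard multipath tools are installed: remove the device's WWID so `multipathd` stops claiming it, then reload the service to apply the updated configuration.

```bash
# Remove the WWID of the device from the wwids file (illustrative device name).
sudo multipath -w /dev/sdc

# Reload multipathd so the updated /etc/multipath.conf takes effect.
sudo systemctl reload multipathd
```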
Replace `/dev/sdc` with the appropriate device name.
Unable to Mount XFS File System
Problem
A volume formatted with the XFS filesystem fails to mount when used by an application.
Cause
Nodes running Linux kernel versions earlier than 5.10 may not support certain options used by newer versions of `xfsprogs`, resulting in mount failures.
Resolution
Upgrade the kernel on affected nodes to version 5.10 or later to ensure compatibility with newer XFS filesystem features.
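To confirm whether a node is affected, you can check its running kernel version, for example:

```bash
# Print the running kernel version on the node; versions below 5.10 may hit this issue.
uname -r
```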
Backup Failures
DataUploader Pod or Node Fails Mid-Upload
Problem
A backup operation fails partially when the `datauploader` pod or its node becomes unavailable while uploading snapshot data to S3.
Cause
During a namespace backup, volume snapshots are created and restored to temporary volumes. These are mounted by the `datauploader` pod, which uploads them to S3 using Kopia. If the pod or node goes down during upload:
- Velero does not recreate the `datauploader` pod.
- The temporary volume is deleted.
- The `DataUpload` custom resource transitions to `Failed`.
- The overall backup is marked as `PartiallyFailed`.
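To confirm this failure mode, you can inspect the backup and its DataUpload resources. A sketch, assuming Velero runs in the `velero` namespace and the backup name is a placeholder:

```bash
# Show the overall backup status and per-volume upload details.
velero backup describe <backup-name> --details

# List DataUpload custom resources and look for a Failed phase.
kubectl -n velero get datauploads.velero.io
```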
Resolution
There is no automatic recovery for this scenario. To recover:
- Manually re-trigger the backup.
- If the backup was created as part of a scheduled backup, the next scheduled job will attempt the backup again.
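For example, a manual re-trigger with the Velero CLI might look like the following (backup, schedule, and namespace names are placeholders):

```bash
# Re-run the backup for the affected namespace.
velero backup create <backup-name> --include-namespaces <namespace>

# Alternatively, trigger a new backup immediately from an existing schedule.
velero backup create --from-schedule <schedule-name>
```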
CSI / REST API / Core Agent Unavailable During Backup
Problem
Backup remains stuck in `InProgress` or fails after timeout due to snapshot creation failure. No `datauploader` pod is created.
Cause
If any of the CSI components, the REST API (`app=api-rest`), or the core agent are unavailable:
- VolumeSnapshots may not be created (`snapshot.status.readyToUse = false`).
- The `datauploader` pod is never scheduled.
- After the default `csi-snapshot-timeout` of 10 minutes, the backup moves to `PartiallyFailed`.
Resolution
- Verify availability of CSI controller, REST server, and core agent.
- Restore CSI operations before the timeout (default 10 minutes) to allow the backup to proceed.
- If timeout is exceeded, re-trigger the backup.
- Optionally, increase the `csi-snapshot-timeout` when creating the backup to accommodate temporary delays (see the example below).
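A sketch of both steps, assuming the control-plane components run in a dedicated namespace (shown as a placeholder) and a Velero version that supports the `--csi-snapshot-timeout` flag:

```bash
# Check that the REST API, CSI, and core agent pods are running (label and name patterns are illustrative).
kubectl get pods -n <puls8-namespace> -l app=api-rest
kubectl get pods -n <puls8-namespace> | grep -E 'csi|agent-core'

# Re-trigger the backup with a longer CSI snapshot timeout (default is 10 minutes).
velero backup create <backup-name> --include-namespaces <namespace> --csi-snapshot-timeout 20m
```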
Backup Fails Due to Insufficient Pool Capacity
Problem
Backup operation fails for one or more volumes due to lack of available storage capacity in the underlying pool.
Cause
Thick-provisioned volumes and high replica counts increase space requirements. If there is not enough capacity in the pool, snapshot creation fails, causing the backup to partially fail.
Test scenario (3-node cluster with 10 GB pool and 3-replica volumes):
| Volumes | Size (GB) | Result |
|---|---|---|
| 2 | 4 | Pass |
| 1 | 7 | Pass |
| 1 | 8 | Fail |
| 1 | 9 | Fail |
Resolution
- Increase pool capacity before triggering the backup again.
- In some cases, freeing space by deleting volumes before the 10-minute timeout can allow snapshots to succeed and the backup to complete.
- Analyze pool usage patterns and snapshot failure thresholds for better capacity planning.
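To review available pool capacity before re-triggering the backup, you can inspect the pool resources. This is a sketch assuming the pools are exposed as DiskPool custom resources (as in the upstream Mayastor-based engine) in a placeholder namespace:

```bash
# List storage pools with their capacity and used space (resource name assumed to be diskpools).
kubectl get diskpools -n <puls8-namespace>
```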
Backup Fails When All Replicas Are Not Online
Problem
Backup operation fails for a volume when not all of its replicas are healthy or online.
Cause
Snapshots are only created if all replicas are available. If one or more child replicas are faulted, snapshot creation fails with the following error:

`The number of healthy replicas does not match the expected replica count of volume '<volume-uuid>'`

After the `csi-snapshot-timeout` (default 10 minutes), the backup enters the `PartiallyFailed` state.
Resolution
- Ensure all replicas are online before initiating backup.
- If replica rebuild fails due to node count or topology constraints, consider:
- Scaling down the volume (Example: Reducing replica count).
- Resolving topology issues to allow rebuild.
- Increase snapshot timeout if you expect replicas to recover shortly.
Restore Failures
DataDownloader Pod Fails During Restore
Problem
Restore operation fails when the `datadownloader` pod is deleted or interrupted during data download from S3.
Cause
During restore:
- The PVC is recreated and mounted to a temporary volume.
- The `datadownloader` pod retrieves volume data from the S3 backup.
- If the pod fails (Example: Due to node issue or eviction), Velero does not recreate it or resume the download.
Resolution
- Clean up stale resources (Example: Partially restored volumes, PVCs, or custom resources).
- Re-trigger the restore operation manually.
- Ensure node stability and availability during restore to prevent interruptions.
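A sketch of the cleanup and re-trigger, assuming the Velero CLI is installed and using placeholder resource names:

```bash
# Remove a partially restored PVC before retrying (names are placeholders).
kubectl delete pvc <partially-restored-pvc> -n <namespace>

# Re-trigger the restore from the original backup.
velero restore create --from-backup <backup-name>
```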
Infrastructure and Platform Issues
IO Engine Fails to Start Due to IOVA Limit Error
Problem
The `io-engine` fails to start with the following error message:
Cause
The host node likely has IOMMU enabled, which can make the default DMA mask width insufficient for the supported IOVA address ranges.
Resolution
To resolve the issue, configure the IOVA mode to use physical addressing by setting the following Helm variable during the installation or upgrade:
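The exact Helm key depends on the chart version; as an illustration only, passing DPDK's `iova-mode=pa` environment context to the io-engine might look like the following (the value path `io_engine.envcontext` is an assumption based on the upstream chart):

```bash
# Hypothetical Helm value path; confirm the key name against your chart's values.yaml.
helm upgrade <release-name> <chart> -n <puls8-namespace> \
  --set io_engine.envcontext="iova-mode=pa"
```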
This setting ensures that the io-engine operates in a mode compatible with the system’s DMA mask constraints.
Learn More