Known Issues
Installation Issues
An io-engine pod restarts unexpectedly with exit code 132 while mounting a PVC
The io-engine process received a SIGILL signal because it attempted to execute an illegal instruction. This indicates that the host node's CPU does not meet the instruction set prerequisite for DataCore Bolt (SSE4.2 on x86-64).
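As a quick check, the exit code can be decoded directly: codes above 128 mean the process was killed by signal (exit code - 128). The commands below are a sketch and assume a Linux node, since /proc/cpuinfo is Linux-specific:

```shell
# Exit code 132 = 128 + signal number, so the signal is 132 - 128 = 4 (SIGILL).
kill -l $((132 - 128))

# On an x86-64 node, verify that the CPU advertises SSE4.2:
grep -o -m1 'sse4_2' /proc/cpuinfo || echo "SSE4.2 not reported by this CPU"
```

If the second command does not report `sse4_2`, the node's CPU cannot run io-engine and the pod will keep crashing with exit code 132.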
Other Issues
The io-engine pod may restart if a DiskPool is inaccessible
If the disk device backing a DiskPool becomes inaccessible or goes offline, the io-engine pod hosting that pool may panic. A fix for this behavior is under investigation.
Lengthy worker node reboot times
Rebooting a node that runs an application with mounted Bolt volumes can take several minutes. The cause is the long default NVMe controller loss timeout (ctrl_loss_tmo). The solution is to follow Kubernetes best practice and cordon the node, ensuring no application pods are running on it, before the reboot. The ioTimeout storage class parameter can be used to fine-tune the timeout.
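A StorageClass carrying the ioTimeout parameter might look like the sketch below. The class name, provisioner string, and the other parameter names and values are illustrative assumptions; only ioTimeout is the parameter referenced above, and it should be adjusted to your installation:

```yaml
# Illustrative sketch only; provisioner and parameter values are assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: bolt-short-timeout          # assumed name
parameters:
  protocol: nvmf                    # assumed parameter
  ioTimeout: "60"                   # lower the NVMe I/O timeout (value in seconds, assumed)
provisioner: bolt.csi.datacore.com  # assumption: check your installation for the actual provisioner
```

Volumes provisioned from such a class would time out I/O sooner instead of waiting out the full default ctrl_loss_tmo.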
Node restarts on scheduling an application
Deploying an application pod on a worker node that hosts both an io-engine pod and the Prometheus metrics exporter causes that node to restart.
The issue is caused by a kernel bug: once the volume controller disconnects, the entries under /host/sys/class/hwmon/
should be removed, but in this case they are not. (The issue was fixed via this kernel patch.)
Fix: Use kernel version extra-5.31.0 or later if deploying DataCore Bolt in conjunction with the Prometheus metrics exporter.
Application pod stuck in pending
An application pod remains in the Pending state if the node on which it is scheduled does not have an io-engine pod deployed. Ensure that every node intended to run application pods has an io-engine pod running on it.
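One way to verify coverage is to compare the set of worker nodes against the nodes where io-engine pods are running. This is a sketch: the label selector `app=io-engine` and namespace `datacore` are assumptions and may differ in your installation:

```shell
# List all nodes in the cluster.
kubectl get nodes -o name

# List io-engine pods and the nodes they run on; any node meant to
# host application pods should appear here too.
kubectl get pods -n datacore -l app=io-engine -o wide
```

Any node missing from the second list will leave pods scheduled to it stuck in Pending.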