Known Issues
Installation Issues
An io-engine pod restarts unexpectedly with exit code 132 while mounting a PVC
The io-engine process received a SIGILL signal because it attempted to execute an illegal instruction. This indicates that the host node's CPU does not meet the instruction set prerequisite for DataCore Bolt (SSE4.2 on x86-64).
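As a quick check, the exit code can be decoded directly: codes above 128 mean the process was killed by signal (exit code - 128). The commands below are a sketch and assume a Linux node, since /proc/cpuinfo is Linux-specific:

```shell
# Exit code 132 = 128 + signal number, so the signal is 132 - 128 = 4 (SIGILL).
kill -l $((132 - 128))

# On an x86-64 node, verify that the CPU advertises SSE4.2:
grep -o -m1 'sse4_2' /proc/cpuinfo || echo "SSE4.2 not reported by this CPU"
```

If the second command does not report `sse4_2`, the node's CPU cannot run io-engine and the pod will keep crashing with exit code 132.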
Other Issues
The io-engine pod may restart if a DiskPool is inaccessible
If the disk device backing a DiskPool becomes inaccessible or goes offline, the io-engine pod hosting that pool may panic. A fix for this behavior is under investigation.
Lengthy worker node reboot times
Rebooting a node that runs an application with mounted Bolt volumes can take several minutes. The cause is the long default NVMe controller loss timeout (ctrl_loss_tmo). The solution is to follow Kubernetes best practice and cordon the node, ensuring no application pods are running on it, before the reboot. The ioTimeout storage class parameter can be used to fine-tune the timeout.
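A StorageClass carrying the ioTimeout parameter might look like the sketch below. The class name, provisioner string, and the other parameter names and values are illustrative assumptions; only ioTimeout is the parameter referenced above, and it should be adjusted to your installation:

```yaml
# Illustrative sketch only; provisioner and parameter values are assumptions.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: bolt-short-timeout          # assumed name
parameters:
  protocol: nvmf                    # assumed parameter
  ioTimeout: "60"                   # lower the NVMe I/O timeout (value in seconds, assumed)
provisioner: bolt.csi.datacore.com  # assumption: check your installation for the actual provisioner
```

Volumes provisioned from such a class would time out I/O sooner instead of waiting out the full default ctrl_loss_tmo.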
Node restarts on scheduling an application
Deploying an application pod on a worker node that hosts both an io-engine pod and the Prometheus metrics exporter causes that node to restart.
The issue is caused by a kernel bug: once the volume controller disconnects, the entries under /host/sys/class/hwmon/
should be removed, but in this case they are not. (The issue was fixed via this kernel patch.)
Fix: Use kernel version extra-5.31.0 or later if deploying DataCore Bolt in conjunction with the Prometheus metrics exporter.
Application pod stuck in pending
An application pod remains in the Pending state if the node on which it is scheduled does not have an io-engine pod deployed. Ensure that every node intended to run application pods has an io-engine pod running on it.
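One way to verify coverage is to compare the set of worker nodes against the nodes where io-engine pods are running. This is a sketch: the label selector `app=io-engine` and namespace `datacore` are assumptions and may differ in your installation:

```shell
# List all nodes in the cluster.
kubectl get nodes -o name

# List io-engine pods and the nodes they run on; any node meant to
# host application pods should appear here too.
kubectl get pods -n datacore -l app=io-engine -o wide
```

Any node missing from the second list will leave pods scheduled to it stuck in Pending.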