Last week one of our hosts unexpectedly got restarted, and once the host came back online we were unable to power on a VM (a passive cluster node) due to an error like:
Incompatible device backing specified for device '0'
(HA had not restarted this VM automatically because of a VM-to-host "Must" DRS rule.)
This error usually occurs when the LUN is not consistently mapped across the hosts running the primary and secondary cluster nodes. However, when we cross-checked, everything (LUN number and naa.id) looked correct on the affected host.
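For reference, this is roughly how such a cross-check can be done from the ESXi shell; naa.id below is just a placeholder for the LUN's actual identifier.
#esxcli storage core device list -d naa.id
#esxcli storage core path list -d naa.id | grep -i "LUN:"
The first command confirms the device is visible on the host; the second prints the LUN number behind each path, which should match on every host sharing the RDM.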
As this was the passive node, we removed the affected drive from the VM, powered the node back on, and then started investigating the issue.
On checking the vml.id of this LUN on both hosts, we found that it differed; strangely, it was correct on the host in question but wrong on all other hosts in the cluster. To share a LUN between cluster nodes, it must be consistently mapped on all hosts with the same unique vml.id (VMware Legacy ID). Since it was different here, it appeared that the metadata of the RDM disk pointer file had become corrupted.
You can find the vml.id of a LUN as follows: first note down or copy the identifier of the LUN (naa.id), then run:
#esxcli storage core device list -d naa.id
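In the output, the vml.id typically appears under the Other UIDs field; the grep filters below are my own addition, not part of the KB. Each vml.* entry under /vmfs/devices/disks is also a symlink to its naa.* device, which gives a second way to read the mapping:
#esxcli storage core device list -d naa.id | grep -i vml
#ls -l /vmfs/devices/disks/ | grep -i vml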
Now, to fix this issue, remove the affected RDM disk from both nodes and then delete the RDM pointer file from the datastore (this does not affect the actual data on the LUN). After re-scanning the hosts for datastores, re-add the LUN as an RDM disk on both nodes. You should then be able to power on the affected node.
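If you prefer the ESXi shell over the vSphere client for the re-add, vmkfstools can recreate the RDM pointer file. This is only a sketch: the datastore and VM folder names are placeholders, and physical compatibility mode (-z) is an assumption, so use -r instead if you need virtual compatibility mode.
#vmkfstools -z /vmfs/devices/disks/naa.id /vmfs/volumes/DATASTORE/VMNAME/VMNAME_rdm.vmdk
The resulting pointer file can then be attached to both nodes as an existing hard disk.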
If for any reason the above does not work, then, as above, after removing the affected RDM drives from both nodes, follow these steps:
- Note the NAA_ID of the LUN.
- Detach the RDM using the vSphere client.
- Un-present the LUN from the host on the storage array.
- Rescan the host storage.
- Remove the LUN from the detached list using these commands:
#esxcli storage core device detached list
#esxcli storage core device detached remove -d naa.id
- Rescan the host storage.
- Re-present the LUN to the host.
- Now rescan the hosts for datastores again.
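For the rescan steps, the equivalent from the ESXi shell is a storage adapter rescan. Rescanning all adapters is an assumption on my part, and you can also limit it to a single HBA (vmhba1 below is just an example adapter name):
#esxcli storage core adapter rescan --all
#esxcli storage core adapter rescan --adapter vmhba1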
If the LUN has
been flagged as perennially reserved, this can prevent the removal from
succeeding.
Run this command to remove the flag:
#esxcli storage core device setconfig -d naa.id --perennially-reserved=false
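To confirm the flag is actually cleared before retrying the removal (the grep filter is my own addition), you can check the device details again:
#esxcli storage core device list -d naa.id | grep -i "Perennially Reserved"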
Now the command to remove the device should work.
# esxcli storage core device detached remove -d naa.id
Now cross-check the vml.id on the hosts; it should be the same everywhere, and after re-adding the RDM drive to the nodes you will be able to power on the VM nodes.
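To compare the vml.id across all hosts in one go, a small loop from a management machine with SSH enabled on the hosts works; the host names and the grep on vml are assumptions for illustration:
#for h in esx01 esx02; do echo "== $h =="; ssh root@$h "esxcli storage core device list -d naa.id | grep -i vml"; done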
Reference: VMware KB 1016210
Update: Apr 2018
I didn't test it, but I found this workaround listed in a related KB #205489:
- While adding the hard disk to additional nodes of the cluster, instead of selecting Existing Hard Disk under the New device drop-down menu, select RDM Disk and click Add.
- Select the LUN (naa.id) that was added to the first node of the cluster. The LUN number may be different on this host.
- Verify that the disk was added successfully.
That’s it… :)