Sunday, June 19, 2016

After an unexpected host reboot, powering on an RDM-attached virtual machine fails with the error: Incompatible device backing specified for device '0'

Last week one of our hosts unexpectedly restarted, and once it came back online we were unable to power on a VM (a passive cluster node) due to an error like,

Incompatible device backing specified for device '0'

HA didn’t restart this VM because of a VM-to-Host "Must" DRS rule.

This error usually occurs when the LUN is not consistently mapped across the hosts running the primary/secondary nodes; however, when we cross-checked, everything (LUN number and naa.id) was correct on the affected host.

As this was a passive node, we removed the affected drive from the VM, powered the node back on, and then started investigating the issue.

On checking the vml.id of this LUN on both hosts, we found it differed. The strange thing was that it was correct on the host in question but wrong on all the other hosts in the cluster. To share a LUN between nodes, it should be consistently mapped on all hosts and carry the same unique vml.id (VMware Legacy ID); since it differed here, it looked like the RDM disk pointer file's metadata had got corrupted.

You can find the vml.id of a LUN as follows:

First note down/copy the identifier of the LUN (naa.id) and then run this command:
#esxcli storage core device list -d naa.id
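The same output also shows the vml.id (it appears under "Other UIDs" on the ESXi builds I've worked with), and the device symlinks give another quick way to compare it across hosts; naa.id below is just a placeholder for your LUN's identifier:

#esxcli storage core device list -d naa.id | grep -i "other uids"
#ls -l /vmfs/devices/disks/ | grep -i naa.id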

Now, to fix this issue, remove the affected RDM disk from both nodes and then delete the RDM pointer file from the datastore (this doesn't affect the actual data on the LUN). After rescanning the hosts for datastores, re-add the LUN as an RDM drive on both nodes. You should now be able to power on the affected node.
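If you'd rather handle the pointer file from the ESXi shell than through the vSphere client, a rough sketch is below; the datastore path, VM folder and pointer file name (node1_1.vmdk) are made-up examples, and physical compatibility mode (-z) is an assumption, so adjust for your setup:

#vmkfstools -q /vmfs/volumes/datastore1/node1/node1_1.vmdk
#vmkfstools -U /vmfs/volumes/datastore1/node1/node1_1.vmdk
#vmkfstools -z /vmfs/devices/disks/naa.id /vmfs/volumes/datastore1/node1/node1_1.vmdk

The first command reports which vml.id the existing pointer maps to (handy for confirming the corruption), the second deletes the stale pointer file, and the third recreates it against the LUN in physical compatibility mode (use -r instead of -z for virtual compatibility).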

If for any reason the above doesn't work, then, after removing the affected RDM drives from both nodes as before, follow these steps:
  1. Note the NAA_ID of the LUN.
  2. Detach RDM using vSphere client.
  3. Un-present the LUN from the host on the storage array.
  4. Rescan host storage (see the CLI sketch just after this list).
  5. Remove LUN from detached list using these commands:

    #esxcli storage core device detached list
    #esxcli storage core device detached remove -d naa.id
  6. Rescan the host storage. 
  7. Re-present LUN to host. 
  8. Rescan the hosts for datastores again.
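For the rescan steps above, the CLI equivalents are roughly the following (assuming you want to rescan all adapters rather than a single vmhba; vmkfstools -V refreshes the VMFS volumes):

#esxcli storage core adapter rescan --all
#vmkfstools -V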
If the LUN has been flagged as perennially reserved, this can prevent the removal from succeeding.

Run this command to remove the flag:

#esxcli storage core device setconfig -d naa.id --perennially-reserved=false
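To confirm the flag actually changed before retrying the removal, re-check the device details; the state shows up as "Is Perennially Reserved" in the device list output:

#esxcli storage core device list -d naa.id | grep -i "perennially reserved"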

Now the command to remove the device should work.

#esxcli storage core device detached remove -d naa.id

Now cross-check the vml.id on the hosts; it should be the same. After adding the RDM drive back to the nodes, you will be able to power on the VM nodes.

Reference: VMware KB# 1016210

Update: Apr 2018

I didn't test it, but found this workaround listed in a related KB #205489:
  1. While adding the hard disk to the additional nodes of the cluster, instead of selecting Existing Hard Disk under the New device drop-down menu, select RDM Disk and click Add.
  2. Select the LUN (naa.id) that was added to the first node of the cluster. The LUN number may be different on this host.
  3. Verify that the disk got added successfully.

That’s it… :) 

