Wednesday, May 9, 2018

Inconsistent LUN mapping issues on ESXi hosts

I recently came across an issue where, for some reason, the storage team unmapped and re-mapped a few RDM LUNs to the ESXi host group (from the storage array side), and the corresponding RDM disks attached to the VMs disappeared.
We had already rescanned the hosts for the storage change and the LUNs were showing as mounted on all the hosts; after spending two hours with VMware support we also rebooted the host, but that didn't make any difference.

Finally, when we rebooted the cluster nodes, I found that this had something to do with consistent mapping of the RDM LUNs across the ESXi hosts on which the cluster nodes reside.

To check whether a LUN is consistently mapped on all ESXi hosts in the cluster, you need to look at the vml.id corresponding to the LUN's canonical name (naa.id).

You can check the vml.id corresponding to a naa.id by running the following command on the host (over SSH, for example using PuTTY):
esxcli storage core device list -d naa.id

So, if the naa.id is naa.60060480000190104063533030353445, the command would be:

esxcli storage core device list -d naa.60060480000190104063533030353445
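If you just want to pull the vml.id out of that output, you can filter it with grep (on the ESXi builds I've worked with, the vml.id appears under the "Other UIDs" field of the device listing):

esxcli storage core device list -d naa.60060480000190104063533030353445 | grep -i vml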


The output includes the corresponding vml.id, for example: vml.02000500006006048000019010406353303035344553594d4d4554

Look at the fifth and sixth digits of the vml.id (in the example above, "05"); this is a hexadecimal number that represents the LUN number. Converted to decimal, it should match the actual LUN number presented by the array, and it should be the same on every host in the cluster.
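If you want to do the hex-to-decimal conversion right in the ESXi shell, printf should handle it; the "05" below is just the value taken from the example vml.id above, so substitute the digits from your own LUN:

printf "%d\n" 0x05

This prints 5, which should match the LUN number presented by the array.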

To fix this issue, remove the affected RDM disk from both nodes and then delete the RDM pointer file from the datastore (this doesn't affect your actual data on the LUN). After rescanning the hosts for datastores, re-add the LUN as an RDM disk on both nodes. You should then be able to power on the affected node.
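As a rough sketch of that from the ESXi shell (the datastore, VM folder, and vmdk names below are only placeholders for your environment, and the disk should already be removed from both VMs first), the RDM pointer file can be deleted with vmkfstools and the host then rescanned:

vmkfstools -U /vmfs/volumes/Datastore1/ClusterNode1/ClusterNode1_1.vmdk
esxcli storage core adapter rescan --all

Deleting the pointer vmdk only removes the mapping file on the datastore; the data on the underlying LUN is untouched.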

If for any reason the above doesn't work, then after removing the affected RDM disks from both nodes, follow these steps (a consolidated command sketch follows below):
  1. Note the NAA ID of the LUN.
  2. Detach the RDM using the vSphere Client.
  3. Un-present the LUN from the host on the storage array.
  4. Rescan the host storage.
  5. Remove the LUN from the detached list using these commands:

    #esxcli storage core device detached list
    #esxcli storage core device detached remove -d naa.id
  6. Rescan the host storage.
  7. Re-present the LUN to the host.
  8. Rescan the hosts for datastores once again.
Now cross-check the vml.id on the hosts; it should be the same on each host, and after re-adding the RDM disk on the nodes you will be able to power on the VM nodes.
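For reference, here is roughly what the host-side part of those steps looks like as commands (the naa.id is the example one from earlier; the un-present and re-present steps are done on the storage array, not on the host):

# esxcli storage core adapter rescan --all
# esxcli storage core device detached list
# esxcli storage core device detached remove -d naa.60060480000190104063533030353445
# esxcli storage core adapter rescan --all
# esxcli storage core device list -d naa.60060480000190104063533030353445 | grep -i vml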

Note: If the LUN has been flagged as perennially reserved, this can prevent the removal from succeeding, and step 5 will fail.

Run this command to remove the flag:

#esxcli storage core device setconfig -d naa.id --perennially-reserved=false

Now the command to remove the device should work.

# esxcli storage core device detached remove -d naa.id
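You can confirm whether the flag is actually set (or cleared) by looking at the device listing; on the builds I've seen, the relevant field is "Is Perennially Reserved":

# esxcli storage core device list -d naa.id | grep -i "perennially"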


I faced a related issue in the past and discussed it in the following post:

After unexpected host reboot, powering on an RDM-attached virtual machine fails with the error: Incompatible device backing specified for device '0'

That's it... :)