Saturday, January 30, 2016

vCenter Server shows ESXi host as not responding

In our environment we have two or three hosts located at different sites, and every few days one or another of them was listed in the vCenter inventory as "not responding", with the VMs running on that host shown as disconnected. (These hosts run ESXi 5.5.)
As a first troubleshooting step, I pinged the host and accessed the VMs, both successfully. I then connected to the host over SSH using PuTTY, restarted the management agents, and waited for the host to start responding in the vCenter console, but that didn't happen.
Next I tried to reconnect the host to vCenter but ended up with the error "Cannot contact the specified host (EsxiHost0xxxx). The host may not be available on the network, a network configuration problem may exist, or the management service on this host may not be responding". When I connected to the host directly using the vSphere Client, it worked without any issue.
Since only the two or three remote hosts had this problem and no other host did, the issue was not related to firewall or port blocking on the vCenter Server.
On checking the vpxd logs, I found a few missed-heartbeat entries for the affected hosts.
As the vpxd log shows, the issue is related to vCenter-to-host connectivity, which could be due to a congested network. As a workaround, we can increase the host-to-vCenter heartbeat timeout from 60 seconds to 120 seconds (by default an ESXi host sends a heartbeat to vCenter every 10 seconds, and vCenter has a 60-second window to receive it). Remember that increasing the timeout is a short-term solution until the network issues can be resolved.

To do so using the vSphere Client:

Connect to vCenter and go to Administration => vCenter Server Settings => Advanced Settings
Now in the Key field, type: config.vpxd.heartbeat.notRespondingTimeout
In the Value field, type: 120
Click Add and then OK.
Restart the VMware vCenter Server service for changes to take effect.

Using the vSphere Web Client:
Connect to vCenter Server using the vSphere Web Client and navigate to the vCenter Server instance
Select the Manage tab, then select Advanced Settings and click Edit; this opens a new window
Now in the Key field, type: config.vpxd.heartbeat.notRespondingTimeout
In the Value field, type: 120
Click Add and then OK.
Restart the VMware vCenter Server service for changes to take effect.
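
As a rough illustration of why doubling the timeout helps on a congested link, the sketch below counts how many consecutive 10-second heartbeats can be lost before vCenter's window expires. The function name is hypothetical, and the arithmetic is a simplification of the figures quoted above, not a description of how vpxd tracks heartbeats internally.

```python
HEARTBEAT_INTERVAL_S = 10  # default interval between host heartbeats (from the post)


def heartbeats_per_window(timeout_s: int, interval_s: int = HEARTBEAT_INTERVAL_S) -> int:
    """Number of heartbeats that are expected to arrive within one timeout window.

    If all of them are lost in a row, the host is flagged as not responding.
    """
    return timeout_s // interval_s


default_window = heartbeats_per_window(60)    # default notRespondingTimeout
extended_window = heartbeats_per_window(120)  # value suggested in this post

print(f"60 s window:  {default_window} consecutive heartbeats must be lost")
print(f"120 s window: {extended_window} consecutive heartbeats must be lost")
```

With the longer window, a burst of packet loss has to last roughly twice as long before the host is flagged.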

Reference: Related KB#1005757

That’s It… :)

Thursday, January 21, 2016

Newly presented LUNs are not visible on ESXi host

Today, while cold-migrating two MSCS cluster VMs from one host to another, I found 5 LUNs missing on the target host, so I asked the storage admin to present these LUNs to it. Once the storage admin confirmed this was done, we rescanned the storage/HBA adapters for datastores/LUNs. After the rescan I was surprised to see that only two of the new LUNs were visible on one host, while all five were visible on another. I checked with the storage team; they confirmed that everything was fine on their end and said I might need to restart the host to make the LUNs visible.

After evacuating all the VMs, I rebooted the host, but even after the reboot the LUNs were unavailable. On further investigation, the issue turned out to be related to the maximum number of storage paths: this host had already reached the limit, which is why the newly mapped LUNs were not visible.
Note: Local storage, including CD-ROMs, is counted in your total paths.

You can see how many paths are in use on a specific host by selecting the host and going to Configuration => Storage Adapters => Storage Adapter

As most of us would be aware, the storage path limit per host as of vSphere 5.x is 1024 and the maximum number of LUNs per host is 256 (refer to the configuration maximums). This host was already using the maximum supported number of paths (552 + 471 + 1 = 1024), so it was unable to add new LUNs/paths.

Fix: To fix this issue, ask the storage team to reduce the number of paths per LUN so that the total stays within the 1024-path limit, or reduce the number of LUNs presented to the host.
Note: Put the host in maintenance mode during the storage path correction and, once done, rescan the host/storage adapters for datastores/LUNs.
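
The path arithmetic above can be sketched as a quick budget check before asking for new LUNs. The per-adapter counts are the ones from this host; the function names and the figure of 4 paths per new LUN are assumptions for illustration only.

```python
MAX_PATHS_PER_HOST = 1024  # vSphere 5.x per-host path limit cited in the post


def paths_remaining(paths_per_adapter):
    """Paths still available on the host, given the paths in use per adapter."""
    return MAX_PATHS_PER_HOST - sum(paths_per_adapter)


def can_present(new_luns, paths_per_lun, paths_per_adapter):
    """True if the new LUNs fit within the remaining path budget."""
    return new_luns * paths_per_lun <= paths_remaining(paths_per_adapter)


# Two HBAs plus one local device, as counted on the affected host.
in_use = [552, 471, 1]

print("Paths remaining:", paths_remaining(in_use))
# Assumed: each of the 5 new LUNs is reachable over 4 paths.
print("Room for 5 new LUNs:", can_present(5, 4, in_use))
```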

That’s it... :)

Tuesday, January 19, 2016

How to generate ADU report on HP ProLiant server running ESXi 5.x or 6.0

As most of us would be aware, whenever we call HP support about drive, array controller, battery, or similar issues, the first thing they ask for is an Array Diagnostic Utility (ADU) report. HP offers customized installation ISOs for ESXi 5.x and 6.0 that include drivers for HP-specific hardware, HP's Offline Bundles for hardware monitoring, and useful tools like hpssacli/hpacucli for managing Smart Array adapters. The HP image is typically a little behind the current stable version from VMware, so you might instead download the latest stable ISO from VMware and use it to install ESXi on the host.

If you used the HP-provided ESXi image, the HP drivers and other management utilities are already there. If you didn't, download and install the respective HP ESXi Utilities Offline Bundle for ESXi 5.1 or ESXi 5.5.
To generate the ADU report you need one of the HP Array Controller CLIs: HPACUCLI (not supported on HP ProLiant G9 servers) or HPSSACLI (*this utility replaces HPACUCLI in version 5.5 and later).

We can use HPSSACLI the same way as HPACUCLI to get detailed configuration information for the Smart Array controller. As per the *HP VMware Utilities User Guide, "The HPSSACLI utility supports HP ProLiant 300/500/700 and Blade servers with integrated SmartArray controllers and option controllers. The HPSSACLI application contains the ability to generate a diagnostic report of the system and its Smart Array storage configuration".
To generate and save the ADU report, follow these steps (the syntax is the same for both commands, hpssacli/hpacucli):
1.  Connect to the host from the console or over SSH using PuTTY, and browse to the /opt/hp directory (***just to see which HP tools are available there; otherwise you can run the command directly)
2.  Now run the command below, which shows the available controllers:
/opt/hp # esxcli hpssacli cmd -q "controller all show status"
3.  Now run the following command to generate an array configuration report:
/opt/hp # esxcli hpssacli cmd -q "controller slot=0 show config detail"

To save the report, redirect the output to a file using >>:

/opt/hp # esxcli hpssacli cmd -q "controller slot=0 show config detail" >> /tmp/ADUreport.txt

This report shows most of the array controller information, and you can download it and submit it to HP support as an alternative to the Array Diagnostic Utility report (I didn't have hpacucli installed, so this is what one of the HP TSEs suggested).

To generate a detailed ADU report using HPACUCLI: If hpacucli is not already installed on the host, first download the hpacucli offline bundle and upload it to /tmp or any other directory (using a file transfer tool like WinSCP). I downloaded hpacucli-9.40-12.0.vib and uploaded it to the /tmp directory. To install it, run the following command:

#esxcli software vib install -f -v /tmp/hpacucli-9.40-12.0.vib

Once it is installed, go to its bin directory and launch the utility:

/opt/hp/hpacucli/bin # ./hpacucli

Now execute the following command to save the ADU report to the /tmp directory:

#ctrl all diag file=/tmp/ ris=on xml=on zip=on
This will generate and save the ADU report to the /tmp directory. Now you can download the report and send it to HP support for further analysis.

Monday, January 18, 2016

VM not accessible/lost network connectivity after reboot

This past week, as part of some activity, we powered off some virtual machines, and when we powered them on again after some time, I was surprised that the network would not come up for two of the VMs. I tried to ping them, but they were not reachable, so I logged in to one of them via the VM console to check the IP configuration. The network card was showing limited connectivity (yellow sign on the NIC icon), although the IP information was correct, and pinging from inside the VM did not succeed either. I also rebooted the VMs again, but that didn't fix the connectivity issue.

I remembered that a few months back we had faced a similar issue; at that time, however, the default gateway was not present and ipconfig was showing an APIPA address like 169.254.x.x. To fix that issue we had to reconnect the VM's network card, so I tried the same here, and it worked.

To fix the issue, select the affected VM and go to Edit VM Settings => select the vNIC adapter => deselect Connected => click OK to apply the setting.

Now navigate again to Edit VM Settings => select the vNIC adapter => select Connected => click OK to apply the setting.
Once the settings were applied, the VM came back on the network.

On the other VM, just to test whether a cold reboot would work, I powered the VM off and, once it was off, powered it on again, and voila, the VM was accessible again.

Thus we can fix this issue either by a cold reboot or by reconnecting the virtual network card.

Update: Today I came across the same issue again. This time it was a VM running MS Server 2012 R2, and a cold reboot didn't fix the issue. One more thing: sometimes you may need to repeat the vNIC disconnect/reconnect process to fix the issue.

Update2_25/02/2016: Sometimes the above won't work at all. In that case, log in to the affected server via the VM console => go to the network card properties and disable, then re-enable, the network card from inside the OS; hopefully the server will come back on the network.
The other thing you can do is change the IP assignment from static to dynamic so that it picks up an APIPA address, then change it back to static; it should work now.

Note: For further detail about the issue, you may refer to the related VMware KB#2012646.

That’s it... :)

Saturday, January 16, 2016

ESXi host stuck "in progress" when exiting maintenance mode

This was the first time I came across an issue where exiting maintenance mode was taking a long time and appeared to be stuck at 15% (I waited for at least 20 minutes). To see what was going on, I connected to the host using PuTTY and checked its maintenance status using the vim-cmd command:
#vim-cmd hostsvc/hostsummary | grep inMaintenanceMode
I was amazed by the output: it clearly showed that the host had exited maintenance mode, while the GUI was still showing the task in progress.
I thought the vSphere Client might not have refreshed the task status, so I closed the connection and reconnected to vCenter. This time there was some progress, but it was still taking too long to exit maintenance mode.
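
If you want to script the same check rather than eyeball the grep output, a minimal sketch follows. It only parses text; the sample string is an illustrative assumption about what the relevant hostsummary line looks like, not captured output, and the function name is hypothetical.

```python
import re


def in_maintenance_mode(summary_text):
    """Return True/False from an 'inMaintenanceMode = ...' line, or None if absent."""
    match = re.search(r"inMaintenanceMode\s*=\s*(true|false)", summary_text)
    if match is None:
        return None
    return match.group(1) == "true"


# Assumed shape of the relevant part of the host summary (illustrative only).
SAMPLE = """
   runtime = (vim.host.Summary.RuntimeInfo) {
      inMaintenanceMode = false,
   },
"""

print("Host reports maintenance mode:", in_maintenance_mode(SAMPLE))
```

Comparing this value with what the client shows makes it obvious when the GUI task status is stale rather than the host itself being stuck.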

Whenever we see this kind of unusual issue, we look at restarting the host management agents.
I have had a bad experience in the past with restarting all the agents at once (it takes a long time to complete), so I prefer to restart the host and vCenter agents individually using the commands below:

#/etc/init.d/hostd restart
#/etc/init.d/vpxa restart

Coming back to the point: this fixed the issue, but the host then showed that the HA agent had not been installed correctly. So I put the host back into maintenance mode and, once that task completed, exited maintenance mode again; this time there was no issue.

That’s it… :)

Friday, January 15, 2016

How to backup/restore ESXi host configuration

An ESXi host configuration backup is useful in cases such as a crashed ESXi host or a failed system boot disk, when you want to restore the host quickly. You might ask why back up at all if reinstalling ESXi takes just a few minutes. But it is not only the installation: there are also all the configuration files for virtual switches, shared storage (datastore configuration), multipathing, local users and groups, and licensing information.

We can use one of these methods to back up the ESXi configuration:
  1. From the local or remote console using vim-cmd
  2. From vCLI using vicfg-cfgbackup
  3. From PowerCLI using the Get-VMHostFirmware cmdlet
The last two commands, along with others that perform "write" operations, are only supported with a licensed version of ESXi. With the free version, the remote commands are available for "read-only" operations only.

ESXi configuration backup from local or remote console using vim-cmd
Backup: Before backing up your ESXi host configuration, run the following command to synchronize the changed configuration with persistent storage:
vim-cmd hostsvc/firmware/sync_config
To back up the ESXi host configuration, run the following command:
vim-cmd hostsvc/firmware/backup_config
The above command creates the backup file in the /scratch/downloads directory as configBundle-HostFQDN.tgz and outputs a URL; you can download the backup file from this URL (replace the * with the host FQDN).
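
The "replace the * with the host FQDN" step is easy to get wrong when done by hand, so here is a small sketch of it. The URL shape and the host name esxi01.example.com are assumptions for illustration; use the URL your host actually prints.

```python
def backup_download_url(reported_url: str, host_fqdn: str) -> str:
    """Substitute the host FQDN for the '*' placeholder in the reported URL."""
    if "*" not in reported_url:
        raise ValueError("no '*' placeholder found in the reported URL")
    # Only the first '*' is the host placeholder.
    return reported_url.replace("*", host_fqdn, 1)


# Assumed example of a URL as reported by backup_config (hypothetical host name).
reported = "http://*/downloads/configBundle-esxi01.example.com.tgz"

print(backup_download_url(reported, "esxi01.example.com"))
```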
Restore: Before restoring your ESXi host configuration, make sure the host is in maintenance mode by running the following command:
vim-cmd hostsvc/maintenance_mode_enter
To restore the ESXi host configuration, copy the backup file to a location accessible by the host and run the following command:
vim-cmd hostsvc/firmware/restore_config /tmp/configBundle.tgz
In this case, the configuration file was copied to the host's /tmp directory.
Note: Upon completing the restore, the ESXi host reboots automatically.

ESXi configuration backup using vSphere PowerCLI:
To back up the configuration data for an ESXi host using the vSphere PowerCLI, run this command:
Get-VMHostFirmware -VMHost ESXi_host_IP_address -BackupConfiguration -DestinationPath output_directory
Note: A backup file is saved in the directory specified with the -DestinationPath option.

Restore: When restoring configuration data, the build number of the host must match the build number of the host that created the backup file. Use the -Force option to override this requirement.

Before proceeding with restore, make sure the host is in maintenance mode.

Now restore the configuration from the backup bundle by running the command:

Set-VMHostFirmware -VMHost ESXi_host_IP_address -Restore -SourcePath backup_file -HostUser root -HostPassword root_password
example: Set-VMHostFirmware -VMHost -Restore -SourcePath F:\configBundle-HostFQDN.tgz -HostUser root -HostPassword RootPassword

ESXi configuration backup using the vSphere CLI:
To back up your ESXi server you need to install vCLI on Windows or Linux, or use vMA.
Using vSphere CLI for Windows, run this command: vicfg-cfgbackup.pl --server=ESXi_host_IP_address --username=root --password=root_pwd -s output_file_name

Note: From vSphere CLI for Windows, ensure you are executing the command from C:\Program Files\VMware\VMware vSphere CLI\bin

ex: vicfg-cfgbackup.pl --server= --username=root --password=root_pwd -s Lab_Esxi_backup

A backup text file is saved in the current working directory where you run the vicfg-cfgbackup script. You can also specify a full output path for the file.

Restore: When restoring configuration data, the build number of the host must match the build number of the host that created the backup file or use -f option (force) to override this requirement.

To restore the configuration data for an ESXi host using the vSphere CLI for Windows:

1. Before proceeding with the restore, make sure the host is in maintenance mode
2. Now run the vicfg-cfgbackup script with the -l flag to load the host configuration from the specified backup file: vicfg-cfgbackup.pl --server=ESXi_host_IP_address --username=root --password=root_pwd -l backup_file
ex: vicfg-cfgbackup.pl --server= --username=root --password=root_pwd -l Lab_Esxi_backup.txt
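
The build-number rule for restores can be expressed as a simple pre-check. This is only a sketch of the rule as stated above; the function name and the build numbers are hypothetical placeholders, and it does not call any VMware tool.

```python
def restore_allowed(host_build: str, backup_build: str, force: bool = False) -> bool:
    """A restore proceeds when the builds match, or when force is requested."""
    return force or host_build == backup_build


# Placeholder build numbers, for illustration only.
print(restore_allowed("1000001", "1000001"))              # same build
print(restore_allowed("1000002", "1000001"))              # different build, no force
print(restore_allowed("1000002", "1000001", force=True))  # different build, forced
```

Checking this before entering maintenance mode avoids discovering the mismatch halfway through the restore.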

Reference: KB#2042141

Note: Free tools such as SLYM Software's vSphere Configuration Backup are also available for ESXi backup/restore, but as it is a third-party tool, it is OK to use in a lab; in a production environment I would avoid it.

That's it... :)

Thursday, January 7, 2016

How to disable SSH or ESXi Shell warning

Today when I tried to connect to a host using PuTTY, I found that SSH was not enabled on it, so I first enabled SSH/remote shell access. As expected, as soon as I clicked OK, an exclamation mark appeared on the host along with a warning in the host summary.
I had seen and disabled this warning many times before, but it still took me some time to recall the correct option in the host's advanced settings, so I thought of making a note of it.

To suppress this warning from the GUI:

Go to the host's Advanced Settings,
In Advanced Settings, first navigate to UserVars and then to UserVars.SuppressShellWarning
Here change the value of UserVars.SuppressShellWarning from 0 (warning enabled) to 1 (warning suppressed).

Click OK and you are done; the annoying warning will go away.

If you know the command, you can disable the warning right from the SSH console.
To do so:

Connect to the ESXi host through SSH using root credentials and run this command:
vim-cmd hostsvc/advopt/update UserVars.SuppressShellWarning long 1

If you want to enable the warning again, replace 1 with 0:

vim-cmd hostsvc/advopt/update UserVars.SuppressShellWarning long 0

Note: As the setting's description says, this option disables the warning for both local and remote shell access, so the same change also covers the ESXi Shell warning.

Reference: KB# 2003637

That's it... :)

Wednesday, January 6, 2016

VM not accessible, you might also need to check datastore for space

Last weekend I got a support call from the database team asking why they had lost connectivity to a database server VM. First I tried to ping the VM, but it was not reachable, so I logged in to vCenter to see what was wrong. The first thing I noticed was that the VM was powered on but had a message sign on it, and when I opened the VM console to investigate further, I got a popup question about datastore space.
I clicked Retry, but the popup came up again.

As the popup question suggested, there was not enough space in the datastore for the VM to breathe, because the VM was running on a snapshot (******-000001.vmdk).
You would see the same question on the VM summary page as well.

When I checked the datastores where the VM disks reside, I was amazed to see how full they were.
Here one may ask why this happened: don't you have a Storage Cluster/SDRS? The answer is no, we don't (that's a different story). Thin provisioning is not the cause here either; the snapshot of this large VM is (the VM has disks of one TB or more attached). The snapshot was created by the VM backup tool during backup, but at the same time there was some activity on the server, so the snapshot grew unexpectedly, ate all the available datastore space, and caused this issue.

(If you are wondering why the hell we take image-level backups of database VM drives, please don't bother to ask me; I couldn't find the logic in that either.)

So to fix this, check all the datastores where the VM disks reside (in the VM summary you will see the datastore in question flagged with a space error/warning alert), free up space for the VM to breathe, and once you are done, go to the VM summary,
select the Retry option in the VM question, and click OK.

The VM should be accessible now. (You might also take a look at the backup server to see whether the backup completed. If it did but the snapshot is still there, delete the snapshot; if the backup job is still running, no worries, the snapshot will be deleted automatically once the backup completes.)

Note: You may also see an MKS error when opening the VM console (like the vmx file not being accessible or unable to open) due to the space crunch in the datastore.

That's it... :)

Friday, January 1, 2016

How to add RDM to Microsoft Cluster nodes without downtime

This is something we do once every few months, so you might forget the process and run into errors a few times before recalling the right one. At least this has happened to me more than twice, so I thought of making a note of the process of adding RDMs to MSCS cluster nodes that are already up and running.

Adding an RDM to Microsoft Cluster nodes is a little different from adding an RDM LUN to an independent virtual machine.
The first part is the same in both cases: you can add the RDM disk while the VM is powered on. However, in the case of an MSCS node, you need to power off that VM before initiating the same on the secondary node.

Open VM Settings => click Add Hardware, select Hard Disk => select Raw Device Mapping as the disk type => on this screen you select the intended LUN,
On the next screen select where you want to store the RDM pointer files, Next => on this screen you select the RDM compatibility mode, which can be Physical or Virtual; select Physical
On the next screen you select the controller => then you see the summary; click Finish and you are done.

Go again to the first node's VM settings, select the newly added RDM drive, and copy/write down the path of the RDM pointer file; do the same for any other RDM...

Now power off the node if it is not already powered off (if this was the primary node, cluster resources will automatically move to another available node; alternatively, move the resources to the secondary node manually and then power off the VM).

If you don't power off the first node and try to add the RDM to the secondary VM, you will get an error.
And if you power off the secondary node instead, you will be able to add the RDMs to it, but when you try to power it on, you will end up with another error.
So if you are not already aware, you may wonder what the right way to do this is. Here it is:

Power off the node where you have already added the RDM LUNs (if not already off), and now add the RDMs to the secondary node while it is online as the active cluster node.

Open VM Settings => click Add Hardware, select Hard Disk => on this screen select the option to add an existing disk
On the next screen provide/browse to the path of the RDM pointer file noted earlier,

On the next screen you select the controller => then you see the summary; click Finish and you are done.

That's it... :)