Blog: ESX

Recently, quiesced snapshot jobs for some customers kept failing in Veeam with the error "msg.snapshot.error-quiescingerror".

After exhausting several research options, I called VMware Support, and we began sifting through the event logs on the server, checking the VSS writers, and reviewing how VMware Tools was installed.

Looking at the log files on the ESX host where the VM resided led me to this article:

A folder named backupScripts.d gets created that references the path C:\Scripts\PrepostSnap\, which is empty, so the quiesce operation fails. The fix is found below:

  1. Log in to the Windows virtual machine that is experiencing the issue.
  2. Navigate to C:\Program Files\VMware\VMware Tools.
  3. Rename the backupScripts.d folder to backupScripts.d.old (a sample command follows this list).
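From an elevated command prompt inside the guest, the rename looks like this (the path assumes a default VMware Tools install; the rename can also be done in Explorer):

    ren "C:\Program Files\VMware\VMware Tools\backupScripts.d" backupScripts.d.old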

If that folder is not present, or if the job still fails, check the VMware Tools services and the VSS writers next.


I came across an issue where two ESX servers that had been running for approximately 8-9 months without a reboot suddenly showed an offline status in vCenter. Looking at the events in vCenter showed that the ramdisk 'TMP' was full and the host could not write to the file /tmp/.SapInfoSysSwap.lock.LOCK.#####.


I consoled into the ESX hosts and saw that a log file at /tmp/mili2d.log had consumed most of the space. From what I read, this file would have been removed upon rebooting the ESX host, but that was not something I wanted to do if I could help it.


I reviewed the log file and determined there was nothing of significance inside; it had simply been filling up for months until it hit the limit on both hosts. I deleted the file expecting to reclaim the storage space, but the space was not freed, presumably because a process still held the file open.


You can check the space allocation with the command "vdf -h", which shows the space left on each ramdisk.


In order to get the ESX host to release the space and rescan the ramdisk, restart the management services with "services.sh restart". After I did this, the space showed as available, and the ESX hosts came back online in vCenter without having to reboot the servers.
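In hindsight, the whole recovery can be done from the console in a few commands. This is a sketch; truncating the log with ">" frees the space immediately even if a process still holds the file open, which deleting it does not:

    vdf -h                 # check ramdisk usage; /tmp shows as full
    > /tmp/mili2d.log      # truncate the log in place
    services.sh restart    # restart the host's management services
    vdf -h                 # confirm the space shows as available again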


I was updating ESX with a customer a few weeks ago and ran into issues. We successfully upgraded from ESXi 5.1 to 5.5 Update 3 using the custom Dell ISO. We then attempted to update to the latest version of ESXi 5.5, but the host purple-screened upon reboot. We called VMware Support to create a trouble ticket, and the engineer provided a simple solution: press Shift+R when the hypervisor progress bar starts loading, which brings up a menu where you can select the previous build. The VMware article can be found here: We followed these instructions and were able to successfully boot the ESX host again.


I believe what caused the purple screen was that vSphere Update Manager tried to install HP updates on Dell hardware. It turns out that vSphere Update Manager does not detect which updates are actually needed, just which ones aren't installed. The fix for this is to create separate baselines for each brand of hardware in mixed-hardware environments.



I was recently working on a project to migrate a customer from a physical server to new virtual servers on a new ESX host. I installed ESXi 6.0 Update 2 on the new physical server and delivered it to the customer site. After the server was onsite, I began building my first virtual machine. Since it was the first virtual machine and vCenter was not installed yet, I downloaded the VI client and connected directly to the host.

While creating the first VM, I received the following warning:

"If you use this client to create a VM with this version, the VM will not have the new features and controllers in this hardware version. If you want this VM to have the full hardware features of this verison, use the vSphere Web Client to create it."

According to the warning message, I needed to use the vSphere Web Client to create a VM with the latest full hardware feature set. The vSphere Web Client is part of vCenter, so I didn't see how this was possible, because vCenter was not installed yet. VMware has been planning to retire the VI client in favor of the web client, so I figured this was just a push in that direction. Obviously, this doesn't work well for customers who are just building their first virtual servers. I didn't need the new hardware features, so I just picked Virtual Machine Version: 11 and continued building the VM.

A few days later I was curious as to what the warning message meant and decided to do some more investigation. It turns out that with ESXi 6.0 Update 2, VMware started embedding a new VMware Embedded Host Client (EHC) in ESXi. This new Embedded Host Client is an HTML5-based tool for managing the ESXi host directly and is a replacement for the VI client. This is nice because nothing needs to be downloaded or installed to manage the ESXi host using the EHC.


Knowing that the EHC exists, I now understand what the warning message in the VI client was saying. It was not necessarily saying I had to use the vSphere Web Client (which requires vCenter), but rather that I could connect directly to the ESXi host using the Embedded Host Client.

The VMware Embedded Host Client can be accessed by going to https://IPAddressOfESXiHost/ui. More information on the VMware Embedded Host Client can be found here:




I recently updated a standalone ESXi 5.5 server through command-line patching. After the ESXi server rebooted and came back online, it showed no datastores and no access to the virtual machine disks.

I found a post about ESXi 6 updates causing a similar issue when the HP Storage Array drivers had been removed during the update process. Since I still had my update logs pulled up in a console window, I was able to locate a line that said "VIBs Removed: Hewlett-Packard bootbank scsi-hpsa <version>".

I was able to find a link to download the drivers and transferred them to the ESXi server's /tmp directory:

The command to install the patch was:

"esxcli software vib install -d /vmfs/volumes/datastore1/hpsa-<version>-offline_bundle-<number>.zip"

After a reboot, I had access to the datastore again and averted potential disaster!
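If you run into something similar, a quick way to confirm whether the storage driver VIB is present (before or after the fix) is:

    esxcli software vib list | grep -i hpsa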



I recently needed to move several VMDK files from a VMware datastore that had filled up due to an old snapshot. To move the first VMDK, I used SSH to connect to the vSphere host, browsed to the datastore, and entered:

"cp –R /source/directory/ /dest/directory/"

to recursively copy the VMDK and snapshots to the new datastore. Because of the size of this VMDK, the copy command took just over 24 hours to finish. Once it completed, I unfortunately found that not only had the VMDK been converted from thin provisioned to thick, but the snapshots had also ballooned to the size of the thick base disk.
It turns out that vSphere provides a much better way to copy VMDKs that will not only retain thin provisioning, but will also merge snapshots while copying. I used a command similar to the following to clone a VMDK:
vmkfstools -i "/vmfs/volumes/Datastore/examplevm/examplevm-000001.vmdk" "/vmfs/volumes/Datastore 2/newexamplevm/newexamplevm.vmdk" -d thin -a buslogic
The ‘-i’ flag tells vmkfstools that we want to clone the drive, the ‘-d’ flag specifies the disk type and the ‘-a’ flag specifies the storage adapter type (in this case SCSI with the BusLogic controller).
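To confirm the clone stayed thin provisioned, you can compare the flat file's apparent size to the blocks actually allocated on the datastore (paths follow the example above):

    # Provisioned (apparent) size of the disk
    ls -lh "/vmfs/volumes/Datastore 2/newexamplevm/newexamplevm-flat.vmdk"
    # Space actually consumed on the datastore
    du -h "/vmfs/volumes/Datastore 2/newexamplevm/newexamplevm-flat.vmdk"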
VMware has a KB on cloning VMDKs with vmkfstools, available here.


After we completed a customer's upgrade to ESXi 5.5.3, their Veeam jobs started failing with an error message stating the files for the virtual machines did not exist or were locked. Since the VMs were migrated to a new ESX host as part of the upgrade, I thought the old hosts might have put a lock on some of the VM files, so I shut them down. After they were shut down, the jobs still failed, but the error message changed, saying the backups failed because an NFC storage connection was not available.

Research of this error led me to an article which directed me to some backup log files. In these logs, I kept finding entries indicating Veeam was trying to establish a connection with the SSL server but failing during the SSLv3 handshake, since ESXi 5.5.3 disables SSLv3 due to vulnerabilities in the protocol.
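You can verify this behavior from any machine with OpenSSL (assuming an older OpenSSL build that still includes SSLv3 support; the hostname is a placeholder):

    openssl s_client -connect esxihost:443 -tls1   # TLS handshake succeeds
    openssl s_client -connect esxihost:443 -ssl3   # rejected once SSLv3 is disabled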

Some more research led me to another Veeam KB article stating that this was a known bug in Veeam 7.0. The article says, "Veeam Backup & Replication is designed to use TLS or SSL, however a bug in parsing the list of supported SSL/TLS protocol versions within Veeam Backup & Replication when communicating with VMware causes the job to fail without attempting to use TLS," and the solution is to upgrade to Veeam 8 Update 3. Since this customer's Veeam renewal was coming up, I went ahead and upgraded them to Veeam 9 and, after doing so, their backups started running without any issues.


On any VMware virtual machine running Windows 2008 or 2008 R2 that was created on ESX 4.1, the advanced configuration parameter disk.EnableUUID is set to TRUE. Basically, this enables application-level quiescence in the VM. If the VM was created on ESX prior to 4.1, the setting does not exist at all. So, if you want application consistency on a VADP (vStorage API) initiated backup, it won't happen if that setting isn't set to TRUE. This is a problem because a number of vendors (CommVault included) don't support this feature yet, and since it is the default for new VMs, those VMs won't back up correctly.
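To check a particular VM, look for the parameter in its .vmx file (or under the advanced configuration parameters in the client). The relevant line looks like this, and it can be added with the VM powered off if it's missing:

    disk.EnableUUID = "TRUE"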

The bottom line is... make absolutely sure you are getting application-consistent backups by checking the application logs on the VM when the backup runs. You may not be getting as consistent a backup as you think.


A coworker and I ran up against a very interesting situation at a virtualization consulting customer's site the other day. We got an after-hours call from the customer, who said he was working on the console of a new Windows 2008 virtual machine. He was trying to set the IP address on the NIC and accidentally chose the "bridge network adapters" setting. Afterwards, he was unable to reach anything on the internal network from this server, and several other VMs could not communicate with the internal network either. My coworker connected via VPN just fine but was unable to ping vmhost2. He could ping the SBS server, one terminal server, and the ISA server. We discussed over the phone that the particular ESX server those servers were on must have somehow gotten isolated from the network. Sure enough, when my coworker checked the NIC status on vmhost1, it showed that all NICs connected to the LAN network were disconnected. We decided to go onsite and check out what was going on. On the way out, I realized what had happened: when the two NICs got bridged on that VM, it created a loop, which looped a BPDU back to the switch and err-disabled the port. Once onsite, we confirmed that the port was down and that portfast was NOT enabled on it.

So, the warning here is twofold: yes, a VM can take down a whole ESX server. And second, it's best to turn on portfast for switch ports connected to ESX servers, since they don't participate in STP anyway.
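On a Cisco switch, that's a one-line addition per host-facing port (the interface name here is hypothetical):

    interface GigabitEthernet0/1
     spanning-tree portfast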


One of our customers had a problem with Platespin backing up a machine to their DR VMware server. It turns out that ESX (starting in 3.5, but possibly earlier builds with security patches) has a configuration file that can prevent virtual machines from booting if there is something in the virtual floppy or CD-ROM drive. The fix is to SSH to the ESX console and edit the configuration files with vi.
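For example, these .vmx entries ensure no media is connected at power-on (device names vary per VM, so treat this as an illustration):

    floppy0.startConnected = "FALSE"
    ide1:0.startConnected = "FALSE"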