Blog: VMware

Here are some items to consider when upgrading to vSphere/vCenter 4.1:

  1. vCenter 4.1 requires a 64-bit OS.
  2. Windows 2008 R2 is now officially supported with vSphere 4.1.
  3. When the installer asked which account to use for the service, the Local System option was greyed out.  I had to enter my current credentials, then go back after the installation completed and change the service account to Local System.
  4. The Update Manager can upgrade VM hosts.  I had to get the hosts up to version 4.0 before it would work, though.

 

I was setting up Backup Exec 12.5 to function as a VCB proxy to back up our VMs at the VMDK level and ran into a few problems. Version 12.5 has this functionality built in, so it was fairly simple to back up a VM from the SAN to the VCB proxy. Restoring it back to the vCenter cluster, on the other hand, was a different story. The first problem I ran into was in running a simple restore: the job would fail when it tried to convert the machine. Simple fix: install VMware Converter Standalone on the proxy.

Problem 2: The job would fail and give me a suggestion that I might try restoring the machine as a redirected restore job.

Problem 3: When I tried to set up the job for a redirected restore, I received “Access is denied.” when it attempted to connect to the vCenter and datastores.

Solution: UAC was causing the access to be denied. If I started Backup Exec as an administrator or disabled UAC on the machine, I was able to get access to the datastores and set up the redirected restore. From there, my restore jobs were successful. Now I did run into other slight problems with this restore, but I’ll save that for another time.


 

A customer who has two terminal servers (TS1 & TS2) that can be accessed using a shared name (TS) was unable to access them from their remote sites. I was able to access TS1 and TS2 from a remote server but not TS. I could also connect using the IP of each server but not the shared IP. What I found was that there was a static ARP entry on the main and backup router for TS. The MAC address on the ARP entry did not match the one on the servers. Both of the servers are virtual machines and this was caused by the ESX update and installation of the updated VMTools on the terminal servers the night before. The MAC addresses on the virtual NICs had changed. The ARP entry was removed and they could connect using the shared name.
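
The check that found this problem (comparing static ARP entries against the MAC addresses the servers actually have) can be sketched in a few lines of Python. This is only a minimal illustration: the function names and every IP and MAC address in it are hypothetical, and it assumes you have already copied the static entries from the router and the current MACs from the virtual NICs by hand.

```python
def normalize_mac(mac: str) -> str:
    """Reduce a MAC address to 12 lowercase hex digits so that
    0050.56aa.0001, 00:50:56:AA:00:01 and 00-50-56-aa-00-01 compare equal."""
    digits = "".join(ch for ch in mac.lower() if ch in "0123456789abcdef")
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits


def stale_static_arp(static_entries, current_macs):
    """Return (ip, configured_mac) pairs whose MAC matches no current NIC.

    static_entries: dict of IP -> MAC as configured on the router (hypothetical input)
    current_macs:   MACs currently reported by the servers' virtual NICs
    """
    known = {normalize_mac(m) for m in current_macs}
    return [(ip, mac) for ip, mac in static_entries.items()
            if normalize_mac(mac) not in known]


if __name__ == "__main__":
    # Hypothetical values: the shared TS address still points at an old MAC.
    static = {"192.0.2.10": "0050.56aa.0001"}
    nics_after_update = ["00:50:56:BB:00:01", "00:50:56:BB:00:02"]
    for ip, mac in stale_static_arp(static, nics_after_update):
        print(f"static ARP entry {ip} -> {mac} matches no current NIC")
```

Any entry it reports is a candidate for removal or correction on the router, which is what fixed the shared name here.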


 

One of our customers uses VMware VCB backups integrated with CommVault Simpana. The CommVault job simply calls a pre-backup script to snapshot the VM and copy all the VM files to the VCB proxy, backs up the files from the proxy to the CommVault media server, then a post-backup script commits the snapshot and purges the VM files from the VCB proxy.

Recently, we upgraded this customer from VMware VI3.5 to VMware vSphere v4 Update 2. For most of the VMs that are backed up with VCB, we had no issues at all. The backups ran the weekend following the upgrade with no issues. However, all of the VMs that had been secured with the Windows Security Configuration Wizard would not back up. These VMs are in the DMZ and are locked down very tight because they host externally available web applications. The issue is that each time a backup was initiated from CommVault, the VCB script would return a non-zero error due to a snapshot failure in VMware. VMware’s error was “Cannot create a quiesced snapshot because the create snapshot operation exceeded the time limit for holding off I/O in the frozen virtual machine.” This would happen when using VCB scripts, but I could create a snapshot without error from the VI client.

After much research and testing, I determined that the problem was a holdover from the VMTools upgrade. The new version of VMTools installs a new service called VMware Snapshot Provider. This service gets installed when VMTools is upgraded. Its purpose is to help facilitate application-consistent snapshots through the VMTools. On the servers that were getting the “quiesced snapshot error”, this service was not present at all, but VMTools had already been updated…very strange. Here is where the Security Configuration Wizard comes in. Part of our lockdown policy is to disable a service called COM+ System Application. This service manages the configuration and tracking of COM+ based components. Apparently, without this service enabled, the VMTools upgrade will NOT install the VMware Snapshot Provider service. Without the service, there are no quiesced snapshots, and you get errors when creating snapshots via the VCB integration modules.

So why could I create a snapshot from the VI client? Well, VMware knows that you are using VCB to create snapshots for the purpose of backup. What good would the backup be if it wasn’t app consistent? The VI client, on the other hand, will first try to create an app consistent snapshot, but if it fails or times out, it will go ahead and create the snapshot “crash consistent” without error. VCB is not as forgiving. If the guest quiesce fails, the snapshot fails…end of story. The solution was to uninstall the VMTools, reboot, temporarily enable and start the COM+ System Application service, install VMTools, then disable the COM+ System Application service. Since I did that, backups have been running fine.


 

I had a problem with VMware Workstation 7.0.1 this weekend. It is a known problem that corrupts the VMDK. This has happened to me a couple of times before, but in those cases I just reverted to a snapshot to fix it. This time it was too much work, so I did some research.

Turns out this has been fixed in 7.1.1 build-282343 and Fusion 3.1.1. If you are using Workstation 7 or Fusion 3, you should install the latest version to avoid this issue. In case you have the problem, the fix can be found at: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1023856


 

A couple of weeks ago, one of our customers had their Exchange SCR copy fail due to a corrupt log file. At first we assumed that the log file was corrupted during transit to the DR site, but after recopying the log file multiple times and attempting to restart replication, we realized the log file was actually corrupted on the source server, which is a virtual machine. I had never seen this happen before and was a little surprised that the corrupt log file had not taken the mailbox database offline. With nothing to attribute the corruption to, I decided it must have been a fluke and started a database reseed the following weekend. After 3 days, the database seeding finished, but 4 hours after the reseed completed, the SCR copy failed again…another corrupt log file.

I decided there must be a bigger issue. I reviewed the logs and found numerous Event ID 7 errors (bad block on disk) and a few pvscsi warnings. It seemed logical that the paravirtualized SCSI adapter used on this virtual machine might be causing the issue…maybe it was a weird PVSCSI / Windows 2008 server problem. I had to take a break from this issue to troubleshoot another server issue for the same customer. In doing so, I had an idea…what if the physical disk was going bad but hadn’t completely failed? Could that leave the underlying VMware VMFS partition looking fine but cause problems with virtual disk files attached to VMs? I used iLO to check the hardware status and, sure enough, one of the disks had encountered numerous SMART errors and was marked “impending failure”. The array was not degraded yet because the disk had not completely failed. I have replaced the disk and will reseed the database soon, but since the replacement there have been no bad block on disk errors on this VM, so it looks promising.


 

A coworker and I ran up against a very interesting situation at a virtualization consulting customer's site the other day. We got an after-hours call from the customer, who said he was working on the console of a new Windows 2008 virtual machine. He was trying to set the IP address on the NIC and accidentally chose the “bridge network adapters” setting. Afterwards, he was unable to get to anything in the internal network from this server, and several other VMs could not communicate with the internal network either. My coworker connected via VPN just fine, but was unable to ping vmhost2. He could ping the SBS server, one terminal server, and the ISA server. We discussed over the phone that the particular ESX server that those servers were on must have somehow gotten isolated from the network. Sure enough, when my coworker checked the NIC status on vmhost1, it showed that all NICs connected to the LAN network were disconnected. We decided to go onsite and check out what was going on. On the way out, I realized what had happened. When the two NICs got bridged on that VM, it created a loop and must have looped a BPDU and err-disabled the port. Once onsite we confirmed that the port was down and portfast was NOT enabled on that port.

So, the warning here is twofold. First, yes, a VM can take down the whole ESX server. Second, it’s best to turn on portfast for ports connected to ESX servers. They don’t understand STP anyway.
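
If you have a saved copy of the switch configuration, a quick audit for ports missing portfast might look like the Python sketch below. It is only an illustration under stated assumptions: it expects Cisco IOS-style text in which each `interface` block is followed by indented sub-commands, it treats any sub-command beginning with `spanning-tree portfast` as "enabled", and the function name and sample interface names are hypothetical.

```python
def interfaces_without_portfast(config_text):
    """Return the names of interface blocks that contain no
    'spanning-tree portfast' sub-command (IOS-style text assumed)."""
    missing = []
    current = None        # name of the interface block being read
    has_portfast = False  # whether the current block enables portfast

    def close_block():
        nonlocal current
        if current is not None and not has_portfast:
            missing.append(current)
        current = None

    for raw in config_text.splitlines():
        if raw.startswith("interface "):
            close_block()
            current = raw[len("interface "):].strip()
            has_portfast = False
        elif raw[:1].isspace():
            # Indented line: a sub-command of the current interface, if any.
            if current is not None and raw.strip().startswith("spanning-tree portfast"):
                has_portfast = True
        else:
            # Any other unindented line (such as "!") ends the block.
            close_block()
    close_block()
    return missing


if __name__ == "__main__":
    # Hypothetical sample; real interface names and descriptions will differ.
    sample = (
        "interface GigabitEthernet0/1\n"
        " description vmhost1 uplink\n"
        " switchport mode access\n"
        "!\n"
        "interface GigabitEthernet0/2\n"
        " description vmhost2 uplink\n"
        " spanning-tree portfast\n"
        "!\n"
    )
    print(interfaces_without_portfast(sample))
```

The result still needs a human eye: it lists every interface without the command, not just the ones cabled to ESX hosts.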


 

I ran into another notable gotcha working with VMware View v4. I set up Windows 7 virtual machines in linked clone pools, but I was not able to get dual-monitors to work using PCoIP. After several hours of very frustrating troubleshooting, it turns out that VMware has changed the type of display driver that is included with the VMTools install on Windows 7. Prior to Windows 7, VMware used an SVGA II driver for all Windows guest OSes. With Windows 7, they are now “experimenting” with a new WDDM (Windows Display Driver Model) driver. The default VMTools install for Windows 7 uses the WDDM driver instead of the SVGA II driver. Here are some notable limitations of the WDDM driver:

  • No support for OpenGL
  • No multimonitor support
  • VM may be slow to respond or resume
  • Overlay video acceleration is disabled (basically this means Flash acceleration and MMR are disabled)

I’m thinking this thing isn’t fully cooked… The original article I found on this had me extract the SVGA II adapter from Workstation 7, but it appears that newer versions of the VMTools actually include it at install time; it’s just not used. So, here are the instructions to revert to the SVGA II adapter so that stuff actually works!

  1. Open Device Manager from Control Panel
  2. Expand Display Adapters entry
  3. Right click on VMWare SVGA 3D (WDDM) and click properties
  4. Click on Uninstall Button
  5. Check the “Delete the driver software for this device” option
  6. Click OK
  7. Your screen may flicker as the driver is removed.  
  8. Close Device Manager and reboot Windows 7.
  9. Windows will default to the Standard VGA device
  10. Open Device Manager, expand Display Adapters
  11. Right Click Standard VGA and select Properties
  12. Click on Update Driver
  13. Click on Browse my Computer 
  14. Browse to directory C:\Program Files\Common Files\VMware\Drivers\video
  15. Click Next
  16. Confirm driver installation
  17. Close window and reboot

 

One of our customers had a problem with Platespin backing up a machine to their DR VMware server.  It turns out that ESX (starting in 3.5, but can include previous builds because of security patches) has a configuration file that can prevent virtual machines from booting if there is something in the virtual floppy or CD-ROM drive.  The fix is to connect to the ESX console over SSH and edit the configuration files with vi.

http://support.platespin.com/kb2/article.aspx?id=21110&query=ESX3TaskFailed


 

Last week I learned why VMware suggests having service consoles for ESX hosts on at least two distinct networks. I was troubleshooting intermittent backup issues with Veeam on a customer network and couldn’t really find any pattern to the failures. Two or three backups in a row would run successfully, then five in a row might fail. The behavior was very random. However, the failures were always on virtual machines associated with a specific ESX host. At first I thought the host was healthy, but after watching the VI client for an extended period of time, I noticed that the ESX host would drop offline (showing disconnected in the VI client) and then come back online again.  This indicated the problem wasn’t just affecting the management/backup server.

In order to level-set my troubleshooting efforts, I decided to reboot this ESX host. However, after the reboot, I could not connect to it with the VI client. I could ping the IP assigned to the service console, but couldn’t SSH or connect via the VI client. I logged in via iLO and found that an ifconfig at the command line returned IP = 0.0.0.0…interesting. So what was responding to my pings? I checked the ARP cache on one of the switches and found that a thin client had been plugged in that had the same IP as my LAN service console. What is really odd is that the MAC address for the thin client was all zeros AND the IP I was using for the LAN service console is not even available to be distributed by DHCP. I was not able to connect to the thin client to see how it was configured, but I was able to connect to the ESX host via a second service console port that I had placed on the iSCSI network. The management/backup server has a connection to the iSCSI network to do backups to disk, so I was able to change the LAN-facing service console IP to another IP and everything started working fine. The backup issue was obviously being caused by the ARP entry on the backup server flipping between the thin client and the ESX host. So, be aware that at boot time, if ESX determines that the IP it is using for a service console is already in use, it just rips it out of the configuration and continues to boot with NO WARNINGS or ERRORS on the console.
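
A duplicate IP like this one shows up as a single address answering from more than one MAC over time. The Python sketch below illustrates that check on a list of (IP, MAC) observations. It is a minimal example, not a monitoring tool: the function name and all addresses are hypothetical, and it assumes you collect the observations yourself, for instance by noting the ARP cache on a switch or server at intervals.

```python
from collections import defaultdict


def find_ip_conflicts(observations):
    """Given (ip, mac) pairs seen over time, return a dict of
    ip -> sorted list of MACs for every IP seen with more than one MAC."""
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


if __name__ == "__main__":
    # Hypothetical observations: the service console IP flips between the
    # ESX host's MAC and an all-zero MAC like the thin client's.
    seen = [
        ("192.0.2.20", "00:50:56:AA:BB:CC"),
        ("192.0.2.21", "00:50:56:aa:bb:dd"),
        ("192.0.2.20", "00:00:00:00:00:00"),
        ("192.0.2.20", "00:50:56:aa:bb:cc"),
    ]
    for ip, macs in find_ip_conflicts(seen).items():
        print(f"{ip} answered from {len(macs)} different MACs: {', '.join(macs)}")
```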