Blog: Veeam

Recently, quiesce snapshot jobs for some customers kept showing up as failed in Veeam with the error "msg.snapshot.error-quiescingerror".

After exhausting several research options I called VMware Support and we began sifting through event log files on the server as well as looking at the VSS writers and how VMware Tools was installed.

Looking at the log files on the ESX host the server was on led to this article:

https://kb.vmware.com/s/article/2039900

A folder named backupScripts.d gets created and references a path C:\Scripts\PrepostSnap\ which is empty. Therefore the job fails. The fix is found below:

  1. Log in to the Windows virtual machine that is experiencing the issue.
  2. Navigate to C:\Program Files\VMware\VMware Tools.
  3. Rename the backupscripts.d folder to backupscripts.d.old

If that folder is not present, and/or if the job still fails, the VMware services checked.


 

After we completed a customer’s upgrade to ESXi 5.5.3, their Veeam jobs started failing, with an error message stating the files for the virtual machines did not exist or were locked. Since the VMs were migrated to a new ESX host as a part of the upgrade, I thought the old hosts may have put a lock on some of the VM files for some reason, so I shut them down. After they were shut down, the jobs still failed but the error message changed saying that the backups failed because a NFC storage connection was not available.

Research of this error led me to an article (https://www.veeam.com/kb1198) which directed me to some backup log files. In these backup log files, I kept entries indicating Veeam was trying to establish a connection with the SSL server, but it failed due to an unsuccessful SSLv3 handshake since ESXi 5.5.3 disables SSLv3 due to vulnerabilities with the protocol.

Some more research led me to another Veeam KB article (https://www.veeam.com/kb2063) stating that this was a known bug with Veeam 7.0. The article says, “Veeam Backup & Replication is designed to use TLS or SSL, however a bug in parsing the list of supported SSL/TLS protocol versions within Veeam Backup & Replication when communicating with VMware causes the job to fail without attempting to use TLS,” and the solution is to upgrade to Veeam 8 update 3. Since this customer’s Veeam renewal was coming up, I went ahead and upgraded them to Veeam 9 and, after doing so, their backups started running without any issues.


 

One of our customers reported their Veeam backups were failing. We determined the cause to be the vCenter services were stopped and would not restart. The vCenter issue was a result of the SQL Express database having grown to its 10GB maximum size. We were able to get the vCenter services running temporarily by purging performance data from the database using the procedure at http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1007453. [more]

This procedure removed enough data to get the services running, but didn’t reduce the overall size of the database significantly. I found a VMware SQL stored procedure named “dbo.cleanup_events_tasks_proc” that reduced the size of the database by 60%. After a couple of shrink file operations, the database and the vCenter services were up and running. 

However the Veeam backups failed yet again the next night. While the Veeam errors indicated that the vCenter services were again offline, this time it was because the virtual disk containing the SQL Server Express vCenter database was completely full. The transaction log for the vCenter database had bloated to 24GB and filled up the disk. This was confusing initially because I had checked the recovery model of the database prior to running the stored procedure to make sure it was set to “Simple” to prevent this very issue. 

With SQL Server the growth of the transaction log is directly proportional to amount of “work” that SQL Server has to perform between BEGIN TRANSACTION and COMMIT TRANSACTION commands. Certain SQL Server commands (insert, update, and delete) are always wrapped in implicit transactions. But some bulk operation transactions can be executed with explicit BEGIN/END TRANSACTION commands to control roll back. The stored procedure that I ran wraps a potentially large batch purge process in a SQL transaction that enables the entire process to be rolled back in the event of a failure. In this case, the lengthy stored procedure resulted in a ridiculously huge transaction log. Lesson learned is that “Simple” recovery model doesn’t guarantee the transaction logs will always be a manageable size.


 

Block level vmdk backups have limitations that will GET YOU.  Backup Exec and Veeam both have the ability to backup the vmdk files in a VMware environment and still retain enough information in the backup set to do individual file level restores.  However, both products will ONLY work if you have vmdk disks partitioned using the MBR (Master Boot Record) type tables and NOT the more modern GPT (Guid Partition Table) structure.


 

I learned the reason that VMware suggests having service consoles for ESX hosts on at least two distinct networks last week. I was troubleshooting intermittent backup issues with Veeam on a customer network and couldn’t really find any pattern to the failures. Two or three backups in a row would run successfully, then 5 in row might fail. The behavior was very random. However, the failures were always on Virtual Machines associated with a specific ESX host. At first I thought the host was healthy, but after watching the VI client for an extended period of time, I noticed that the ESX host would drop offline (showing disconnected in the VI client) and then come back online again.  This indicated the problem wasn’t just affecting the management/backup server. [more]

In order to level set my troubleshooting efforts, I decided to reboot this ESX host. However, after the reboot, I could not connect to it with the VI client. I could ping the IP assigned to the service console, but couldn’t SSH or connect via the VI client. I logged in via iLO and found that an ifconfig at the command line returned IP = 0.0.0.0…..interesting. So what is responding to my pings. I checked the arp cache on one of the switches and found that a thin client had been plugged in that had the same IP as my LAN service console. What is really odd is the MAC address for the thin client was all zeros AND the IP I was using for the LAN service console is not even available to be distributed by DHCP. I was not able to connect to the thin client to see how it was configured, but I was able to connect to ESX host via a second service console port that I placed on the iSCSI network. The management/backup server has a connection to the iSCSI network to do backups to disk so I was able to change the LAN-facing service console IP to another IP and everything started working fine. The backup issue was obviously being caused by changes in the arp entries on the backup server between the thin client and the ESX host. So, be aware that at boot-time, if ESX determines that the IP it is using for a service console is already in use, it just rips it out of the configuration and continues to boot with NO WARNINGS or ERRORS on the console.