Blog: CommVault

A coworker and I have been doing a lot of work lately on the CommVault email archiving and compliance products. CommVault's email compliance solutions provide two ways to access data collected by the email compliance archiving agents. The end-user compliance portal lets a user log in and search only their own email, whereas the compliance portal allows searching all email that has been collected via journaling. We were able to reproduce the following issue:

A user with a specific employment date (let’s say 10.1.2010) could log in and see email that was sent prior to his/her employment date. They couldn’t see ALL email, just certain email.

Long story short, as part of a troubleshooting task with CommVault support, our customer had created a “special” configuration that enabled the compliance agents to harvest all mail from all mailboxes in the Exchange environment. Part of the work the CommVault indexing engine does is to look at each email message and “mark” it so that it can be found by the associated parties via the end-user search portal. It does this by looking up all parties on the email in Active Directory, then associating the message with the GUIDs of all users who should have access to it via end-user search. In our case, when all the emails were “harvested” from all Exchange mailboxes, a specific set of emails that had been sent to a distribution group was pulled in. The indexing engine expands distribution groups and links the GUIDs accordingly. Emails to that distribution group go back farther in time than the employment of the user in question, but the user is CURRENTLY a member of the group. So, when the indexing server expanded the group, that user was associated with those messages… and voilà, access to email sent prior to employment via end-user search.
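To make the mechanism concrete, here is a minimal sketch of the behavior described above. It is not CommVault's code; the group name, user names, hire dates, and message fields are all made up for illustration. The point is only that expanding a group with its current membership links old messages to users who joined later.

```python
from datetime import date

# Hypothetical data: CURRENT group membership and employee start dates.
GROUP_MEMBERS = {"sales-dl": {"alice", "bob"}}
HIRE_DATES = {"alice": date(2008, 3, 1), "bob": date(2010, 10, 1)}


def expand_recipients(recipients, groups):
    """Replace each distribution group with its current members."""
    users = set()
    for recipient in recipients:
        # Anything that is not a known group is treated as a single user.
        users |= groups.get(recipient, {recipient})
    return users


def visible_before_hire(messages, groups, hire_dates):
    """Return (user, subject) pairs where a user is linked to mail sent before their hire date."""
    hits = []
    for msg in messages:
        for user in sorted(expand_recipients(msg["to"], groups)):
            hired = hire_dates.get(user)
            if hired is not None and msg["sent"] < hired:
                hits.append((user, msg["subject"]))
    return hits


messages = [
    {"subject": "Q3 forecast", "sent": date(2009, 6, 15), "to": ["sales-dl"]},
    {"subject": "Welcome Bob", "sent": date(2010, 10, 4), "to": ["bob"]},
]
print(visible_before_hire(messages, GROUP_MEMBERS, HIRE_DATES))
```

In this toy data set, bob ends up linked to the 2009 message only because he is in the group today, which is the same pattern we saw in the end-user portal.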


 

On any VMware virtual machine running Windows 2008 or 2008 R2 that was created on ESX v4.1, the advanced configuration parameter disk.enableUUID is set to TRUE. Basically, this enables application-level quiescing in the VM. If the VM was created on ESX prior to v4.1, the advanced configuration setting does not exist. So, if you want application consistency on a VADP (vStorage API style) initiated backup, it won’t happen unless that setting is set to TRUE. This is a problem because a number of vendors (CommVault included) don’t support this feature yet. Since it is the default for new VMs, those VMs won’t back up correctly.

The bottom line is: make absolutely sure you are getting application-consistent backups by checking the application logs on the VM while the backup runs. Your backups may not be as consistent as you think.
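If you want to see which way a given VM is configured, one option is to look for the parameter in the VM's .vmx file. The sketch below is a small parser for .vmx-style `key = "value"` text. It assumes the key can be matched case-insensitively, and the sample text is made up; it only reports the setting and does not tell you whether your backup product honors it.

```python
def vmx_setting(vmx_text, key):
    """Return the value of `key` from .vmx-style text, or None if the key is absent."""
    for line in vmx_text.splitlines():
        name, sep, value = line.partition("=")
        # Assumption: key names are compared case-insensitively.
        if sep and name.strip().lower() == key.lower():
            return value.strip().strip('"')
    return None


def enable_uuid_is_true(vmx_text):
    """True only when disk.enableUUID is present and set to TRUE."""
    value = vmx_setting(vmx_text, "disk.enableUUID")
    return value is not None and value.upper() == "TRUE"


# Made-up sample contents for two VMs.
new_vm = 'displayName = "web01"\ndisk.EnableUUID = "TRUE"\n'
old_vm = 'displayName = "legacy01"\n'
print(enable_uuid_is_true(new_vm), enable_uuid_is_true(old_vm))
```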


 

I had an IT consulting customer email me requesting assistance with extending the system partition on a Windows 2003 virtual machine. The partition had been running low on disk space for a while. The customer had extended the vmdk using VMware, but was unable to extend the partition using diskpart. This is normal behavior for a Windows 2003 system so I scheduled downtime so that I could use VMware Converter to fix the problem.

I have done this operation a number of times in the past. You simply tell Converter to convert the VM and target the same ESX cluster with the imported copy. During the operation, Converter gives you the option to change the partition size. Windows recognizes the partition size change at first boot and you are good to go. However, the customer failed to tell me that they had un-marked the C: partition as active while trying to get the disk to extend. When I shut the VM down to clone it, it never came back up. Neither did the imported copy. Both were completely useless. They would boot to an “Operating System not found” error.

I tried fixboot and fixmbr from the recovery console but neither worked. I ended up restoring from a CommVault backup. Later, based on some comments from coworkers, I decided to see if I could fix this problem by mounting the disk to another VM and adding back the “active partition” status. I mounted the vmdk that was broken to a Windows 2008 server and using disk manager re-marked the partition as active. Sure enough, after dismounting from the temp VM the original VM booted up no problem. Just one more reason to use virtual machines.
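I used the Disk Management GUI for this, but the same change can be scripted with diskpart on the helper VM. The sketch below just builds the script text. The disk and partition numbers are placeholders; check them with `list disk` and `list partition` first, because marking the wrong partition active can break the helper VM instead.

```python
def diskpart_mark_active(disk_number, partition_number):
    """Build a diskpart script that marks one partition as active."""
    lines = [
        f"select disk {disk_number}",
        f"select partition {partition_number}",
        "active",
        "exit",
    ]
    return "\n".join(lines) + "\n"


# Placeholder numbers: disk 1 is assumed to be the attached (broken) vmdk.
script = diskpart_mark_active(1, 1)
print(script)
```

Save the output to a file (for example mark_active.txt) and run it on the helper VM with `diskpart /s mark_active.txt`.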


 

The CommVault Exchange Mailbox iData agents do not back up mailboxes associated with disabled Windows user accounts. The backup job reports a "success" status, but when you explore the details of the backup, the backup set does not contain any data for those mailboxes. Additionally, requesting a listing of all failed objects for the backup job results in a "no failures" status. According to CommVault, this behavior is by design, as is the "successful" backup status. After all, the job did not technically fail if it is not designed to include mailboxes belonging to disabled user accounts. This is very strange given that, in general, CommVault iData agents have an "inclusive by default" behavior. This can become a real problem if you try to restore data for a former employee whose Windows user account was disabled when they left the company. The lesson here is that you should always test your backups. Even if the backup report and all job status notifications indicate you are good... test anyway.
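One way to test for this particular gap is to compare the mailboxes you can browse in the backup against the accounts that are disabled in Active Directory. The sketch below assumes you have already exported two things yourself: each account's userAccountControl value and the list of mailboxes that actually appear in the backup. The account names are made up. The only fixed fact used is that bit 0x0002 of userAccountControl marks a disabled account.

```python
ACCOUNTDISABLE = 0x0002  # userAccountControl bit for a disabled account


def is_disabled(user_account_control):
    """True when the ACCOUNTDISABLE bit is set."""
    return bool(user_account_control & ACCOUNTDISABLE)


def unprotected_mailboxes(accounts, backed_up):
    """List disabled accounts whose mailboxes do not appear in the backup.

    accounts:  {mailbox_name: userAccountControl value}
    backed_up: set of mailbox names found when browsing the backup
    """
    return sorted(
        name
        for name, uac in accounts.items()
        if is_disabled(uac) and name not in backed_up
    )


# Made-up export: 512 is a normal enabled account, 514 is the same account disabled.
accounts = {"jsmith": 512, "former.employee": 514}
print(unprotected_mailboxes(accounts, {"jsmith"}))
```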


 

One of our customers uses VMware VCB backups integrated with CommVault Simpana. The CommVault job simply calls a pre-backup script to snapshot the VM and copy all the VM files to the VCB proxy, backs up the files from the proxy to the CommVault media server, then a post-backup script commits the snapshot and purges the VM files from the VCB proxy.

Recently, we upgraded this customer from VMware VI3.5 to VMware vSphere v4 Update 2. For most of the VMs that are backed up with VCB, we had no issues at all. The backups ran the weekend following the upgrade with no issues. However, all of the VMs that had been secured with the Windows Security Configuration Wizard would not back up. These VMs are in the DMZ and are locked down very tight because they host externally available web applications. The issue is that each time a backup was initiated from CommVault, the VCB script would return a non-zero error due to a snapshot failure in VMware. VMware’s error was “Cannot create a quiesced snapshot because the create snapshot operation exceeded the time limit for holding off I/O in the frozen virtual machine.” This would happen when using VCB scripts, but I could create a snapshot without error from the VI client.

After much research and testing, I determined that the problem was a holdover from the VMTools upgrade. The new version of VMTools installs a new service called VMware Snapshot Provider when VMTools is upgraded. Its purpose is to help facilitate application-consistent snapshots through VMTools. On the servers that were getting the “quiesced snapshot” error, this service was not present at all, even though VMTools had already been updated… very strange. Here is where the Security Configuration Wizard comes in. Part of our lockdown policy is to disable a service called COM+ System Application. This service manages the configuration and tracking of COM+ based components. Apparently, without this service enabled, the VMTools upgrade will NOT install the VMware Snapshot Provider service. Without that service, there are no quiesced snapshots, and you get errors when creating snapshots via the VCB integration modules.

So why could I create a snapshot from the VI client? Well, VMware knows that you are using VCB to create snapshots for the purpose of backup. What good would the backup be if it wasn’t app consistent? The VI client, on the other hand, will first try to create an app-consistent snapshot, but if that fails or times out, it will go ahead and create a “crash consistent” snapshot without error. VCB is not as forgiving. If the guest quiesce fails, the snapshot fails… end of story. The solution was to uninstall VMTools, reboot, temporarily enable and start the COM+ System Application service, install VMTools, then disable the COM+ System Application service again. Backups have been running fine ever since.
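If you have a batch of locked-down VMs to check, the presence of the snapshot provider can be tested from the output of `sc query`. The sketch below only parses text; the sample output strings are illustrative, and the short service names in the comments (vmvss for VMware Snapshot Provider, COMSysApp for COM+ System Application) are what I would expect to query, so verify them on your own systems.

```python
import re

# Illustrative `sc query COMSysApp` output for a stopped service.
SAMPLE_STOPPED = """
SERVICE_NAME: COMSysApp
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 1  STOPPED
"""

# Illustrative `sc query vmvss` output when the service is not installed.
SAMPLE_MISSING = (
    "[SC] EnumQueryServicesStatus:OpenService FAILED 1060:\n\n"
    "The specified service does not exist as an installed service.\n"
)


def service_state(sc_query_output):
    """Return the state word (e.g. 'RUNNING', 'STOPPED') or None if there is no STATE line."""
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_query_output, re.MULTILINE)
    return match.group(1) if match else None


def snapshot_provider_installed(sc_query_output):
    """Treat any reported state as 'installed'; a missing STATE line means the service is absent."""
    return service_state(sc_query_output) is not None


print(service_state(SAMPLE_STOPPED), snapshot_provider_installed(SAMPLE_MISSING))
```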


 

I had a very strange issue come up with PlateSpin the other day. The PlateSpin protection job for a server hadn’t been completing successfully since it was upgraded to a new build. The job would run all the way up to the point where it was taking the VSS snapshots of the source machine, then it would die with a very cryptic VSS-related error. This would cause the VSS System Writer to display an error state when using vssadmin list writers. I engaged PlateSpin support, and after about two weeks of going back and forth, they finally cut me loose with a “call Microsoft” recommendation. I kept troubleshooting and found that when I tried to clear out the VSS snapshots by changing the maximum space setting for the VSS snapshots to 300 MB (which is supposed to be the minimum amount required for an x86 system), I would get an error pop-up noting that 300 MB was not a sufficient amount of space for snapshots on that volume. I finally found by process of elimination that 1800 MB was the magic number for the c: volume. However, even though the drive had over 2.5 GB of free space, the PlateSpin job would still fail.
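Checking writer state by eye gets tedious when you are rerunning a job over and over. Here is a rough sketch that pulls the writers reporting an error out of `vssadmin list writers` output. The sample text is abbreviated and illustrative (real output has more fields per writer), and the specific state and error strings shown are examples rather than what this server reported.

```python
def failed_writers(vssadmin_output):
    """Return (writer_name, last_error) for every writer whose last error is not 'No error'."""
    failed = []
    name = None
    for raw in vssadmin_output.splitlines():
        line = raw.strip()
        if line.startswith("Writer name:"):
            name = line.split(":", 1)[1].strip().strip("'")
        elif line.startswith("Last error:") and name is not None:
            error = line.split(":", 1)[1].strip()
            if error != "No error":
                failed.append((name, error))
    return failed


# Abbreviated, illustrative output.
sample = """
Writer name: 'System Writer'
   State: [7] Failed
   Last error: Timed out

Writer name: 'Registry Writer'
   State: [1] Stable
   Last error: No error
"""
print(failed_writers(sample))
```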

As a last resort, I changed the storage location for the VSS snapshots of the c: partition to the d: partition (which had over 20 GB of free space), then ran the job again. This time, the job ran a little bit further, then died when trying to snapshot the f: partition (which was only used for a page file). After moving the VSS snapshots for the f: partition over to d:, the job ran successfully… finally. What was very strange is that the VSS snapshots would always reserve the same amount of space for each partition as the maximum setting for the c: partition. I could change the maximum space setting for snapshots on the c: partition and run the job again, and the snapshots for all partitions would match the c: partition no matter what maximum I had specified for the individual partitions. This did not happen when I snapshotted the partitions with vssadmin, or when backing the server up with CommVault (which uses VSS)… only with PlateSpin. It looks to me like their software has a bug in it. I have emailed the support tech I was working with to explain what I found… no response so far.
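For reference, the relocation can also be done from the command line with vssadmin's shadowstorage subcommands. The sketch below only builds the command strings so you can review them before running anything. It assumes the existing association keeps the snapshots on the source volume itself, and the 2GB size is a placeholder. Removing the old association may be refused while shadow copies still exist on it, so treat this as a starting point rather than a tested procedure.

```python
def move_shadow_storage_commands(for_volume, on_volume, max_size):
    """Build vssadmin commands to re-point shadow storage for one volume to another volume."""
    return [
        # Assumption: the current association stores snapshots on the source volume.
        f"vssadmin delete shadowstorage /for={for_volume} /on={for_volume}",
        f"vssadmin add shadowstorage /for={for_volume} /on={on_volume} /maxsize={max_size}",
        # Verify the result afterwards.
        "vssadmin list shadowstorage",
    ]


commands = []
for volume in ("C:", "F:"):
    commands += move_shadow_storage_commands(volume, "D:", "2GB")
print("\n".join(commands))
```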


 

We have been having trouble with a SCSI card attached to a tape drive on a CommVault Media Agent server. The card was brand new and the drivers were Windows 2k3 certified. We started having issues with this server during the CommVault install. The server would just spontaneously reboot, leaving the CommVault backups in disarray. Troubleshooting led us to update the firmware on the card, the tape library firmware & driver, and the tape drive firmware & driver. This fixed the problem for a few days, then it would happen again. It would only happen when doing an auxiliary copy from disk to tape. After some deep-dive troubleshooting on the SCSI I/O bus, we were able to get some logs from the time immediately before one of the spontaneous reboots. From the logs we found that the card actually had some type of problem that caused extended I/O latencies during periods of high traffic (aux copies). We ordered an Adaptec card and installed it. Now, not only are copies to tape 2x faster, it hasn’t crashed . . . yet.