A couple of weeks ago, one of our customers had their Exchange SCR copy fail due to a corrupt log file. At first we assumed the log file had been corrupted in transit to the DR site, but after recopying it several times and attempting to restart replication, we realized the log file was actually corrupt on the source server, which is a virtual machine. I had never seen this happen before and was a little surprised that the corrupt log file had not taken the mailbox database offline. With nothing to attribute the corruption to, I decided it must have been a fluke and started a database reseed the following weekend. After 3 days the seeding finished, but 4 hours after the reseed completed, the SCR copy failed again…another corrupt log file.
I decided there must be a bigger issue. I reviewed the logs and found numerous Event ID 7 errors (bad block on disk) and a few pvscsi warnings. It seemed logical that the paravirtualized SCSI (PVSCSI) adapter used on this virtual machine might be causing the problem…maybe it was a weird PVSCSI / Windows Server 2008 issue. I had to take a break from this issue to troubleshoot another server issue for the same customer. In doing so, I had an idea…what if a physical disk was going bad but hadn’t completely failed? Could that leave the underlying VMware VMFS partition looking fine while causing problems with the virtual disk files attached to VMs? I used iLO to check the hardware status, and sure enough, one of the disks had logged numerous SMART errors and was marked “impending failure”. The array was not degraded yet because the disk had not completely failed. I have replaced the disk and will reseed the database soon. Since the replacement there have been no more bad block errors on this VM, so it looks promising.
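If you want to keep an eye on whether the bad block errors really have stopped, something like the following rough sketch might help. It assumes you have exported the System event log to CSV with columns named `TimeCreated`, `Id`, `ProviderName` and `Message` (an assumption about your export, not something taken from this customer's setup), and it simply counts Event ID 7 entries from the `disk` source per day. The function name and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def bad_block_events_per_day(csv_text, event_id="7", provider="disk"):
    """Count events matching event_id and provider, grouped by calendar day.

    Assumes hypothetical CSV columns: TimeCreated, Id, ProviderName.
    TimeCreated is assumed to start with the date, followed by a space.
    """
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        same_id = row["Id"].strip() == event_id
        same_provider = row["ProviderName"].strip().lower() == provider
        if same_id and same_provider:
            day = row["TimeCreated"].split()[0]  # keep only the date part
            counts[day] += 1
    return dict(counts)


# Invented sample rows, only to show the expected input shape.
sample = """TimeCreated,Id,ProviderName,Message
2011-03-01 02:14:07,7,disk,bad block
2011-03-01 02:15:44,7,disk,bad block
2011-03-01 03:00:00,129,pvscsi,made-up warning
2011-03-02 11:30:12,7,disk,bad block
"""

if __name__ == "__main__":
    for day, count in sorted(bad_block_events_per_day(sample).items()):
        print(day, count)
```

A day-by-day count that drops to zero after the disk swap would be consistent with the failing disk having been the culprit, though it is not proof on its own.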