We recently experienced a very strange issue with Exchange 2007 CCR. We had the MNS cluster w/file share witness running and the CCR mailbox servers were all replicating nicely. However, at very random intervals, replication would just stop happening from the primary to the secondary node. During these times, I could not RDP to the server, but I could ping it and log on locally so it wasn’t frozen in the literal sense. File share FROM the machine worked, but file share TO the machine didn’t. Rebooting the passive node would fix the issue. After about 4 days of troubleshooting (2 of those with Microsoft), I think the mystery may be solved. It goes something like this… [more]
In Windows Server 2003 SP2, Microsoft introduced a new set of features collectively known as “Scalable Networking Pack”. This package of features includes a TCP Chimney Offload (TOE) feature, a Receive-side Scaling feature, and a NetDMA feature. Basically, this allows network card driver developers to implement offload features on the NICs so that the a certain portion of the network stack can be offloaded to the NIC card processor. It is a great idea, but unfortunately, the driver manufacturers haven’t implemented the technology correctly. Partly because the feature set is buggy and partly because the NIC drivers are not thoroughly tested. One of the worst instances of this situation is with Broadcom NICs (yes both HP and Dell use Broadcom chipsets). Generally, what happens is that the server starts exhibiting very strange RCP-related issues. RDP may not work, management via WMI may be broken, event log viewing will be VERY slow, etc. In my case, Exchange 2007 replication stopped working. So, if you notice these types of behaviors or experience any type of issue where RCP just doesn’t seem to be working correctly, set the following registry keys to 0.
Then reboot the server. This basically turns off any offloading features at the OS level.