Last week I learned why VMware suggests putting the service consoles for ESX hosts on at least two distinct networks. I was troubleshooting intermittent Veeam backup failures on a customer network and couldn’t find any pattern to them. Two or three backups in a row would run successfully, then five in a row might fail. The failures seemed random, except that they always involved virtual machines on one specific ESX host. At first I thought the host was healthy, but after watching the VI client for an extended period, I noticed that the ESX host would drop offline (showing as disconnected in the VI client) and then come back online. This indicated the problem wasn’t just affecting the management/backup server.
To get a clean baseline for troubleshooting, I decided to reboot this ESX host. After the reboot, however, I could not connect to it with the VI client. I could ping the IP assigned to the service console, but I couldn’t SSH to it or connect via the VI client. I logged in via iLO and found that ifconfig at the command line returned an IP of 0.0.0.0. Interesting. So what was responding to my pings? I checked the ARP cache on one of the switches and found that a thin client had been plugged in with the same IP as my LAN service console. What is really odd is that the MAC address for the thin client was all zeros AND the IP I was using for the LAN service console is not even in the range handed out by DHCP. I was not able to connect to the thin client to see how it was configured, but I was able to connect to the ESX host via a second service console port that I had placed on the iSCSI network. The management/backup server has a connection to the iSCSI network for backups to disk, so from there I was able to change the LAN-facing service console to another IP, and everything started working fine. The backup failures were evidently caused by the backup server’s ARP entry for that IP flipping between the thin client and the ESX host. So be aware that at boot time, if ESX determines that the IP it is using for a service console is already in use, it just rips it out of the configuration and continues to boot with NO WARNINGS or ERRORS on the console.
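If you want to watch for this kind of conflict, one rough approach is to collect (IP, MAC) pairs from ARP caches over time and flag any IP that shows up with more than one MAC, or with an all-zero MAC like the one I saw. The Python below is only a minimal sketch of that idea: the function names, the sample addresses, and the assumption that you have already pulled the pairs out of your switch or server ARP output are all hypothetical, not something from the environment described above.

```python
from collections import defaultdict

ZERO_MAC = "00:00:00:00:00:00"


def normalize_mac(mac):
    """Lowercase a MAC and use ':' separators so variants compare equal."""
    return mac.strip().lower().replace("-", ":")


def find_suspect_ips(observations):
    """Return {ip: sorted list of MACs} for IPs that look conflicted.

    An IP is flagged if it was seen with more than one MAC address,
    or if it was seen with an all-zero MAC.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        macs_by_ip[ip].add(normalize_mac(mac))

    suspects = {}
    for ip, macs in macs_by_ip.items():
        if len(macs) > 1 or ZERO_MAC in macs:
            suspects[ip] = sorted(macs)
    return suspects


# Hypothetical (IP, MAC) samples gathered from ARP caches at different times.
observations = [
    ("192.168.1.20", "00-1A-4B-AA-BB-CC"),
    ("192.168.1.20", "00:00:00:00:00:00"),
    ("192.168.1.21", "00:1a:4b:dd:ee:ff"),
    ("192.168.1.20", "00:1a:4b:aa:bb:cc"),
]

suspects = find_suspect_ips(observations)
print(suspects)
```

With the made-up samples above, only 192.168.1.20 is flagged, because it was seen with both a real MAC and the all-zero one. A check like this only tells you that something is fighting over the address; you still have to track down the offending device on the switch.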