Blog: ProLiant

After an HVAC crew at a customer's DR site triggered multiple power outages, their recently repurposed ESX host (an HP ProLiant DL360 G9), which wasn't connected to a UPS at the time, would no longer pass POST. It would stop at 'Loading System Firmware Modules' before faulting and hitting the Red Screen of Death (RSoD). After trying multiple options, we suspected the power surges and outages had caused a hardware failure, but the errors on the RSoD didn't point to one.
 
I found a related article (https://support.hpe.com/hpsc/doc/public/display?docId=mmr_kc-0128466) that is geared toward BL460c servers failing to POST after firmware upgrades, and decided it was worth a shot before going down the hardware replacement route.

"SYMPTOM: Server May on Rare Occasions Stop Responding during Power-On Self-Test (POST)
This issue occurs because the server reads unexpected data values from the Non-Volatile RAM (NVRAM) or has found a boot block corruption and may exhibit one of the following symptoms:
  • Server may not display video
  • Server NIC port may be disabled
  • Server may not boot
 
Cause
Non-volatile RAM (NVRAM) holds its state after the master device/circuit is powered off. Hardware typically uses CMOS (complementary metal oxide semiconductor) to implement NVRAM and incorporates a battery power source to retain system settings. Clearing it resets the current assignments of IRQs and such, unless there is a hardware conflict."

The system board has a 'System Maintenance Switch', a bank of individual switches that each perform a different action. We powered down the server, set switch 6 (Clear CMOS and NVRAM) to the ON position, powered the server up to clear NVRAM, powered it back down, returned switch 6 to OFF, and finally powered it back up. Thankfully, this cleared up the issue completely and the server booted without a problem. The takeaway: work through your other options before assuming a hardware failure.


 

There are power management settings that should be checked when running ESX on HP ProLiant G6 and above or Dell PowerEdge 11th and 12th Generation servers. See the VMware article for details: https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1018206

The ProLiant G8 that I examined for performance issues was set in the BIOS to use "HP Dynamic Power Savings mode" instead of "HP Static High Performance mode". This can limit the virtual machines' ability to use the host's CPU. The setting can be changed through iLO, without going into the BIOS directly. Even better, changing it this way does not require a reboot of the ESX host.
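If you would rather script the change than click through the iLO web interface, here is a minimal sketch that builds the RIBCL XML for it. It assumes the `SET_HOST_POWER_SAVER` command and the mode numbers (1 to 4) as I understand them from HP's iLO scripting guide; check both against the guide for your iLO version before using it. The function name, the mode keys, and the credentials are made up for illustration, and the sketch only generates the XML; it does not send anything to an iLO.

```python
import xml.etree.ElementTree as ET

# Assumed mapping of power regulator modes to RIBCL values.
# Verify against the iLO scripting guide for your iLO version.
POWER_SAVER_MODES = {
    "os_control": "1",
    "static_low_power": "2",
    "dynamic_power_savings": "3",
    "static_high_performance": "4",
}


def build_power_saver_ribcl(mode, user, password):
    """Return a RIBCL document (as a string) that sets the power regulator mode."""
    if mode not in POWER_SAVER_MODES:
        raise ValueError(f"unknown power saver mode: {mode!r}")
    ribcl = ET.Element("RIBCL", VERSION="2.0")
    login = ET.SubElement(ribcl, "LOGIN", USER_LOGIN=user, PASSWORD=password)
    server_info = ET.SubElement(login, "SERVER_INFO", MODE="write")
    ET.SubElement(
        server_info,
        "SET_HOST_POWER_SAVER",
        HOST_POWER_SAVER=POWER_SAVER_MODES[mode],
    )
    return ET.tostring(ribcl, encoding="unicode")


if __name__ == "__main__":
    # Placeholder credentials; substitute your own iLO account.
    print(build_power_saver_ribcl("static_high_performance", "admin", "secret"))
```

The resulting XML would then be handed to whatever RIBCL tooling you already use for that iLO.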


 

You know the situation… you’re keeping lots of plates spinning, multi-tasking, generally having a productive day when you look down and realize that you have 87 windows open on your task bar…the Windows Weeds. In my particular situation I was about to install a ProLiant Support Pack (PSP), a collection of driver updates approved by HP for a particular server model. The order of events was as follows:

  • Downloaded compressed PSP from HP website to a network location
  • Extracted PSP to a child folder on the network
  • Was planning to copy the extracted files from the network to a Temp directory on the terminal server I was working on to perform a local installation of the PSP
  • Received a support call for a different problem and turned my attention to that task for a while
  • Came back to the PSP task. On the terminal server to be updated, I saw that there was already a d:\cnx\temp folder that had what appeared to be my extracted PSP files in it. With so many windows and directories open, I thought I had already copied the extracted files over to the local directory before I got called away to the other task.
  • Upon installing the extracted PSP files from the local directory, several core networking components crashed, rendering the teamed NICs on this server useless. Neither re-running the whole PSP nor installing just its updated networking driver helped at all.

As it turns out, the directory and PSP files in d:\cnx\temp had been created by another engineer during maintenance procedures several months earlier. It was just coincidence that the same type of installation files (just an earlier version) were in the exact location I was going to create…what are the odds?!? When extracted, PSP files from different versions look the same, so there was nothing to tip me off that I was actually installing an old (and also corrupted) PSP.

Lessons learned:

  1. Keep a tidy task bar
  2. Clean up old temp files when finished with them
  3. Question your presuppositions if things aren’t adding up (e.g. “I KNOW these are my PSP files”)
  4. The unlikely is still a possibility (e.g. the same files, just different versions, in the same directory).
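
One cheap way to act on lessons 3 and 4 is to confirm that the local copy really matches what you just extracted before running the installer. Here is a minimal sketch using only the Python standard library. It hashes every file under two directory trees and reports the differences. The function names are made up for illustration, and the paths in the usage note are placeholders.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large installers don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_hashes(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in root.rglob("*")
        if p.is_file()
    }


def compare_trees(source, local):
    """Report files missing from, extra in, or different in the local copy."""
    src, dst = tree_hashes(source), tree_hashes(local)
    return {
        "missing": sorted(set(src) - set(dst)),
        "extra": sorted(set(dst) - set(src)),
        "changed": sorted(n for n in set(src) & set(dst) if src[n] != dst[n]),
    }
```

For example, `compare_trees(r"\\fileserver\share\psp_extracted", r"d:\cnx\temp")` (both paths hypothetical) should return three empty lists if the local folder is a faithful copy. Anything in "changed" would have been my cue that those were not my PSP files.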