Earlier this week we started the process to upgrade one of our hypervisor compute clusters when we encountered a rather painful bug with HP’s Broadcom NIC chipsets.
We were partway through a routine rolling pool upgrade of our hypervisor (XenServer) cluster when we observed unexpected and intermittent loss of connectivity between several VMs, then entire XenServer hosts.
The problems appeared to impact hosts that hadn’t yet been upgraded to XenServer 7.2. We now attribute them to extreme packet loss between the hosts in the pool, caused by buggy firmware from Broadcom and HP.
We experienced extreme packet loss between hosts in the cluster. With XenServer, the pool master must be upgraded first. As a result, XAPI pool management suffered a communication breakdown across the management network, which complicated diagnosis. In fact, the connectivity problems went unnoticed until many hours after the master was upgraded.
At first it appeared as if the problem was caused by the pool being partially upgraded.
We wondered if we had perhaps made a poor decision to run the upgrade on a single node for a few hours to observe its performance. We made the call to upgrade another host and analyse our findings.
The next upgraded host appeared stable. In fact, we later found this host wasn’t impacted by the bug. We then made the call to upgrade several more nodes and continue to track their stability.
After upgrading half the pool, we suddenly hit problems. VMs failed, and hosts started dropping out of the pool and losing track of the power state of running/stopped VMs.
We found that the master, along with one of the other hosts, was experiencing major packet loss on its management network cards. We suspected faulty NICs, as it wouldn’t be the first time a Broadcom NIC had failed us and there is no physical network cabling to blame.
Broadcom has had its fair share of bad press over the years, with many botched firmware updates and proprietary driver issues. I recommend that people steer clear of network cards based on their chipsets.
Downgrading The Firmware
As soon as we spotted the packet loss on the Broadcom NICs we upgraded their firmware to 2.19.22-1, with no improvement. We then downgraded to 2.18.44-1 / 7.14.62, again with no improvement. We even went as far as trying 2.16.20 / 7.12.83 from back in 2015, but still no luck.
At the time of writing, no firmware downgrades (or upgrades) have fixed the issue.
The packet loss manifests itself immediately after rebooting or power cycling, but not on every reboot! This is the odd thing: approximately half the time a host boots it is fine, and stays fine until the next boot.
We’ve compared the modinfo output between boot cycles and can’t find anything that stands out.
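For anyone wanting to repeat this kind of comparison, here is a minimal sketch of how two captures could be diffed. It assumes the output of a command such as modinfo bnx2x has been saved to a text file after a good boot and again after a bad boot. The function name and the file names in the usage comment are placeholders of ours, not part of any XenServer tooling.

```python
import difflib


def changed_lines(good_capture, bad_capture):
    """Return only the lines that differ between two text captures.

    Lines present only in the good capture are prefixed with "- ",
    lines present only in the bad capture with "+ ".
    """
    diff = difflib.ndiff(good_capture.splitlines(), bad_capture.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? ");
    # keep just the real additions and removals.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical usage, assuming the two files were saved by hand:
#
#   with open("modinfo-good-boot.txt") as good, open("modinfo-bad-boot.txt") as bad:
#       for line in changed_lines(good.read(), bad.read()):
#           print(line)
```

An empty result means the two captures are identical, which is what we saw: nothing in the module metadata distinguishes a good boot from a bad one.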
The bug seems to be caused by the version of the bnx2x driver present in XenServer 7.2’s kernel. Upon further reading, HP recommends using bnx2x driver 7.14.29-2 or later. As XenServer still uses the old kernel version 4.4.0, that’s not currently an option.
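To check quickly whether a host is running the driver version we hit the problem with, something like the sketch below could be used. It assumes the text passed in is the output of modinfo bnx2x, which includes a line starting with "version:". The constant 1.714.1 is the bnx2x module version we saw on XenServer 7.2; the function names are our own, and the check is an exact match only, so it says nothing about other versions.

```python
import re

# bnx2x module version we observed on XenServer 7.2 (from this post).
AFFECTED_BNX2X_VERSION = "1.714.1"


def module_version(modinfo_output):
    """Extract the value of the 'version:' field, or None if absent.

    The pattern is anchored to the start of a line so that the
    'srcversion:' field is not picked up by mistake.
    """
    match = re.search(r"^version:\s*(\S+)", modinfo_output, re.MULTILINE)
    return match.group(1) if match else None


def is_affected(modinfo_output, affected=AFFECTED_BNX2X_VERSION):
    """True only when the reported version equals the known-bad one."""
    return module_version(modinfo_output) == affected
```

A host reporting 1.713.04, the version shipped with XenServer 7.0, would come back as not affected.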
I suspect that it’s a bug in the Broadcom firmware loaded into the NIC upon boot, possibly a race condition related to the device’s interrupt handling (MSI/MSI-X).
XenServer needs to update its kernel, or at least the bnx2x driver module, past the version that triggers the bug. I’ve logged a ticket for this over at bugs.xenserver.org.
Additionally, XenServer didn’t notice the packet loss/network interruptions during the rolling pool upgrade. I have reported this concern and have suggested that XenServer add pool-wide checks for connectivity issues between hosts, at least during a pool upgrade.
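To illustrate what we mean by a pool-wide check, here is a rough sketch. It is not how XAPI works and not a proposed patch. It assumes packet loss has already been measured between every pair of hosts on the management network by some other means, and that the results are held in a dictionary keyed by host pair. The host names, the 1% threshold and all function names are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def host_pairs(hosts):
    """Every unordered pair of hosts that should be tested."""
    return list(combinations(sorted(hosts), 2))


def lossy_links(loss_by_pair, threshold=1.0):
    """Pairs whose measured packet loss (in percent) exceeds the threshold."""
    return sorted(pair for pair, loss in loss_by_pair.items() if loss > threshold)


def suspect_hosts(links):
    """Hosts ordered by how many lossy links they take part in.

    A host that shows up in most of the bad links is the likeliest culprit.
    Ties are broken alphabetically.
    """
    counts = Counter(host for pair in links for host in pair)
    return sorted(counts, key=lambda host: (-counts[host], host))
```

An upgrade step could then refuse to move on to the next host while lossy_links returns anything, rather than carrying on as ours did.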
We don’t have a (good) workaround.
Currently we’re simply testing for packet loss after boot on the management NIC. If it is detected, we reboot the host and check again. This is far from ideal, but until the bug is resolved there isn’t any other fix that we can find short of compiling a custom module for XenServer 7.2.
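The check itself is nothing clever. A minimal sketch of the idea follows, assuming a Linux ping (iputils) whose summary line contains "N% packet loss", and Python 3.7 or later on whichever machine runs it. The peer address, the packet count and the 1% threshold are placeholders, and the reboot itself is deliberately left out.

```python
import re
import subprocess

# Matches the loss figure in the summary line printed by iputils ping.
LOSS_PATTERN = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def parse_loss(ping_output):
    """Return the packet loss percentage reported in ping's summary."""
    match = LOSS_PATTERN.search(ping_output)
    if match is None:
        raise ValueError("no packet loss summary found in ping output")
    return float(match.group(1))


def measure_loss(peer, count=50):
    """Ping a peer on the management network and return the loss in percent."""
    result = subprocess.run(
        ["ping", "-q", "-c", str(count), peer],
        capture_output=True,
        text=True,
        check=False,  # ping exits non-zero when packets are lost
    )
    return parse_loss(result.stdout)


def needs_reboot(loss_percent, threshold=1.0):
    """Decide whether the host came up in the bad state."""
    return loss_percent > threshold
```

In practice this would be called once shortly after boot, for example needs_reboot(measure_loss(peer)) with peer set to the management address of another host in the pool.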
Given the widespread problems with Broadcom, we’ve ordered Intel-based HP 560M NICs to replace them.
The driver included with XenServer 7.2 that triggers the problem is bnx2x version 1.714.1, whereas XenServer 7.0 has driver version 1.713.04, which seems not to trigger the issue.
The affected configuration:
- HP 530M Network cards (as they use the Broadcom bcm57810 chipset), commonly found in BL460c Gen8 blades and similar.
- XenServer 7.2 (Patched to the latest XS72E006 patch)
- Kernel 4.4.0+10 as found in XenServer 7.2
- Broadcom bnx2x module version 1.714.1
- HP firmware for QLogic NX2 (seemingly all versions)