One of the earliest virtualisation best practices, dating back many moons, was to have network admins enable portfast on the switch ports connected to ESXi hosts. For many who know this, I find it tends to be a regurgitation of what they have read. Very few people actually understand why, or how to validate that it is actually working.
Let’s start with the why. In almost all enterprise networks with complex switching, the Spanning Tree Protocol (STP) is enabled. This is to stop network loops from forming even when loops exist in the physical cabling. Loops are bad, as they essentially create a feedback effect where network packets are sent and multiplied across the network. The result is a packet storm that floods the LAN and clogs up the network. So STP is important, and it is understandable why network admins want to enable it. The drawback of enabling STP is that for any network connection that tries to come online, the switch runs a series of checks to detect whether allowing the port up would create a loop. The detection process takes at least 10 seconds, and at times goes well beyond (classic 802.1D STP holds a new port in its listening and learning states for 15 seconds each before it is allowed to forward).
The result is that when a port comes online, it takes quite a long while before it is allowed to carry any traffic. The switch essentially blocks all traffic until it is satisfied that there are no loops.
Looking at it from the host’s perspective, it may see that a NIC port has come online and may (depending on configuration policy) start sending traffic down that link. If the switch is not yet ready, all those packets will be black-holed. Which is a bad thing. There are many other similar situations that we essentially want to avoid. A worst-case scenario is a host that has lost its redundant uplinks simultaneously: when one returns, it takes extra time before it is useful. If this happens on a VMware ESXi host that is part of a cluster, it may very well result in an “outage” long enough to trigger the HA response.
So, we don’t want that to happen.
We want any network connection to come back online as quickly as possible. Hence, we want to enable portfast. How this is accomplished varies between switch manufacturers; on Cisco Catalyst and Nexus switches it is configured per port. I would typically expect a good network admin to know how to make this happen.
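For illustration only, this is roughly what the configuration looks like on Cisco gear. Treat the interface names as placeholders, and note that if the host uplink is a trunk carrying tagged VLANs (common for ESXi), the trunk variants of the commands are needed instead:

```
! Catalyst (IOS) – access port facing a host
interface GigabitEthernet1/0/10
 switchport mode access
 spanning-tree portfast

! Nexus (NX-OS) – the equivalent "edge" port type
interface Ethernet1/10
 switchport mode access
 spanning-tree port type edge
```

On trunk ports the commands become `spanning-tree portfast trunk` (IOS) and `spanning-tree port type edge trunk` (NX-OS). Again, this is a sketch; the exact syntax depends on platform and software version.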
The purpose of this post is not to show you how to do this, but to show you, as a systems admin, how to verify that the network team has configured it so that it actually works the way you want.
How do you want it, and how do you test for it? It’s not that difficult, and it is much easier to verify if you have physical access to disconnect the network cables.
The ideal time to test this is during setup, when there is no workload on the host.
- The setup – a virtualisation host; it can be ESXi, Hyper-V or AHV (in fact, any physical server)
- Typically the network uplinks from the host to the switch have been teamed or bonded to provide redundancy, but this is also completely valid where only a single uplink is used
- Make sure you have the ability to do a ping either from within the host out, or from outside in
The procedure
1. Start with all links to be tested online
2. Start a continuous ping that traverses the link you want to test; there must be a response, and if not, troubleshoot until you get one
3. Physically disconnect all uplinks to be tested; the ping must start to time out, and if not, troubleshoot to figure out why
4. This step is key – reconnect one link; you should expect a ping response to return within one ping timeout, which should not be more than a couple of seconds. If portfast is not enabled, you would typically see at least 4-5 ping timeouts before any positive responses are observed. Take note of any bad links and ask the network team to fix them.
5. Disconnect the link and move on to the next one, repeating steps 4-5 until all links are completed.
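If you prefer not to eyeball the timeouts, the test loop above can be semi-automated. The sketch below is illustrative only: it assumes a Linux machine with iputils `ping` (the `-c` and `-W` flags), and the function names and target address are mine, not from any vendor tool. Start it, pull and reconnect the cable, then stop it with Ctrl-C to see how many pings were lost in the last outage.

```python
import subprocess
import time

def ping_once(host, timeout_s=1):
    """Send one ICMP echo via the system ping; True on a reply.
    Assumes Linux iputils ping (-c count, -W timeout in seconds)."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def timeouts_before_recovery(trace):
    """Given a ping trace (True = reply, False = timeout), return the
    length of the last completed outage, i.e. how many timeouts occurred
    immediately before replies resumed. With portfast you would expect
    0-2; without it, typically 4-5 or more while STP walks through its
    listening and learning states."""
    last_run = run = 0
    for ok in trace:
        if ok:
            if run:
                last_run = run
            run = 0
        else:
            run += 1
    return last_run

if __name__ == "__main__":
    target = "10.0.0.1"  # hypothetical address on the far side of the link under test
    trace = []
    try:
        while True:
            trace.append(ping_once(target))
            time.sleep(1)
    except KeyboardInterrupt:
        print(f"pings lost during last outage: {timeouts_before_recovery(trace)}")
```

A run of 4-5 or more lost pings on reconnect is the signature of a port sitting in STP’s listening/learning states; zero or one lost ping is what a correctly configured portfast/edge port looks like.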
After the network admin resolves the configuration gap, I would repeat the test for any links that failed previously.
From my experience, something simple like this can completely address strange behaviours where hosts go offline for short periods without good explanation. So, taking the time to verify portfast is crucial for a good deployment of any virtual environment.