This is a common scenario that comes up when discussing resilient network design and implementation between servers and switches. For this we need to have the system and network folks together in the room. More often than not, especially in large organisations, these folks are experts within their own domain but have little cross-domain knowledge. I’m a strong believer in being an expert in your domain while also picking up some knowledge of the adjacent domains you have to work with. I hope this post helps to build a bit of knowledge for both ends.
In nearly all enterprise environments, it’s almost a given that network cables carrying production traffic have some level of redundancy, most commonly a second link. Most people can explain that this protects against a link failure and keeps services running. For connectivity between network devices, all of this redundancy is fully managed by network engineers. When it comes to connections between servers and switches, it takes both sides to mutually understand the objectives and technologies to determine the best solution. When good cooperation is lacking, the result is a solution prescribed by the network team: one they are very familiar with for network devices, but which very often becomes complex to implement and maintain between servers and switches.
So what are the real options and what is the difference in each approach? I’ll try to explain that in this post.
Money is the root of all evil
Very often, the conversation between the teams ends up with the desire to have Active-Active connections. Why? Because the alternative, Active-Passive, seems to waste 50% of the investment in the network infrastructure. Ouch!! 50%!!
How true is this perception of wastage? Is it justifiable? What are the risks it entails? Let’s see.
Stating the Business Requirement
First we establish the business objective we want to satisfy. For the purpose of illustration, let’s use the following hypothetical example.
The business application AAA-Shopping-Portal needs to have at least 99.95% availability to support the operations of AAA Online Store. Customer experience must always be at its best and any performance degradation will be considered an outage. All application components for AAA-Shopping-Portal are fully virtualised and run as virtual machines.
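To put the 99.95% figure into perspective, here is a minimal sketch that converts an availability target into a downtime budget. The helper name `downtime_budget_hours` and the 365-day year are assumptions made for illustration; they are not part of any SLA definition.

```python
# Downtime budget implied by an availability target (illustrative helper).
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_budget_hours(availability_pct: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Maximum downtime, in hours, that still meets the availability target."""
    return period_hours * (1 - availability_pct / 100)


yearly_hours = downtime_budget_hours(99.95)
monthly_minutes = yearly_hours * 60 / 12

print(f"Per year:  {yearly_hours:.2f} hours")
print(f"Per month: {monthly_minutes:.1f} minutes")
```

That works out to about 4.38 hours a year, or roughly 22 minutes a month. Since this scenario counts any performance degradation as an outage, time spent running congested on a single surviving link would eat into the same budget.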
Translating to Technical Requirements
You may be tempted to point out that there are many components to worry about to provide such an SLA. True. However, let’s focus on the topic of physical networking for a hypervisor host.
With such a demanding SLA, we certainly cannot afford any single point of failure (SPOF). The standard response is to have N+1 everywhere. Sure. For networking, this means having as much redundancy as possible, from the host all the way until the packet leaves the enterprise.
- Redundant network uplinks from the hypervisor host to the network switch(es). (We know links can fail for many reasons, the most common being accidental handling of the physical connectivity.)
- Redundant cables mean the server must have at least 2 network ports. (Without diving into other network requirements, let’s say 2 connections are enough for this example.)
- Taking it one step further, network cards can fail too, so having a second network card is a valid consideration, although this is less common nowadays and some do accept a single card with multiple ports. (In recent years we have probably seen more software driver failures than physical card failures, so adding a card should be based on actual requirements.)
- On the other end, we will typically also see two switches, with each of the 2 cables going to a different switch. (This is highly recommended because switches can fail and, like servers, undergo regular maintenance. With the solution done right, allowing rolling maintenance on switches goes a long way towards keeping the business application online.)
I’ve also deliberately added the requirement about performance, which more often than not is not explicitly mentioned. Most just assume best performance, but it’s nothing more than an assumption. Why is it important to spell this out? Because there are situations where performance degradation during a failure is acceptable. In some scenarios, being slow but available is a valid pain to endure during a failure. Hence, it is always important to ask: if a failure occurs, is lower performance acceptable? I’ll come back to this point about degradation in a while.
For this scenario, let’s list out some decisions we are making, as well as hypothetical constraints on existing equipment and environmental standards.
- Every hypervisor host will have two 10Gbps network uplinks to the Network Infrastructure
- Each hypervisor host will have two single-port NICs, for maximum redundancy
- There are existing pairs of Top of Rack (TOR) switches that can be used
- Network switches have SFP+ ports and fibre cabling is the standard for the environment
- Server to Switch cabling will be within the rack and each host will have a single uplink to each switch
So far that’s the easy part. Next, let’s look at the possible logical network configurations we can apply with the above.
Choices for Logical Network Configurations
Active-Active is one of the most abused terms in IT, and it can have many different interpretations. From the perspective of many network engineers, this means setting up link aggregation (nowadays more often with LACP) between the Servers and Switches, because that is the approach they often apply to their own Switch to Switch connections. This is easy on their end as they do it regularly and they control the configuration of both switches. However, it gets trickier when the devices at the two ends are managed by different parties; in this case Network Engineers and System Engineers.
What experience tells us is that setting up link aggregation between Servers and Switches is inherently more complex, difficult to get right the first time and just as tough to maintain. This means the overall skill level at both ends has to go up.
Active-Active with Mac Pinning
The first option is a “switch independent” form of Active-Active connections. Independent in the sense that there is no need to configure the physical switches for link aggregation towards the servers. This is when many network engineers get rightfully worried. If we draw out the full diagram, extending into the virtual networking, here’s what we get.
To a network engineer, the simple diagram above appears to violate many of their fundamental principles, especially without any form of link aggregation.
- There is an apparent loop.
  - This would be true if the Virtual Switch were in fact a physical switch, and loops will bring down physical networks.
  - The answer to the network engineer is that virtual switches are designed not to cause loops, and therefore the typical physical switch considerations do not apply.
- Loop storms aside, the other related concern is MAC flap.
  - MAC flap is where a switch observes Ethernet frames with the same source MAC address arriving at two different switch ports.
  - By default, that hints at a loop; again, a bad thing.
  - Switches have protection mechanisms to block traffic due to loops, and MAC flap is one of the signs those mechanisms monitor for.
  - The answer to the network engineer is that this is a different sort of Active-Active load balancing. It is active in the sense that both links pass traffic, but each MAC address is sticky to a fixed port, so individual MAC addresses do not flap between ports. The collective usage by many virtual workloads is what makes both links appear active.
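The stickiness argument can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `PinnedTeam` class, the uplink names and the round-robin assignment of new MAC addresses are made up for illustration and do not represent how any particular hypervisor chooses an uplink. The point is simply that once a MAC is pinned, the upstream network never sees it move.

```python
from itertools import cycle

UPLINKS = ["uplink-A", "uplink-B"]  # hypothetical uplink names


class PinnedTeam:
    """Assigns each virtual NIC MAC to one uplink and keeps it there."""

    def __init__(self, uplinks):
        self.uplinks = list(uplinks)
        self.pinned = {}  # MAC address -> uplink

    def uplink_for(self, mac):
        if mac not in self.pinned:
            # New MACs are spread round-robin; an assumption for this sketch.
            self.pinned[mac] = self.uplinks[len(self.pinned) % len(self.uplinks)]
        return self.pinned[mac]


def count_mac_moves(frames):
    """Count how often a source MAC appears on a different uplink than before."""
    last_seen = {}
    moves = 0
    for mac, uplink in frames:
        if mac in last_seen and last_seen[mac] != uplink:
            moves += 1
        last_seen[mac] = uplink
    return moves


vm_macs = ["00:00:5e:00:53:01", "00:00:5e:00:53:02", "00:00:5e:00:53:03"]

# Three rounds of traffic with pinning: each MAC stays on its own uplink.
team = PinnedTeam(UPLINKS)
pinned_frames = [(mac, team.uplink_for(mac)) for _ in range(3) for mac in vm_macs]

# The same traffic sprayed per frame across both uplinks, with no pinning.
spray = cycle(UPLINKS)
sprayed_frames = [(mac, next(spray)) for _ in range(3) for mac in vm_macs]

print("uplinks in use with pinning:", sorted({u for _, u in pinned_frames}))
print("MAC moves with pinning:     ", count_mac_moves(pinned_frames))
print("MAC moves without pinning:  ", count_mac_moves(sprayed_frames))
```

With pinning, both uplinks carry traffic and the move count stays at zero. Spraying frames across the uplinks without pinning makes every MAC bounce between ports, which is exactly the flapping pattern the switch protection mechanisms watch for.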
This option is one of the most popular ways to set up network load balancing from the server end, as it has lower complexity and allows “maximum usage” of all available connections.
Implication for Virtual Workloads
With this choice, each virtual machine has access to at most the bandwidth of one uplink; in this case, 10Gbps. For the host as a whole, however, a total of 20Gbps is available to be shared across multiple virtual machines. Very often, this is more than enough.
Other Considerations for Active-Active with Mac Pinning
Is this choice to actively expose all available bandwidth for use necessarily a good thing? Remember the point about no performance degradation? It means that even during an uplink failure, network performance must not be compromised. Taking a few steps back, it implies that even without a failure, the utilisation of each uplink must not exceed 50%. Doesn’t running at only 50% utilisation mean we are not fully utilising our investment? In a way, it is the same as an Active-Passive set up. I think the saving grace is that, more often than not, network utilisation tends not to be that high, and so many get away with this. Is your organisation ready to take this risk? Do your risk assessment. If the organisation can’t take the risk, how do you guarantee that the user experience during a failure is the same as in a normal situation?
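The 50% rule can be written down as a quick check. This is a minimal sketch, assuming equal-capacity uplinks and load figures you would take from your own monitoring; the function name `survives_single_failure` is hypothetical. Strictly, the condition is that the combined load fits on the surviving uplink, which keeping each uplink under 50% guarantees.

```python
LINK_CAPACITY_GBPS = 10.0  # per-uplink capacity from the scenario


def survives_single_failure(uplink_loads_gbps, capacity=LINK_CAPACITY_GBPS):
    """True if the surviving uplinks can carry the whole load after one uplink fails."""
    total_load = sum(uplink_loads_gbps)
    surviving_capacity = capacity * (len(uplink_loads_gbps) - 1)
    return total_load <= surviving_capacity


print(survives_single_failure([4.0, 4.5]))  # 8.5Gbps in total fits on one 10Gbps link
print(survives_single_failure([6.0, 5.0]))  # 11Gbps in total does not
```

Running a check like this against peak utilisation rather than averages is the more cautious choice for a scenario where degradation counts as an outage.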
Summary of Active-Active with Mac Pinning
- (Pro) Easy to Set Up for both Network Engineers and System Engineers
- (Pro) Virtual machines collectively have access to all physical bandwidth (each individual virtual machine is still limited to one uplink)
  - (Risk) Bandwidth utilisation can exceed the capacity of one link
- (Pro) Load Balancing is fully controlled from the Hypervisor Host
  - (Risk) The System Engineer may pick the wrong Load Balancing algorithm, which can result in suboptimal performance or, worse, flaky connectivity.
- (Con) This needs to be effectively communicated to Network Engineers for them to be comfortable that this is a valid solution. Hopefully this post can help with this challenge.
- (Con) Communications between hosts may not be as efficient, as different hosts may be passing traffic up to different switches. Hence increased dependency on the inter-switch links (ISL). (This may make more sense after the next section.)
Active-Active with Link Aggregation
This is another option, and there are a few things to consider. This is a “switch dependent” solution, meaning the switch(es) must be configured in a fully compatible way with the host. This is what network engineers are typically used to. This method has two variants, Static Link Aggregation and Dynamic Link Aggregation. Let’s have a look.
- For a start, additional configuration must be performed on the network switches to (1) create the new channels, and (2) pick the exact pairs of ports to be added to each channel. (added effort and complexity)
- Cabling from the switch ports to the servers must be perfectly accurate. Any wrong connection can mean a world of pain in tracing the issue, depending on the skill level of both the network and system engineers. (added effort and complexity)
- A decision needs to be made whether to use static link aggregation or LACP-assisted dynamic link aggregation. (Read my earlier blog post about them.)
- If this approach is the chosen option, link aggregation with LACP is generally preferred.
Active-Active with Dynamic Link Aggregation (with LACP)
Here let’s dive a little into hypervisor specifics around LACP related configuration.
- If the environment is using VMware ESXi, it means having to spend the money on the highest license level, Enterprise Plus. (cost implication) For anyone who has set up a VMware Distributed vSwitch with LACP uplinks, it is not a simple task; it is risky and things can go wrong. (added effort and complexity) While the dvSwitch helps to simplify network management, setting up LACP is still a per-host activity.
- On Microsoft Hyper-V, LACP is also supported. In my limited experience with regular NIC Teaming (LBFO), it is much easier to configure than on ESXi, and doesn’t cost extra. (I’m not sure how it is with the new Switch Embedded Teaming (SET) or Logical Switches with SCVMM.)
- On Nutanix AHV, Active-Active with LACP is also easy to configure. From AOS 5.11, there’s a simple wizard in Prism to configure it for all nodes throughout the cluster. With AOS 5.10 and earlier, it’s a mere few CLI commands to execute. (Saving this for another blog post later.)
Alright, so LACP set up has its own complexity, but overall, it is the lesser of two evils, compared with Static Link Aggregation.
Active-Active with Static Link Aggregation
I would normally avoid this as much as possible. Here is why. This option requires both ends to be set up on the assumption that
- cabling is done perfectly
- both switches and hosts are perfectly configured
As this method operates on assumptions, it is critical during implementation that appropriate testing is performed to confirm that connectivity is working end-to-end.
The fundamental steps above look similar to an LACP set up. Here’s the key difference: if an LACP set up fails to negotiate between both ends, the aggregated link will not form. The links will either fail completely or fall back to being individual links, depending on switch capabilities and configuration. With LACP, there is a clear indication of whether both ends have formed an aggregated link. There’s no simple indication for Static Link Aggregation set ups. Hence the point above about testing.
Link Aggregation Implications for Virtual Workloads
With this choice, virtual machines have access to the combined bandwidth of the two uplinks; in this case, 20Gbps. This can only be achieved when a virtual machine has multiple network sessions that terminate outside the host, and each session is still capped at the limit of one uplink. A common justification question is “are there any virtual workloads that need more bandwidth than one link provides?” Here is the risk in this question. If the answer is yes and we have only two uplinks, it inherently carries the risk of performance degradation when one link fails. This means that for our scenario, where lower performance is never acceptable, we should not be choosing Link Aggregation for the higher available bandwidth. A more appropriate choice would be to go for faster individual links instead.
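The per-session cap follows from how an aggregated link distributes traffic: each session is mapped to one member link and stays there. The sketch below uses a deliberately simple toy hash (an XOR of the two port numbers) that is an assumption for illustration only; real hosts and switches hash different and usually more header fields, and the `Flow` tuple, addresses and member names are hypothetical.

```python
from collections import namedtuple

Flow = namedtuple("Flow", "src_ip dst_ip src_port dst_port")

LAG_MEMBERS = ["uplink-A", "uplink-B"]  # hypothetical member names
LINK_GBPS = 10


def lag_member(flow, members=LAG_MEMBERS):
    """Pick a member link from the flow's ports; the same flow always gets the same link."""
    # Toy hash for illustration only; real implementations hash more fields.
    return members[(flow.src_port ^ flow.dst_port) % len(members)]


def max_vm_throughput_gbps(flows, members=LAG_MEMBERS, link_gbps=LINK_GBPS):
    """Upper bound for a VM: one link's worth per distinct member its flows land on."""
    return link_gbps * len({lag_member(f, members) for f in flows})


one_session = [Flow("10.0.0.5", "10.0.1.9", 50000, 443)]
two_sessions = one_session + [Flow("10.0.0.5", "10.0.1.9", 50001, 443)]

print(max_vm_throughput_gbps(one_session))   # a single session is capped at one uplink
print(max_vm_throughput_gbps(two_sessions))  # sessions on different members can add up
```

A workload with one large session therefore gains nothing from the second link, and two sessions that happen to hash to the same member share a single 10Gbps link.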
Other Considerations for Active-Active Link Aggregation
As with Active-Active with Mac Pinning, making the overall bandwidth available to the workloads opens up the risk of utilisation beyond 50%.
A Hidden Gem
As with all things in life (and IT), there are always trade-offs and benefits to each choice. The largest trade-off of this approach is increased complexity. What benefit does link aggregation bring?
One little-known behaviour of Link Aggregation (both static and dynamic) is that it can significantly improve the efficiency of inter-host communications, particularly when the hosts are connected to the same switches. (Detailed explanation for another post.) This can be very beneficial where workloads are distributed across several hosts in a cluster and exchange a good amount of network traffic. Nutanix Hyperconverged Infrastructure (HCI) falls into this category. If all the nodes in a Nutanix cluster are hooked up to a pair of switches, and Link Aggregation is set up for all hosts, all Layer 2 (L2) host to host communication will be switch local. There is no need for network packets to traverse the inter-switch links (ISL) to get from one host to another.
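A rough way to picture the switch-local behaviour is to record, for each host, which switches can deliver a frame to it on a local port. The model below is a simplified sketch and not a description of any vendor's implementation: it assumes a switch always prefers a local port when it has one, and the host and switch names are hypothetical.

```python
from itertools import permutations


def count_isl_crossings(reachability):
    """Count host-to-host paths that must cross the inter-switch link.

    reachability maps each host to the set of switches that have a local,
    traffic-carrying port towards it. A frame enters the fabric at one of the
    sender's switches and crosses the ISL only if that switch has no local
    port towards the receiver.
    """
    crossings = 0
    for src, dst in permutations(reachability, 2):
        for entry_switch in reachability[src]:
            if entry_switch not in reachability[dst]:
                crossings += 1
    return crossings


# Mac pinning or active-passive: each host is reached through one switch only.
single_homed = {"host1": {"sw1"}, "host2": {"sw2"}, "host3": {"sw1"}}

# Link aggregation across the switch pair: every host is local to both switches.
aggregated = {host: {"sw1", "sw2"} for host in single_homed}

print(count_isl_crossings(single_homed))  # some pairs depend on the ISL
print(count_isl_crossings(aggregated))    # everything stays switch local
```

In the single-homed case, every pair of hosts whose active paths land on different switches depends on the ISL, while in the aggregated case no path does.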
Summary of Active-Active with Link Aggregation (Static and Dynamic)
- (Pro) Full network bandwidth is available to virtual machines
  - (Risk) Bandwidth can be over utilised beyond 50%
- (Pro) Allows efficient host to host communication within the same switch pair. Cuts down ISL utilisation, and therefore the dependency on ISL bandwidth.
- (Con) More complex to set up. Network Engineers may be familiar with it, but System Engineers often are not. Overall solution complexity goes up a few levels.
- (Con) Network switches must also be set up in a way that allows Link Aggregation towards the hosts.
- (Con) Day 1 set up is challenging, and Day 2 troubleshooting is even more challenging. If something goes wrong, the symptoms are often obscure.
- (Pro) Dynamic with LACP is superior to Static, and should be the default choice when going with Link Aggregation.
Active-Passive
This option is one of the simplest to set up and manage. There is, however, a common misconception. This set up is also “switch independent”, meaning no additional or special set up is required on the network switches to support it. All configuration is done from the host end only. The misconception is that the switches are the ones in active-passive mode, and that is incorrect. Switches are always forwarding packets. There may be roles between the switch pair that are active-passive, but that’s a switch management topic and has nothing to do with packet flow. Where it is truly active-passive is from the host perspective: the host only actively sends network traffic up one uplink. The other uplink is online, but no traffic is sent through it. From the switch perspective, both links are online; however, the switch will learn MAC addresses on the active link and none on the other.
Active-Passive Implications for Virtual Workloads
With this choice, virtual machines will have at most access to the bandwidth of one uplink. In this case, 10Gbps. The overall bandwidth that the host will use is also 10Gbps. The passive uplink will become active when there is a failure of the active uplink, or there is an administrative change on the host.
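The host-side behaviour can be sketched in a few lines. This is an illustrative model only: the `ActivePassiveTeam` class and uplink names are hypothetical, and it assumes the host fails back to the preferred uplink once that link returns, which in practice is a configurable choice.

```python
LINK_GBPS = 10


class ActivePassiveTeam:
    """Two uplinks; traffic only ever uses the first healthy one in the list."""

    def __init__(self, preferred, standby):
        self.order = [preferred, standby]
        self.link_up = {preferred: True, standby: True}

    def set_link(self, uplink, is_up):
        self.link_up[uplink] = is_up

    def active_uplink(self):
        for uplink in self.order:
            if self.link_up[uplink]:
                return uplink
        return None  # both uplinks are down

    def usable_gbps(self):
        # Never more than one uplink's worth, with or without a failure.
        return LINK_GBPS if self.active_uplink() else 0


team = ActivePassiveTeam("uplink-A", "uplink-B")
before = (team.active_uplink(), team.usable_gbps())

team.set_link("uplink-A", False)  # simulate a failure of the active uplink
after = (team.active_uplink(), team.usable_gbps())

print(before)
print(after)
```

The usable bandwidth is the same before and after the failure; only the uplink carrying the traffic changes.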
Other Considerations for Active-Passive Configurations
Unlike the Active-Active configurations, where there is a risk of over utilising the uplinks, the active-passive solution can be configured to strictly never exceed one uplink. Among the options presented in this post, this is the only configuration that does not rely on other techniques to guarantee consistent network performance even during a network failure. Consistent in the sense that it should be the same as when both links are online.
Summary of Active-Passive Configuration
- (Pro) Switch independent set up that doesn’t need any special configuration by Network Engineers.
- (Pro) Also easy to set up for System Engineers, although the effort varies with different hypervisors.
- (Pro) Ongoing operations of this configuration are straightforward to troubleshoot and rectify.
- (Con) Communications between hosts may not be as efficient as different hosts may be passing traffic up to different switches. Hence increased dependency on the ISL.
- (Risk) If overall traffic bursts in a host saturate 10Gbps, the other uplink will still not be utilised. This can also be useful, as it makes capacity planning easier: it is clear that 10Gbps is insufficient and an upgrade may need to be considered.
There can be many factors to consider in determining the most appropriate host network set up. I don’t believe in a single best practice, as there is never a one-size-fits-all solution. Always understand what the requirements, constraints and risks are, so as to make the best choice for the scenario.
Making the final decision involves a collective effort from both Network and Systems, looking not only at technical features but also at the combined skillsets. While Link Aggregation with LACP may seem to deliver the most bandwidth to a workload, if the engineers supporting the environment are insufficiently skilled, it can do more harm than good.