A common misunderstanding engineers have about Equal-Cost Multi-Pathing (ECMP) and port-channels is that they increase the bandwidth that can be used between two network devices. This *can* be true, but isn't *always* true.

Curious why? 🧵
First, let's review our topology. Three Cisco Nexus switches are connected in series. Traffic generators are connected to Switch-1 and Switch-2 through physical interface Ethernet1/36. Switch-1 and Switch-2 connect to Router through Layer 2 port-channels.
As the names suggest, Switch-1 and Switch-2 are purely Layer 2 switches. Router is a router that routes between two networks - 192.168.1.0/24, and 192.168.2.0/24. The traffic generator mimics four hosts - two in 192.168.1.0/24, two in 192.168.2.0/24.
Let's say Host-1 (192.168.1.10) sends a packet to Host-3 (192.168.2.10). Switch-1 will inspect the destination MAC address of this packet and know it needs to forward it out of interface port-channel1. However, how does Switch-1 know whether to use Ethernet1/1 or Ethernet1/2?
Switch-1 will use a process called "hashing" to figure out this problem. Hashing is when a network device selects one physical interface out of a group to use when forwarding a specific packet.
The method a network device uses to perform hashing is called a "hashing algorithm". Hashing algorithms are deterministic - meaning, given the same set of inputs, the outcome of the algorithm will always be the same.
Hashing algorithms are platform-specific (meaning, they will change depending on the type of switch) and are considered proprietary (meaning, network vendors like Cisco, Juniper, and Arista will never release the hashing algorithm they use in their products to the public)
One set of inputs for the hashing algorithm is the number of interfaces in the port-channel/ECMP group. Another set of inputs is information about the packet that needs to be forwarded, which we call a "flow".
Simply put, a "flow" is a set of attributes that uniquely identify one packet from other packets. To use the same terms defined by Cisco's NetFlow protocol, each attribute is called a "field".
Each network device in the path of a particular set of traffic can define a flow differently using a tuple of fields called "key fields".
In other words, one device may say that a unique flow consists of the source and destination MAC address key fields, while another device may say that a unique flow consists of the source and destination MAC address and IP address key fields.
For a practical example of this concept, let's two ICMP Echo Request packets - one sent from Host-1 to Host-3, and another sent from Host-1 to Host-4. If we were to capture these packets as they enter Switch-1, their headers would look like this.
As you can see, the only difference between these two sets of packets is the destination IP address.
However, if Switch-1 considers only the source and destination MAC address fields to be key fields, then SWITCH-1 would consider both traffic patterns to be identical flows, even though they're destined to two different IP addresses.
Most network devices have a command that let's you see what key fields it considers in its hashing algorithm. On NX-OS, this command is "show port-channel load-balance". On Cisco IOS, this command is "show etherchannel load-balance".
Note that even though Switch-1 is a Layer 2 switch, its Application-Specific Integrated Circuit (ASIC) is perfectly capable of parsing Layer 3 and Layer 3 headers of incoming packets in addition to Layer 2 headers.
Just because Switch-1 is a Layer 2 switch doesn't mean it isn't able to read/understand IP addresses and TCP/UDP headers - it just means it won't make *forwarding decisions* based on Layer 3 or Layer 4 headers. However, Switch-1 will factor this data into its hashing algorithm.
In our topology, both Host-1 and Host-2 are continuously sending traffic to both Host-3 and Host-4 in a full-mesh fashion. We can use interface counters on Switch-1 to see how traffic is hashing on Switch-1.
As you can see, all traffic is traversing Ethernet1/1 of Switch-1. Ethernet1/2 is completely unused, even though it's up/up and bundled within port-channel1.

What's going on here? Is this a bug? Do I refund my expensive data center switches, since they're clearly broken?
Nope! This is 100% expected behavior. Simply put, we're getting "lucky" (or unlucky, depending on your perspective) with the hashing algorithm. All of our traffic happens to hash to the same physical interface within the bundle - an event we like to call "polarization".
This means that even though two 10Gbps interfaces are bundled together, you may not get 20Gbps of bandwidth out of the bundle. If the majority of your flows (or at least your *high-bandwidth flows*) hash to one member of the port-channel, you're still oversubscribed.
In my scenario, the packets generated by the traffic generator do not have a Layer 4 header.
Since Switch-1's hashing algorithm can use Layer 3 *and* Layer 4 headers, let's add a UDP header with randomized ports to add some entropy to our traffic flow.
Now that we have a Layer 4 header to work with, we can see the traffic is very nicely load-balanced across Ethernet1/1 and Ethernet1/2! With this traffic pattern, we should be able to squeeze 20Gbps of bandwidth out of this port-channel.
Next, you may want to know how *exactly* hashing works. As stated before, the exact hashing algorithm used by a switch is proprietary information.
However, at a high level, the switch will do some quick math on the relevant headers of the packet (selected based upon the port-channel's load-balancing algorithm) and the algorithm will spit out an integer number. Let's say the number the algorithm spits out is between 0 and 7.
Internally, the network device will assign each physical interface in the port-channel or ECMP group a number. In our example, Ethernet1/1 may be internally assigned 0, 1, 2, and 3. Ethernet1/2 may be internally assigned 4, 5, 6, and 7.
If the number spit out by the algorithm is 0, 1, 2, or 3, then the switch will forward the packet out of Ethernet1/1. If the number spit out by the algorithm is 4, 5, 6, or 7, then the switch will forward the packet out of Ethernet1/2.
In the case of unequal-cost multipathing (also called "unequal-cost load balancing" by some technologies, such as EIGRP), you will have two paths.
Path 1 might have a lower metric, so internally it'll be given hashes 0 and 1. Path 2 might have a higher metric, so internally it'll be given hashes 2, 3, 4, 5, 6, and 7.
If the number spit out by the algorithm when run against a flow is 0 or 1, then Path 1 will be taken. If the number spit out by the algorithm when run against a flow is 2, 3, 4, 5, 6, or 7, then Path 2 will be taken.
This algorithm is not aware of the *amount* of traffic that belongs to a given flow. The flow could be a few kilobits per second, which would probably be fine to flow down Path 1. It could also be several gigabits per second, which could cause congestion further down the path.
The router doesn't keep track of how much traffic is present in a flow (or at least, if it does via sFlow or NetFlow, the data gathered by that technology does not impact the router's forwarding decision made on that packet/flow.)
The way load-balancing works through hashing is why vendors recommend that when you use ECMP or port-channels, you have a number of interfaces equal to some power of two (such as 2, 4, 8, 16, etc.) within the ECMP or port-channel.
Using some other number (such as 3, 6, etc.) will result in one or more interfaces being internally assigned less hash values than other interfaces, resulting in unequal-cost load balancing.

• • •

Missing some Tweet in this thread? You can try to force a refresh
 

Keep Current with Christopher Hart

Christopher Hart Profile picture

Stay in touch and get notified when new unrolls are available from this author!

Read all threads

This Thread may be Removed Anytime!

PDF

Twitter may remove this content at anytime! Save it as PDF for later use!

Try unrolling a thread yourself!

how to unroll video
  1. Follow @ThreadReaderApp to mention us!

  2. From a Twitter thread mention us with a keyword "unroll"
@threadreaderapp unroll

Practice here first or read more on our help page!

More from @_ChrisJHart

17 Oct
"I see a lot of packet loss when I ping my switch" 🚩🚩🚩🚩🚩🚩🚩🚩🚩🚩🚩🚩

Wait, why is this a red flag? Let's dig into this behavior in a bit more detail... 🧵
First, let's take a look at our topology. We have two hosts in different subnets that connect to a Cisco Nexus 9000. One host connects via Ethernet1/1, and the other connects via Ethernet1/2. Ethernet1/1 has an IP of 192.168.10.1, while Ethernet1/2 owns 192.168.20.1.
The architecture of most network devices has three "planes" - a data plane, a control plane, and a management plane. We'll focus on the first two. The data plane handles traffic going *through* the device, while the control plane handles traffic going *to* the device.
Read 19 tweets
15 Oct
On Cisco Nexus switches in production environments, avoid working within a configuration context on the CLI unless you're actively configuring the switch. Otherwise, you might accidentally cause an outage by trying to run a show command.

Curious how that's possible? 🧵
Cisco IOS and IOS-XE require you to prepend show commands with the "do" keyword to execute them within a configuration context.
NX-OS does not require you to do this - you can run a show command within any configuration context without the "do" keyword.
Read 12 tweets
4 Sep
Discovered an interesting issue at home today - when I ping a Nexus 9000v running in CML from an Ubuntu host, I see duplicate replies.
At first glance, you might think the Nexus is duplicating replies. Meaning, a single ICMP Echo Request packet enters the switch, and the Nexus sends two ICMP Echo Reply packets.
However, that's not the case - if you run Ethanalyzer on the mgmt0 interface of the Nexus, the Nexus sees two ICMP Echo Request packet enters with the same sequence number. Therefore, it generates two ICMP Echo Reply packets.

The Nexus is a victim of the problem, not the cause.
Read 32 tweets

Did Thread Reader help you today?

Support us! We are indie developers!


This site is made by just two indie developers on a laptop doing marketing, support and development! Read more about the story.

Become a Premium Member ($3/month or $30/year) and get exclusive features!

Become Premium

Too expensive? Make a small donation by buying us coffee ($5) or help with server cost ($10)

Donate via Paypal Become our Patreon

Thank you for your support!

Follow Us on Twitter!

:(