Christopher Hart
Oct 17, 2021 • 19 tweets • 6 min read
"I see a lot of packet loss when I ping my switch" 🚩🚩🚩🚩🚩🚩🚩🚩🚩🚩🚩🚩

Wait, why is this a red flag? Let's dig into this behavior in a bit more detail... 🧡
First, let's take a look at our topology. We have two hosts in different subnets that connect to a Cisco Nexus 9000. One host connects via Ethernet1/1, and the other connects via Ethernet1/2. Ethernet1/1 has an IP of 192.168.10.1, while Ethernet1/2 owns 192.168.20.1.
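If you want to recreate this in a lab, the switch side boils down to two routed interfaces. A minimal configuration sketch (the thread only gives the interface IPs, so the /24 masks are my assumption):

  interface Ethernet1/1
    no switchport
    ip address 192.168.10.1/24
    no shutdown
  interface Ethernet1/2
    no switchport
    ip address 192.168.20.1/24
    no shutdown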
The architecture of most network devices has three "planes" - a data plane, a control plane, and a management plane. We'll focus on the first two. The data plane handles traffic going *through* the device, while the control plane handles traffic going *to* the device.
If we visualize the data plane and control plane of our switch within our topology, it would look like this. Notice how the control plane connects to the data plane through an inband interface. Also notice how the control plane hosts various software processes, such as ICMP.
ICMP traffic between the two hosts flows through the data plane of the switch. This makes sense, because traffic between the two hosts will go *through* the switch - it is not destined *to* the switch.
However, what if the switch gets an ICMP Echo Request packet destined to itself (e.g. 192.168.10.1)? The data plane will recognize that the switch itself owns IP 192.168.10.1 and forward the packet to the control plane's inband interface. This action is called a "punt".
When the control plane receives this packet through the inband interface, it will inspect it and "route" it to the ICMP software process so that the ICMP process can handle it accordingly.
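Side note: if you want to watch punted packets actually arrive at the control plane, NX-OS ships with Ethanalyzer, a built-in Wireshark-style capture tool. A rough sketch (the display filter uses Wireshark syntax, and I'm assuming the default inband capture point):

  switch# ethanalyzer local interface inband display-filter icmp limit-captured-frames 10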
The ICMP software process will most likely generate an ICMP Echo Reply packet, which is sent out through the control plane's inband interface, dequeued by the data plane, and forwarded back out of Ethernet1/1 towards the host.
However, what if the switch was receiving a *lot* of ICMP traffic all at once? For example, a malicious actor may be sending the switch more ICMP traffic than it can handle, or maybe a network monitoring system (NMS) is aggressively monitoring switch reachability through ICMP.
This could clog the inband interface and control plane with unnecessary traffic or cause high CPU utilization, which would inadvertently affect other control plane protocols (such as BGP, OSPF, Spanning Tree Protocol, etc.) and cause instability in the network.
This is where the concept of "Control Plane Protection" comes in. We need a mechanism to rate limit the amount of control plane traffic sent to a network device so that the control plane of the network device does not get overwhelmed with traffic.
On Cisco Nexus switches, Control Plane Protection is primarily implemented through "Control Plane Policing" - a feature better known by its acronym, CoPP. This is enabled by default, but you can confirm it's configured with the "show copp status" command.
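For reference, the check looks like this (run from any exec prompt; the output should also list which CoPP policy is attached to the control plane):

  switch# show copp status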
CoPP is implemented within the data plane of the switch and enables the data plane to drop a specific class of traffic if the rate of traffic for that class exceeds a threshold. It's essentially a QoS (Quality of Service) policer for the control plane.
The output of the "show policy-map interface control-plane" command can show us how classes in the CoPP policy are organized, what type of traffic corresponds with each class, and each class's policer rate.
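That output is long, so it helps to jump straight to the class you care about using NX-OS output filtering. A rough sketch (any class name from the policy works in place of the monitoring class):

  switch# show policy-map interface control-plane | begin copp-system-p-class-monitoring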
In our scenario, the "copp-system-p-class-monitoring" CoPP class handles ICMP traffic. We can see that by default, a 360 kbps CIR (Committed Information Rate) of ICMP traffic is allowed with a 128 kilobyte Bc (committed burst).
We can also see that 32 megabytes of ICMP traffic has been allowed by this CoPP class, while about 279 megabytes of ICMP traffic has been dropped by this CoPP class since the last time the counters were cleared. Clearly, this switch is being *blasted* with ICMP traffic!
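If you're troubleshooting and want a clean baseline, you can zero the CoPP counters and watch how quickly they climb again (standard NX-OS command; the conformed/violated byte counts in the class output start from zero afterwards):

  switch# clear copp statistics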
As it turns out, I had another SSH session open to my host that was mindlessly slamming the Nexus with ICMP traffic. Oops, silly me! Thankfully, this isn't production! 😅
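For the curious, the offending session was just an aggressive ping loop. Something along these lines would reproduce it (my exact command isn't shown in the thread, so treat this as an illustration; with iputils ping, -i 0.01 sends a packet every 10 ms and needs root):

  host$ sudo ping -i 0.01 192.168.10.1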
Now that I've stopped the other ping command, we can see that my original ping works as expected with no packet loss.
What's the moral of this story? There is a fundamental difference between data plane traffic and control plane traffic. Intermittent packet loss when pinging a network device itself is not necessarily a symptom of packet loss for traffic flowing *through* that device between hosts on the network.
