Discovered an interesting issue at home today - when I ping a Nexus 9000v running in CML from an Ubuntu host, I see duplicate replies.
At first glance, you might think the Nexus is duplicating replies - meaning a single ICMP Echo Request packet enters the switch, and the Nexus sends two ICMP Echo Reply packets in response.
However, that's not the case. If you run Ethanalyzer on the mgmt0 interface of the Nexus, you can see that the Nexus receives two ICMP Echo Request packets with the same sequence number. Therefore, it generates two ICMP Echo Reply packets.
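For anybody who wants to reproduce this check, something along these lines works on the Nexus (a sketch - the exact options vary a bit between NX-OS releases, and the capture filter is standard BPF syntax):

    ethanalyzer local interface mgmt capture-filter "icmp" limit-captured-frames 0

Duplicate requests show up as back-to-back Echo Requests sharing the same ICMP sequence number.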

The Nexus is a victim of the problem, not the cause.
If we run tcpdump on the Ubuntu box's network interface, filtered on outbound packets destined to the Nexus, we can see the Ubuntu box is sending one ICMP Echo Request packet per ping. So the Ubuntu host isn't responsible for duplicating these packets.
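Concretely, a capture along these lines is enough (ens160 is a placeholder for whatever the VM's interface is actually named; 192.168.30.115 is the Nexus mgmt0 address in my lab):

    # Only outbound ICMP destined to the Nexus
    sudo tcpdump -n -i ens160 -Q out 'icmp and dst host 192.168.30.115'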
When I'm troubleshooting path-of-the-packet issues like this, I sometimes like to follow a "cocktail shaker" approach: I capture packets on one side of the flow (the source), then capture packets on the other side of the flow (the destination).
We know that duplicate ICMP Echo Request packets are causing this issue; we just don't know where they're coming from. The flow looks good at the source, but is broken at the destination. That means something in the middle is broken.
The next "hop" in this flow is the VMware ESXi standard vSwitch that the source connects to. The world ID for the virtual machine is 2101747, per this output.
The Port ID that corresponds with this virtual machine is 50331670, per this output.
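If you want to pull these values yourself from the ESXi shell, esxcli exposes both (a sketch using the World ID from my lab):

    # Map running VMs to their World IDs
    esxcli network vm list
    # List the vSwitch ports (and Port IDs) owned by that World ID
    esxcli network vm port list -w 2101747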
We can use this Port ID with ESXi's pktcap-uw command to capture packets that ingress the vSwitch from the virtual machine. In this command, I'm also filtering on IP protocol 1, which is ICMP.
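The capture looks roughly like this (a sketch - pktcap-uw takes the IP protocol number in hex, so ICMP is 0x01; a --dir filter can narrow it to a single direction, but I'm leaving that off here):

    # Capture ICMP on the Ubuntu VM's vSwitch port (Port ID from the earlier output)
    pktcap-uw --switchport 50331670 --proto 0x01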
If the Ubuntu VM's tcpdump were lying to me and the Ubuntu VM *is* duplicating ICMP Echo Request packets, I'd expect to see more than three packets ingress when I run "ping 192.168.30.115 -c 3" on the VM. However, I only see three packets, which is good! The Ubuntu VM is innocent.
Next, let's move to the CML VM where the Nexus is running. World ID is 2101027, which resolves to Port ID 50331674. Let's capture packets leaving the vSwitch towards CML to prove the vSwitch's innocence.
My test ping that reproduces this issue isn't running, and yet this capture is picking up other "noise" on the wire. This means we need to filter our packet capture facing the CML VM a bit further.
Let's add the --mac parameter filtered on our Ubuntu VM's MAC address to fix this.
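The refined capture looks something like this (the Ubuntu VM's MAC isn't shown here, so treat the value below as a placeholder):

    # Capture ICMP on the CML VM's vSwitch port, limited to frames to/from the Ubuntu VM's MAC
    pktcap-uw --switchport 50331674 --proto 0x01 --mac xx:xx:xx:xx:xx:xx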
Now, if I send three ICMP Echo Request packets from my Ubuntu VM...

...I see *nine* in the packet capture. Ubuntu says it received 7 ICMP Echo Reply packets (including 4 duplicates) - the discrepancy is because the ping command exits as soon as it receives the final expected Reply, so it doesn't stick around to count the remaining duplicates.
This seems to suggest that the standard vSwitch is duplicating these ICMP Echo Request packets for some reason. We can confirm this by looking at the ICMP sequence number within pktcap-uw's output, which is highlighted here.
If this were a production network, our gut instinct would be to open a support case with VMware at this point.

However, let's think about this a bit harder.

A vSwitch is *not* a normal switch. A key difference is that vSwitches do not dynamically learn MAC addresses.
This means that when a packet from the Nexus 9000v running in CML enters the vSwitch, the vSwitch doesn't learn the source MAC address of that packet, so there's nothing guaranteeing that future packets destined to the Nexus will egress the vSwitch port connected to the CML virtual machine.
Therefore, when the vSwitch receives a packet destined to a MAC address that doesn't correspond with a virtual machine connected to the vSwitch, the vSwitch will do what every switch does - it will unknown unicast flood that packet.
A key detail I've neglected to share until now is the fact that this vSwitch (vSwitch1) has multiple physical uplinks. vmnic1-3 connect to Gi1/0/14-16 of a Cisco Catalyst switch.
The Catalyst switch is a *real* switch, in that it *does* dynamically learn MAC addresses. Specifically, the Catalyst switch has learned the MAC of the Nexus 9000v switch (5254.000d.d27c) on Gi1/0/15.
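You can confirm that on the Catalyst with a quick MAC table lookup (standard IOS exec command):

    show mac address-table address 5254.000d.d27c

The output should show 5254.000d.d27c learned dynamically on Gi1/0/15.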
When the vSwitch unknown unicast floods, it's most likely flooding the ICMP Echo Request packet out of either vmnic1 (connecting to Gi1/0/14) or vmnic3 (connecting to Gi1/0/16). The switch then forwards this packet back to the vSwitch via Gi1/0/15 (where the MAC is learned).
Luckily, we don't have to guess - we can use pktcap-uw to confirm this theory.

The --dir parameter controls what direction of packets we'd like to capture. A value of 0 captures packets received by the uplink port from the network, 1 captures packets sent by the uplink port.
Sure enough, when we ping, we see three ICMP Echo Request packets egress vmnic3 towards Gi1/0/16 of the Catalyst switch.
If we capture packets that ingress vmnic2 from the network connecting to Gi1/0/15 of the Catalyst (where the Nexus MAC is learned), we can see the same ICMP Echo Request packets enter the vSwitch once more.
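Roughly, the two uplink-facing captures look like this (a sketch, reusing the --dir semantics described above):

    # Egress: ICMP the vSwitch sends out vmnic3 toward Gi1/0/16
    pktcap-uw --uplink vmnic3 --dir 1 --proto 0x01
    # Ingress: the same ICMP arriving back on vmnic2 from Gi1/0/15
    pktcap-uw --uplink vmnic2 --dir 0 --proto 0x01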
If this were a normal network, this would be an endless loop. However, I'm assuming that VMware implemented a mechanism where packets that ingress a vSwitch via an uplink port cannot egress via any other uplink port. In the Cisco world, we call this a "Deja-Vu check".
There are two possible fixes I can see for this issue:

1. Remove all redundant uplink ports from the vSwitch, such that the vSwitch only has a single uplink port.
2. Migrate CML to its own dedicated vSwitch with its own dedicated uplink port.
I opted for the 2nd option so that the rest of my VMs still have some sort of redundancy.
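For reference, the rough shape of that change from the ESXi shell (a sketch - the new vSwitch and port group names are my own choices, and which vmnic gets moved is up to you; the CML VM's vNIC then gets re-homed to the new port group in its VM settings):

    # Create a dedicated vSwitch for CML with a single uplink taken from vSwitch1
    esxcli network vswitch standard add --vswitch-name=vSwitch-CML
    esxcli network vswitch standard uplink remove --uplink-name=vmnic3 --vswitch-name=vSwitch1
    esxcli network vswitch standard uplink add --uplink-name=vmnic3 --vswitch-name=vSwitch-CML
    esxcli network vswitch standard portgroup add --portgroup-name=CML --vswitch-name=vSwitch-CML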

Sure enough, the issue is resolved!
Now that we've figured out the root cause of the issue and implemented a fix, we have one last question to answer - "How was this working before? Why did this break?"
A few weeks ago, our air conditioning broke in the middle of a heat wave. In response, I powered off my lab for a few days while the air conditioning was being fixed.

My lab only has a single ESXi host, which boots from an internal flash drive.
When I powered the host back on, I found that the flash drive had failed. I needed to replace it, reinstall ESXi, and import all of my virtual machines.

Prior to the outage, I used the vCenter Server Appliance (VCSA). After the outage, I decided not to use it anymore because I don't really need it.
When I used VCSA, the ESXi host connected to the Catalyst switch through an LACP port-channel (which requires a distributed vSwitch, a feature that vCenter unlocks). After VCSA was removed, I opted to remove the port-channel from both sides in favor of independent uplinks.
With a port-channel, the Catalyst switch would perform the Deja-Vu check when it receives a flooded ICMP Echo Request packet and drop the packet in hardware. This behavior would prevent this issue from occurring. Once the port-channel was removed, this issue was introduced.
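For context, the old Catalyst-side configuration looked roughly like this (reconstructed for illustration, not a copy of the running config; the channel-group number is arbitrary):

    ! Old setup: the three uplinks bundled into a single LACP port-channel
    interface range GigabitEthernet1/0/14 - 16
     channel-group 1 mode active

With all three links in one port-channel, the Nexus MAC is learned on the logical port-channel interface, so a flooded copy arriving on any member ingresses the same interface the destination is learned on and gets dropped instead of being hairpinned back toward the vSwitch.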
