Control plane learning using static/multicast/EVPN type 2

Forming a VXLAN tunnel using static entries


  1. Assume L1 and L4 are configured with static ARP entries, i.e. L1 and L4 have already learnt each other's MAC information.
  2. When a host behind L1 tries to reach a host behind L4, L1 encapsulates the frame and forwards it to L4. Check out how switch L1 formats the frame.


Therefore, the tunnel endpoints sit in the 100.100.100.x underlay network and communicate with each other.
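To make the frame format a little more concrete, here is a minimal sketch of the 8-byte VXLAN header that a VTEP such as L1 places in front of the original frame. The VNI value 5000 and the function names are assumptions for illustration only; the outer Ethernet, IP (100.100.100.x) and UDP headers are left to the switch and are not modelled here.

```python
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned UDP destination port for VXLAN


def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    flags = 0x08  # I flag set: the VNI field is valid
    # Word 1: flags (8 bits) + reserved (24 bits)
    # Word 2: VNI (24 bits) + reserved (8 bits)
    return struct.pack("!II", flags << 24, vni << 8)


def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the VXLAN header to the original (inner) Ethernet frame."""
    return vxlan_header(vni) + inner_frame


header = vxlan_header(5000)  # VNI 5000 is a hypothetical value
print(header.hex())
```

The inner frame travels untouched; only the outer headers identify the tunnel endpoints.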

As the network grows, it becomes difficult to configure and maintain every static tunnel endpoint, so we can use multicast to scale and learn thousands of remote VXLAN tunnel endpoints.

With the same topology, let's explore using multicast to learn the tunnel endpoints!


  1. Assume multicast is enabled across the spine and leaf topology.
  2. When a host behind L1 tries to reach a host behind L4, a tunnel has to be established between L1 and L4.
  3. The multicast capture below shows how L1 learns the remote MAC address of L4 using the destination multicast IP and the multicast MAC address 01:00:5e:00:00:58.


It is important to note from the capture above that leaf L1, in the 100.100.100.x network, uses the destination multicast IP to learn the remote MAC address of leaf L4. Please note the outer MAC/IP headers.

  4. Once L1 has learnt L4's MAC address, it encapsulates data from the 2.2.2.x network into 100.100.100.x. At this stage we can see that leaf L1 has learnt the remote MAC address of L4 (10:00:00:10:00:11), as shown below. Please check the outer IP/MAC headers.


  5. The capture below, taken on leaf L1, shows the ARP reply coming from the host attached to L4 (MAC 00:00:10:01:00:01).

In a nutshell, the outer IP and MAC addresses are learnt via multicast, so a VXLAN tunnel forms between L1 and L4 and we can see the inner IPs (2.2.2.x) communicating with each other!
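The 01:00:5e multicast MAC seen in the capture is derived from the multicast group IP: the low 23 bits of the group address are copied behind the fixed 01:00:5e prefix. A small sketch of that mapping follows. The group 239.0.0.88 is an assumption, since the group address itself is not given in the text; any group whose low 23 bits are 0.0.88 maps to the same MAC.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet multicast MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF      # only the low 23 bits are copied
    mac = (0x01005E << 24) | low23    # fixed 01:00:5e prefix
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


print(multicast_mac("239.0.0.88"))  # hypothetical group address
```

Because only 23 bits are copied, 32 different group addresses share each MAC, which is worth remembering when reading captures.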

As the network grows larger, flooding to learn MAC addresses and build tunnel endpoints is no longer a great idea. Next-generation tunnel endpoints are constructed using MP-BGP.

MP-BGP: EVPN to learn remote MAC addresses and construct VXLAN tunnels


The same hosts as before try to reach each other.

For EVPN there are five different types of control-plane messages (route types), each solving a different use case. The control-plane message below is type 2 (MAC/IP advertisement), which carries the MAC and IP information of attached hosts. (In a large-scale network, the spine and leaf switches would then know all VM prefixes.)

Let's take a brief look at the EVPN type 2 exchange.

  1. Assume the leaf and spine switches are configured correctly and the iBGP session between L1 and L4 is up after the BGP OPEN messages are exchanged. In this scenario, once the BGP session comes up, L1 and L4 learn each other's MAC addresses.
  2. MP-BGP EVPN type 2 (MAC/IP) routes are used to learn the host IP prefixes, which in our case are attached to switch ports.
  3. Once the iBGP session is established, the switch uses these MAC addresses to encapsulate VXLAN traffic. The BGP update message below shows the entries learnt by the network using BGP.
  4. With the control plane learnt, the Wireshark flow below shows the hosts communicating. Note that the outer destination MAC address 10:00:00:10:00:11 was learnt when the iBGP session was established.
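The steps above can be sketched as a leaf that installs type 2 routes received over BGP and then looks up the remote VTEP for a destination MAC, with no flooding involved. This is a simplified model, not a BGP implementation: the class and field names are illustrative, and the addresses 2.2.2.2 and 100.100.100.4 and the VNI 5000 are assumed values inside the networks used in this post.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class MacIpRoute:
    """Simplified EVPN type 2 (MAC/IP advertisement) route; fields are illustrative."""
    mac: str
    ip: Optional[str]
    vni: int
    next_hop: str  # underlay address of the advertising VTEP


class Leaf:
    def __init__(self) -> None:
        self.remote_macs: Dict[str, MacIpRoute] = {}

    def receive_update(self, route: MacIpRoute) -> None:
        """Install a route learnt from a BGP update."""
        self.remote_macs[route.mac] = route

    def lookup_vtep(self, dst_mac: str) -> Optional[str]:
        """Return the remote VTEP to encapsulate towards, if the MAC is known."""
        route = self.remote_macs.get(dst_mac)
        return route.next_hop if route else None


l1 = Leaf()
# Host MAC taken from the capture text; IP, VNI and next hop are assumptions.
l1.receive_update(MacIpRoute("00:00:10:01:00:01", "2.2.2.2", 5000, "100.100.100.4"))
print(l1.lookup_vtep("00:00:10:01:00:01"))
```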


Overall, it is all about learning remote MAC addresses to form VXLAN tunnels and using this information to encapsulate the inner IP traffic. I believe the easiest way is to use multicast; the only problem is that the spine/leaf hardware switches never learn these prefixes.

In contrast, most spine/leaf fabrics already run iBGP sessions to form a loop-free topology, so it makes sense to use EVPN, which also exposes the underlay prefixes. This way the network engineer knows all overlay and underlay prefixes, and the same information can be used to build smart analytics.



In 2016, our old BGP is the IGP.

Hence we run BGP between the spine and leaf boxes. We could always use OSPF or IS-IS between spine and leaf, but we all know BGP scales to the full internet routing table, so it can scale to any large data center with thousands of servers.

Well, there are many things to consider and different techniques for tweaking BGP, such as:

  1. Do I use eBGP or iBGP?
  2. The ASN numbering tricks that make BGP friendly to configure.
  3. Convergence is slow with BGP.
  4. BGP is complex to troubleshoot.
  5. BGP is complex to configure, as many parameters can be set.

Most of these are addressed, and many documents cover the problems above. I am going to point out one problem below and show how BGP overcomes it.

Let's start with the problem. We have a spine/leaf topology, and the host on the right side wants to reach a destination.

We have two hosts attached to the leaf switches, and both use the same IP address. Why would someone do that?


Well, consider a server that runs a load balancer as a VM with applications behind it. Now we have two physical servers, hence two load balancers, with the application hosted on both servers. Therefore the same IP comes from two physical servers and is learnt by the leaf switches.

When the leaves advertise the prefix, it comes from both SW1 and SW2. We know the leaf uplink ports are connected to every spine, which effectively means all four spines install only one prefix, from SW1 or SW2, and the other is dropped.

When the BGP update messages from SW1 and SW2 reach a spine switch and we look at the spine's routing table, only one route ends up installed and the other is left behind. Hence any host that wants to reach the prefix finds only one server reachable (assuming the SW1 prefix made it into the spine switches and the SW2 prefix was dropped). BGP running on the spine picks only one prefix because of best-path selection.

Therefore, this needs to be solved.

A BGP update message generated by a leaf contains the prefix plus the AS identity and the AS_PATH. Enabling the BGP best-path option that compares AS_PATH length, rather than requiring the same AS identity, solves the problem.

Configuring BGP this way lets the spine install both the SW1 and SW2 prefixes in its routing table. Effective ECMP between spine and leaf then allows traffic to hit both prefixes.
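A minimal sketch of the idea, assuming each leaf sits in its own AS (the ASNs 65001 and 65002 are hypothetical) and that all other BGP attributes are equal. It only models the AS_PATH step of path selection, not the full best-path algorithm.

```python
from typing import List, NamedTuple, Tuple


class Path(NamedTuple):
    next_hop: str
    as_path: Tuple[int, ...]


def multipath_set(paths: List[Path], relax: bool) -> List[Path]:
    """Return the paths eligible for ECMP after the AS_PATH comparison."""
    if not paths:
        return []
    shortest = min(len(p.as_path) for p in paths)
    candidates = [p for p in paths if len(p.as_path) == shortest]
    if relax:
        return candidates  # equal AS_PATH length is enough
    best = candidates[0]   # stand-in for the remaining tie-breakers
    # Without relaxation, only paths with an identical AS_PATH can share load.
    return [p for p in candidates if p.as_path == best.as_path]


paths = [Path("SW1", (65001,)), Path("SW2", (65002,))]
print(len(multipath_set(paths, relax=False)))  # 1
print(len(multipath_set(paths, relax=True)))   # 2
```

With the relaxed comparison both next hops survive, which is what lets the spine spread traffic across SW1 and SW2.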

Please check your vendor's command for this; on several platforms it is a form of "bgp bestpath as-path multipath-relax".



L2 GRE encapsulation

There is a lot of confusion between layer 2 and layer 3 GRE.

A layer 3 GRE tunnel…

In the topology below, with two routers, we will build the tunnel and check how a frame gets encapsulated and de-capsulated as it passes from R1 to R2.



In this case the 13.x.x.x network is the data that rides inside the tunnel. From the Wireshark trace we can see that application traffic arrives from the 13.x.x.x network and is encapsulated with the tunnel source and destination addresses plus a GRE header.

When the traffic arrives on Router 2, its outer destination is the IP address belonging to the tunnel, so the outer IP header is stripped (de-capsulated), leaving only the application traffic. We can then see the ping reply from the application.


A layer 3 GRE header carries protocol type = IP (0x0800), as shown below.


Glimpse into Layer 2 GRE

The first thing that comes to mind for L2 GRE is that the tunnels establish their control plane using source and destination MAC addresses, ruling out IP for bringing up tunnels between the two boxes. That is not the case: we still need IP addresses with GRE encapsulation for the tunnel to come up.

Inside the GRE encapsulation you will find protocol type Transparent Ethernet Bridging (0x6558). Check out the flow below for L2 GRE, or EoGRE.



When R2 constructs the EoGRE (L2 GRE) frame, it contains data IP (13.13.13.x) + MAC + GRE + outer IP. That is, a frame generated with Transparent Ethernet Bridging also carries the MAC information, which is not added in L3 GRE (only IP is seen).

That is the difference: a node generating EoGRE frames constructs MAC + IP + GRE + tunnel, whereas an L3 GRE host constructs IP + GRE + tunnel (no MAC information).
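The difference shows up in a single field of the GRE header. Here is a minimal sketch using the basic 4-byte GRE header (no checksum, key or sequence number); the function names are illustrative, and the outer IP header that the tunnel adds is not modelled.

```python
import struct

GRE_PROTO_IPV4 = 0x0800  # L3 GRE: the payload is an IP packet
GRE_PROTO_TEB = 0x6558   # L2 GRE / EoGRE: Transparent Ethernet Bridging


def gre_header(protocol_type: int) -> bytes:
    """Basic 4-byte GRE header: flags/version all zero, then the protocol type."""
    flags_and_version = 0x0000
    return struct.pack("!HH", flags_and_version, protocol_type)


def l3_gre(inner_ip_packet: bytes) -> bytes:
    """L3 GRE: GRE header directly followed by the inner IP packet (no MAC)."""
    return gre_header(GRE_PROTO_IPV4) + inner_ip_packet


def l2_gre(inner_ethernet_frame: bytes) -> bytes:
    """L2 GRE: GRE header followed by the full inner Ethernet frame (MAC + IP)."""
    return gre_header(GRE_PROTO_TEB) + inner_ethernet_frame


print(l3_gre(b"").hex(), l2_gre(b"").hex())
```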

In most use cases, L3 GRE tunnels are established between routers, whereas L2 GRE tunnels are established between a CPE or modem and a router.




Micro-burst traffic pattern

Cut-through switches provide ultra-low latency, as low as 200 ns to 600 ns. Building a data center with such switches results in an ultra-low-latency, or loss-less Ethernet, data center.

Store-and-forward switches provide deep buffers and low latency. Building a data center with S&F switches gives you a fabric that queues frames instead of dropping them, so an application using TCP does not tear down its sessions and the result is a loss-less application.

Are data centers built with the above switches ready to handle micro-bursts?

Oh yes, why not.

What is a micro-burst? These are short bursts of small frames seen in the network. On the cable reaching a switch port from an initiator or server, they are typically 30 to 40 frames of 200 to 300 bytes each. Yes, that's a micro-burst.

Apart from micro-bursts, there is a normal burst traffic pattern, with frames of approximately 1000 bytes travelling at a constant rate.

Both normal bursts and micro-bursts can cause congestion. With congestion, frames get dropped, latency increases, and so on.

A major difference is that micro-bursts consist of smaller frames and are present in the network only for a short duration, so network management tools fail to detect and report them. Normal bursts, on the other hand, are detected by management tools.
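A quick back-of-the-envelope sketch shows why a polling tool misses such a burst. It assumes a 10 Gbps link and a 1 second polling interval (both assumptions, not measurements) and ignores preamble and inter-frame gap.

```python
def burst_duration_s(frames: int, frame_bytes: int, link_bps: float) -> float:
    """Time the burst occupies the wire at line rate."""
    return frames * frame_bytes * 8 / link_bps


def average_utilization(frames: int, frame_bytes: int,
                        link_bps: float, poll_interval_s: float) -> float:
    """Link utilization the burst contributes when averaged over one poll interval."""
    return frames * frame_bytes * 8 / (link_bps * poll_interval_s)


# 40 frames of 300 bytes, the upper end of the micro-burst described above
duration = burst_duration_s(40, 300, 10e9)
util = average_utilization(40, 300, 10e9, 1.0)
print(f"burst lasts {duration * 1e6:.1f} microseconds")
print(f"average utilization over the poll interval: {util * 100:.5f} %")
```

The burst is on the wire for under 10 microseconds, so a counter averaged over a whole second barely moves.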

How do ultra-low-latency and store-and-forward switches help to build loss-less Ethernet? Consider the topology below.


In the topology above, the 31 ports connected to servers are senders and port 32 is the target. When servers 1 to 31 all try to reach server 32, how does the switch react to a micro-burst or a normal burst? Congestion occurs for either kind of traffic.

What we need to understand is the switch architecture: how much memory is allocated per switch port to queue frames as they arrive, or whether there is a centralized memory pool that congested ports can draw on and release as required.

Both kinds of switch architecture are available, and both can cope when congestion occurs because of a micro-burst or a normal burst.

When the switch queues frames, the application sees added latency, but its TCP sessions do not time out, because the frames are delayed rather than dropped.
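Here is a rough fluid-model sketch of the 31-to-1 case. It assumes all ports run at the same speed, every sender starts its burst at the same instant, and the buffer sizes are hypothetical; real switches with shared or per-port memory will behave differently in detail.

```python
def incast_peak_queue_bytes(senders: int, frames: int, frame_bytes: int) -> int:
    """Peak egress queue when all senders burst at once towards one port.

    While one burst's worth of data drains at line rate, `senders` bursts
    arrive, so (senders - 1) bursts pile up in the queue.
    """
    return (senders - 1) * frames * frame_bytes


def worst_case_delay_s(queue_bytes: int, link_bps: float) -> float:
    """Queuing delay seen by the last frame in the queue."""
    return queue_bytes * 8 / link_bps


def dropped_bytes(queue_bytes: int, buffer_bytes: int) -> int:
    """Bytes that do not fit into the available buffer."""
    return max(0, queue_bytes - buffer_bytes)


peak = incast_peak_queue_bytes(senders=31, frames=40, frame_bytes=300)
delay = worst_case_delay_s(peak, 10e9)
print(peak, "bytes queued,", round(delay * 1e6), "microseconds worst-case delay")
print(dropped_bytes(peak, 1_000_000))  # hypothetical 1 MB buffer
print(dropped_bytes(peak, 262_144))    # hypothetical 256 KB buffer
```

With enough buffer the frames are only delayed; with too little, part of the burst is dropped, which is exactly the per-port memory question raised above.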


Congestion on switch ports is normal, whether caused by micro-burst or normal traffic. Micro-burst traffic is not detected by network management tools because it comes and goes quickly.

When we decide to build loss-less Ethernet, it is important to know how the switches handle congestion. What is the architecture, cut-through or store-and-forward? What is the per-port memory allocation?

Best Regards


Priority Flow Control understanding

Priority Flow control

*S = Sender NIC on the server
*R = Receiver port on the remote side, i.e. the switch port.

S ——————– R

When R sends a PFC frame towards S, S slows down its traffic rate.

A. Bits and time taken by R to put the PAUSE frame on the wire, from the moment the switch sees congestion and invokes the PFC daemon.
B. Time and bits stored in the cable before the sender stops sending traffic.
C. Transceiver latency.
D. What has to be accounted for before S actually slows the traffic.

A. Calculating the bits and time taken by R to put the PAUSE frame on the wire

In the worst case, R generates a PAUSE frame (the PFC daemon is invoked because of congestion) just as the first bit of a maximum-size frame, MTU[R], has started transmission.

The PAUSE frame is delayed until that maximum-size frame has finished being transmitted.
The frame can belong to any CoS, so we need to account for the largest MTU configured across all CoSs.

B. Calculating Amount of time and bits stored in cables before sender stops sending traffic

Every 100 m of cable adds 476 ns for copper and 513 ns for single-mode fiber.

For a 100 m copper cable at a port speed of 10 Gbps:

10 Gbit per 1 sec
X bits per 476 ns

X = 10 Gbit/s * 476 ns = 4760 bits = 595 bytes

By the time this PFC frame starts its journey from R towards S, we already have 595 bytes inside the wire.

The amount of data “stored” in the cable when the PAUSE frame is injected into the wire is 595 bytes. By the time the PAUSE frame reaches S, another 595 bytes have been sent.
Therefore the total is 2 * 595 = 1190 bytes on copper. With the 513 ns single-mode fiber figure, the same calculation gives 2 * 641 = 1282 bytes.
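The cable arithmetic can be checked with a few lines. It relies on the fact that 1 Gbps times 1 ns is exactly 1 bit; the function names are illustrative, and the delays are the per-100 m figures quoted above.

```python
def one_way_bytes(link_gbps: int, one_way_delay_ns: int) -> float:
    """Bytes in flight on the cable in one direction (1 Gbps * 1 ns = 1 bit)."""
    bits = link_gbps * one_way_delay_ns
    return bits / 8


def round_trip_bytes(link_gbps: int, one_way_delay_ns: int) -> float:
    """Data already on the wire plus data sent while the PAUSE frame travels to S."""
    return 2 * one_way_bytes(link_gbps, one_way_delay_ns)


print(one_way_bytes(10, 476))     # 100 m copper at 10 Gbps
print(round_trip_bytes(10, 476))  # copper round trip
print(round_trip_bytes(10, 513))  # single-mode fiber round trip
```

On copper the round trip comes to 1190 bytes, while the 513 ns fiber figure gives about 1282 bytes.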

C. For this calculation we are ignoring transceiver latency.

D. How much time, and how many bits, does S take before it acts to drop the traffic rate?

When the PFC frame reaches S,
the PFC daemon is triggered and the control plane processes the frame to stop or slow the traffic.
Before this takes effect, there are still frames that need to be transmitted.
For any implementation, the PFC definition allows 60 quanta for this.

1 quantum represents the time needed to transmit 512 bits at the current network speed. By the PFC definition, 60 quanta are allowed to process the frame, which yields another

60 * 512 = 30,720 bits (3840 bytes).

Time taken by the process if the network speed is 10 Gbps:

10 Gbit —- 1 sec
512 bits —- X

Therefore X = 51.2 ns per quantum, so S takes up to 60 * 51.2 = 3072 ns to process the PFC frame and drop its traffic rate.

Therefore, before PFC takes effect, we should add up the bytes above:

A + B + C + D
= MTU[R] + 1282 + 0 + 3840 bytes (B uses the 1282-byte cable figure from above; C is ignored)

This summarizes how many bytes of data are present in the network while the PFC control-plane handshake happens between R and S.
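Putting A, B, C and D together as a small calculator. This is a sketch under the assumptions of this post only: the 1500-byte MTU is a hypothetical value, transceiver latency defaults to zero, and the function names are illustrative.

```python
QUANTUM_BITS = 512  # one PFC quantum is the time to send 512 bits


def response_time_ns(link_gbps: int, quanta: int = 60) -> float:
    """Time S may take to react: quanta * 512 bits at the link speed."""
    return quanta * QUANTUM_BITS / link_gbps


def pfc_in_flight_bytes(mtu_r_bytes: int, link_gbps: int, one_way_delay_ns: int,
                        quanta: int = 60, transceiver_bytes: float = 0) -> float:
    """Bytes in the network while the PFC handshake completes (A + B + C + D)."""
    a = mtu_r_bytes                            # A: frame R must finish sending first
    b = 2 * link_gbps * one_way_delay_ns / 8   # B: cable, both directions
    c = transceiver_bytes                      # C: ignored by default
    d = quanta * QUANTUM_BITS / 8              # D: response time of S
    return a + b + c + d


print(response_time_ns(10))                # ns for 60 quanta at 10 Gbps
print(pfc_in_flight_bytes(1500, 10, 476))  # copper, hypothetical 1500-byte MTU
print(pfc_in_flight_bytes(1500, 10, 513))  # single-mode fiber
```

Changing the MTU or cable length shows quickly how much headroom a port buffer needs for PFC to stay loss-less.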

Good information can be found in the Cisco white paper on testing PFC.

Best Regards

Ravi Patil