A routing protocol is needed to dynamically discover the shortest loop-free path through the underlay of a DC fabric to reach every destination IP subnet. The Border Gateway Protocol (BGP) is one of the leading technologies for this purpose due to its simplicity, scalability, and ease of multi-vendor interoperability.
BGP also provides policy mechanisms to perform hop-by-hop traffic engineering, leveraging functionality originally designed for this same purpose in the public Internet.
The information and configuration in this chapter are based on SR Linux Release 19.11 and later.
Figure 1 shows a 3-stage CLOS fabric design using only BGP for underlay routing.
The design example in Figure 1 shows the following:
Using BGP as shown in the Figure 1 example has these advantages:
The following example defines how to bring up a static, pre-configured EBGP session between Router 3 and Router 5 as shown in Figure 1. Use the following two procedures to define the minimum configuration required for each router:
To configure Router 3:
To configure Router 5:
When two BGP routers form a session, they each propose a value for the session hold-time in their OPEN messages. The lowest of the two proposed values becomes the operational hold-time for the lifetime of the session. If the operational hold-time is greater than zero, then both routers are agreeing to send KEEPALIVE messages to each other. This ensures that any loss of connectivity between them can be detected.
Each router restarts its hold-timer every time it receives a message from the other peer. If the operational hold-timer reaches zero without receiving any KEEPALIVE or related message from the peer, then the session is torn down (returned to the Idle state). Each router sends a KEEPALIVE message to its peer no more than 1 message every keepalive interval. The default value for the keepalive interval is one third of the operational hold-time, but it possible to configure a different interval.
In a data center environment, an EBGP session failure is usually caused by an interface going down. Interface events are propagated to BGP if fast-failover is enabled. The hold-timer expiry is not the usual mechanism for detecting connectivity problems. However, there may be some circumstances where some adjustment of the hold-time and/or the keepalive-interval is desired.
With the SR Linux, the default hold-time is 90 seconds and the default keepalive interval is 30 seconds. To change the hold-time on a session to 24 seconds with a keepalive interval of 8 seconds (1/3 of 24) it would be sufficient to change the hold-time value to 24, as shown in the following example:
After this change is committed, the affected session will flap and the new operational timer values are seen in the output of the show network-instance protocols bgp neighbor detail report. For example:
By default, the SR Linux BGP process (running the BGP control plane) does not advertise a route for an IPv4 or IPv6 prefix until it has positive confirmation from the FIB manager process that the route is installed into the FIB of all installed line cards. This ensures that the router does not attract traffic destined for an IP prefix until all line cards have the ability to forward the traffic. Note that the BGP process does not delay route withdrawals until it knows that all line cards have removed the FIB state as this is not needed.
It is recommended that the wait-for-fib-install feature remain enabled on routers that are in the datapath (that is, routers that set BGP next-hop-self). However, this does cause the rate of RIB-OUT route advertisements to slow to the rate of FIB programming. If the objective of a BGP performance test is to reach the highest possible route advertisement rate, the wait-for-fib-install configuration leaf can be set to false. For example:
The BGP protocol and its state machine must attempt to re-converge whenever the following occurs:
When any of these conditions are met, the router re-synchronizes its BGP RIB with the BGP RIB of other routers in the network. Once re-synchronization completes, BGP has "converged". During convergence, the following occurs to the restarting router:
The default behavior of SR Linux BGP is to execute all of the above steps in parallel. As soon as the first BGP session has re-established, the restarting router begins to advertise its own best paths to that BGP neighbor (even though it is still in the early stages of rebuilding its RIB-IN database).
As more sessions come up and more routes are learned, it is likely that routes previously considered best are no longer best, leading to multiple route advertisements for the same prefix with each incrementally better than the previous one. The best route is not determined until the last advertisement. The intermediate route advertisements can substantially increase the processing workload on the restarting router as well as its BGP neighbors. This can lengthen the overall convergence time and cause short term inefficiencies in traffic forwarding.
Instead of re-converging as described above, SR Linux BGP can also be configured to delay the advertisement of BGP routes in a particular address family until convergence has occurred for that address family or until a configured time limit has expired. This behavior is activated by configuring a non-zero value for the min-wait-to-advertise configuration leaf. For example:
The max-wait-to-advertise leaf value for the IPv4-unicast and IPv6-unicast address families can be configured, or you can accept their default values (3x the min-wait-to-advertise value). If configuring a max-wait-to-advertise leaf with a non-default value, the value must be greater than the configured min-wait-to-advertise timer. In the example below, the max-wait-to-advertise timer is set to 900 seconds for IPv4-unicast and set to 800 seconds for IPv6-unicast.
The min-wait-to-advertise timer begins after one of the following triggers occurs and the first BGP session becomes established.
When the first session that supports the exchange of IPv4-unicast routes is established, the max-wait-to-advertise timer of the IPv4 address family starts. Likewise, when the first session that supports the exchange of IPv6-unicast routes is established, the max-wait-to-advertise timer of the IPv6 address family starts.
While the min-wait-to-advertise timer is running, BGP sessions come up, and routes are learned and sorted according to preference by the BGP decision process. However, no routes are advertised to any of the peers.
When the min-wait-to-advertise expires, BGP makes a list of IPv4 and IPv6 peers (that is, peers that support the exchange of IPv4-unicast routes and IPv6-unicast routes). It expects to receive the IPv4-unicast End of RIB (EOR) marker from each neighbor in the list of IPv4 peers, and it expects to receive the IPv6-unicast EOR from each neighbor in the list of IPv6 peers.
When BGP in the restarting router receives the last expected IPv4-unicast EOR, it declares that address family as converged and starts to advertise its best IPv4-unicast routes. Likewise, when BGP receives the last expected IPv6-unicast EOR it i declares that address family as converged and starts to advertise its best IPv6-unicast routes.
If the max-wait-to-advertise timer expires before the last expected EOR is received for an address family, the convergence state for the address family moves to 'timeout' and RIB-OUT advertisement is triggered. This occurs even though convergence is not complete. The max-wait-to-advertise timers are a fail-safe. They handle the scenario when one or more peers come up within the min-wait-to-advertise window, but their EORs are not sent.
In the example that follows, the BGP convergence process is triggered by a hard reset of all peers of the BGP instance:
If the show network-instance protocols bgp summary command is issued a few minutes after the session restarts, a snapshot of the convergence process can be viewed. For example, in the sample output below, ten IPv4-unicast sessions are established when the min-wait-to-advertise timer expires and IPv4-unicast convergence takes 517 seconds.
Some data centers are migrating away from an IPv4/IPv6 dual-stack infrastructure and moving towards an IPv6-only infrastructure. In an IPv6-only design, each interface in the fabric (such as the leaf-spine, leaf-TOR) is assigned one or more IPv6 addresses, but no IPv4 addresses.
In order to route and forward IPv4 packets over an IPv6-only fabric, the leaf and spine switches must support the following:
In the SR Linux, the ability to advertise a BGP route for an IPv4 NLRI with an IPv6 BGP next-hop address is not enabled by default. To enable, use the advertise-ipv6-next-hops command, which is available on a per session basis. The following is a sample configuration:
In the SR Linux, the ability to receive a BGP route for an IPv4 NLRI with an IPv6 BGP next-hop address is not enabled by default. To enable, use the receive-ipv6-next-hops command, which is available on a per session basis.
This command allows SR Linux to advertise the extended-next-hop-encoding BGP capability, defined in RFC 5549, to the peers included in the scope of the command. This BGP capability encodes NLRI AFI 1, NLRI SAFI 1, and next-hop AFI 2. It informs peers that they can advertise MP-BGP encoded IPv4 routes with IPv6 next-hops. When the routes are received, the router will then attempt to resolve them using IPv6 routes.
If the router receives an IPv4 route with an IPv6 next-hop that is resolved by a static or direct IPv6 route (and an IPv6 neighbor entry for the next-hop host address), the IPv4 route is programmed in the FIB so that matching IPv4 packets are sent without additional encapsulation. Packets are sent through the indicated interface with a MAC destination address provided by the IPv6 neighbor entry.
The following is a sample configuration:
The datapath of the SR Linux discards all IPv4 packets that are received on an IPv6-only subinterface (that is, a subinterface with no configured IPv4 addresses). This is done for security reasons. However, if the router has advertised IPv4 routes with IPv6 next-hops to a peer, the check should be disabled on all subinterfaces that could be used by the peer when it installs the IPv4 route.
To disable this check on all subinterfaces bound to a particular network-instance, set the ipv4-receive-check leaf to false.
The following is a sample configuration:
All BGP-related CLI commands can be found in the SR Linux Data Model Reference.