This chapter describes the components of EVPN-VXLAN Layer 2 on SR Linux.
Basic configuration of EVPN-VXLAN L2 on SR Linux consists of the following:
Example:
The following example shows a basic EVPN-VXLAN L2 configuration consisting of a vxlan-interface, MAC-VRF network-instance, and BGP-EVPN configuration:
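A minimal sketch of such a configuration follows. All names and values here (tunnel-interface vxlan1, network-instance mac-vrf-1, subinterface ethernet-1/1.1, VNI 1, EVI 1) are illustrative placeholders; adapt them to your deployment.

```
tunnel-interface vxlan1 {
    vxlan-interface 1 {
        type bridged
        ingress {
            vni 1
        }
    }
}
network-instance mac-vrf-1 {
    type mac-vrf
    interface ethernet-1/1.1 {
    }
    vxlan-interface vxlan1.1 {
    }
    protocols {
        bgp-evpn {
            bgp-instance 1 {
                admin-state enable
                vxlan-interface vxlan1.1
                evi 1
            }
        }
    }
}
```

The vxlan-interface of type bridged is associated with the MAC-VRF, and enabling bgp-evpn in the MAC-VRF triggers the advertisement of EVPN routes for it.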
EVPN Layer 2 (without multi-homing) includes the implementation of the BGP-EVPN address family and support for the following route types:
The MAC/IP route is used to convey the MAC and IP information of hosts connected to subinterfaces in the MAC-VRF. The IMET route is advertised as soon as bgp-evpn is enabled in the MAC-VRF; it has the following purpose:
Advertisement of the MAC/IP and IMET routes is configured on a per-MAC-VRF basis. The following example shows the default setting advertise true, which advertises MAC/IP and IMET routes.
Note that changing the setting of the advertise parameter and committing the change internally flaps the BGP instance.
Example:
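The following sketch shows where the advertise setting sits in the configuration tree, with both route types at their default of advertise true (the instance name mac-vrf-1 and bgp-instance 1 are placeholders):

```
network-instance mac-vrf-1 {
    protocols {
        bgp-evpn {
            bgp-instance 1 {
                routes {
                    bridge-table {
                        mac-ip {
                            advertise true
                        }
                        inclusive-mcast {
                            advertise true
                        }
                    }
                }
            }
        }
    }
}
```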
The creation of VXLAN destinations of type unicast, unicast ES (Ethernet Segment), and multicast for each vxlan-interface is driven by the reception of EVPN routes.
The created unicast, unicast ES, and multicast VXLAN destinations are visible in state. Each destination is allocated a system-wide unique destination index, which is an internal NHG-ID (next-hop group ID). The destination indexes for the VXLAN destinations are shown in the following example for destination 10.22.22.4, vni 1.
The following is an example of dynamically created multicast destinations for a vxlan-interface:
When a MAC is received from multiple sources, the route is selected based on the priority listed in MAC selection. Learned and EVPN-learned routes have equal priority; the latest received route is selected.
When multiple EVPN-learned MAC/IP routes arrive for the same MAC but with a different key (for example, two routes for MAC M1 with different route-distinguishers), a selection is made based on the following priority:
You can configure the BGP next hop to be used for the EVPN routes advertised for a network-instance. By default, this next hop is the IPv4 address configured on subinterface system0.0 of the default network-instance; however, it can be changed to any IPv4 address.
The system does not check that the configured IP address exists in the default network-instance. Any valid IP address can be used as the next hop of the EVPN routes advertised for the network-instance, irrespective of whether it exists on any subinterface of the system. However, the receiving leaf nodes create their unicast, multicast, and ES destinations toward this advertised next hop, so it is important that the configured next hop is a valid IPv4 address that exists in the default network-instance.
When the system or loopback interface configured for the BGP next-hop is administratively disabled, EVPN still advertises the routes, as long as a valid IP address is available for the next-hop. However, received traffic on that interface is dropped.
Example:
The following example configures a BGP next hop to be used for the EVPN routes advertised for a network-instance.
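A hedged sketch of this configuration is shown below; the instance name mac-vrf-1 and the address 10.0.0.2 are placeholders, and the exact leaf name under routes may vary by release:

```
network-instance mac-vrf-1 {
    protocols {
        bgp-evpn {
            bgp-instance 1 {
                routes {
                    next-hop 10.0.0.2
                }
            }
        }
    }
}
```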
MAC loop prevention in EVPN broadcast domains is based on the SR Linux MAC duplication feature (see MAC duplication detection and actions), but considers MACs that are learned via EVPN as well. The feature detects MAC duplication for MACs moving among bridge subinterfaces of the same MAC-VRF, as well as MACs moving between bridge subinterfaces and EVPN in the same MAC-VRF, but not for MACs moving from a VTEP to a different VTEP (via EVPN) in the same MAC-VRF.
In addition, when a MAC is declared duplicate and the blackhole configuration option is enabled, the system discards not only frames received on bridged subinterfaces whose source or destination MAC matches the blackhole MAC, but also frames encapsulated in VXLAN packets whose inner source or destination MAC matches the blackhole MAC in the mac-table.
When a MAC exceeds the allowed num-moves, the MAC entry changes to type duplicate (irrespective of the type of move: EVPN-to-local, local-to-local, or local-to-EVPN), and the EVPN application receives an update that advertises the MAC with a higher sequence number (which might trigger duplicate detection on other nodes). The duplicate MAC can be overwritten by a higher-priority type, or flushed by the tools command (see Deleting entries from the bridge table).
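MAC duplication detection itself is configured under the bridge-table of the MAC-VRF. The following is a sketch with illustrative values only (mac-vrf-1 is a placeholder; the num-moves, monitoring-window, and hold-down-time values shown are examples, not recommendations):

```
network-instance mac-vrf-1 {
    bridge-table {
        mac-duplication {
            admin-state enable
            monitoring-window 3
            num-moves 5
            hold-down-time 9
            action blackhole
        }
    }
}
```

With action blackhole, frames matching the duplicate MAC are discarded as described above; other actions keep the duplicate entry without blackholing traffic.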
SR Linux supports single-active multi-homing and all-active multi-homing, as defined in RFC 7432. The EVPN multi-homing implementation uses the following SR Linux features:
EVPN relies on three different procedures to handle multi-homing: DF election, split-horizon, and aliasing. DF election is relevant to single-active and all-active multi-homing; split-horizon and aliasing are relevant only to all-active multi-homing.
To support multi-homing, EVPN-VXLAN supports two additional route types:
Local bias for all-active multi-homing is based on the following behavior at the ingress and egress leafs:
In SR Linux, the local bias filtering entries on the egress leaf are added or removed based on the ES routes, and they are not modified by the removal of AD per EVI/ES routes. This may cause blackholes in the multi-homed CE for BUM traffic if the local subinterfaces are administratively disabled.
Figure 5 shows a single-active ES attached to two leaf nodes. In this configuration, the ES in single-active mode can be configured to do the following:
All peers in the ES must be configured with the same multi-homing mode; if the nodes are not configured consistently, the oper-multi-homing-mode in state will be single-active. From a hardware resource perspective, no local-bias-table entries are populated for ESes in single-active mode.
The following features work in conjunction with single-active mode:
Preference-based DF election is defined in draft-ietf-bess-evpn-pref-df and specifies a way for the ES peers to elect a DF based on a preference value (the highest preference-value wins). The draft-ietf-bess-evpn-pref-df document also defines a non-revertive mode, so that upon recovery of a former DF node, traffic does not revert back to the node. This is desirable in most cases to avoid double impact on the traffic (failure and recovery).
The configuration requires the command df-election/algorithm/type preference and the corresponding df-election/algorithm/preference-alg/preference-value. Optionally, you can set non-revertive mode to true. See EVPN multi-homing configuration example.
All of the peers in the ES should be configured with the same algorithm type. However, if that is not the case, all the peers will fall back to the default algorithm/oper-type.
AC-DF refers to the ability to modify the DF election candidate list for a network-instance based on the presence of the AD per-EVI and per-ES routes. When enabled (ac-df include command), a node cannot become DF if the ES subinterface is administratively disabled. The AC-DF capability is defined in RFC 8584 and is enabled by default.
The AC-DF capability should be disabled (ac-df exclude command) when single-active multi-homing is used and standby-signaling (lacp/power-off command) signals the non-DF state to the multi-homed CE/server. In this case, the same node must be DF for all the contained subinterfaces. Administratively disabling one subinterface does not cause a DF switchover for the network-instance if ac-df exclude is configured.
The AC-DF capability is configured with the command df-election/algorithm/preference-alg/capabilities/ac-df <include|exclude>; include is the default. See EVPN multi-homing configuration example.
Standby LACP-based or power-off signaling is used for cases where the AC-DF capability is excluded, and the DF election port-active mode is configured.
When single-active multi-homing is used and all subinterfaces on the node for the ES must be in DF or non-DF state, the multi-homed CE should not send traffic to the non-DF node. SR Linux supports two ways of signaling the non-DF state to the multi-homed CE: LACP standby or power-off.
Signaling the non-DF state is configured at the interface level, using the command interface/ethernet/standby-signaling <lacp|power-off>, and must also be enabled for a specific ES using the ethernet-segment/df-election/interface-standby-signaling-on-non-df <presence container> command. See EVPN multi-homing configuration example.
The LACP signaling method is only available on LAG interfaces with LACP enabled. When the node is in non-DF state, it signals this state to the CE with an LACP out-of-sync notification (the sync bit is cleared in the LACP PDUs). The CE then brings LACP down, and neither the system nor the peer transitions to the collecting-distributing state (due to out_of_sync). When the port later leaves standby mode, LACP must be re-established, and the forwarding ports are programmed only after that.
Power-off signaling is available on Ethernet and LAG interfaces. When the node is in non-DF state, the interface goes operationally down, and the laser on the Ethernet interface (on all members in the case of a LAG) is turned off. This brings the CE interface down and prevents any traffic on the link. The interface state shows oper-state down and oper-down-reason standby-signaling.
After the system boots, the reload-delay timer keeps an interface shut down with the laser off for a configured amount of time until connectivity with the rest of network is established. When applied to an access multi-homed interface (typically an Ethernet Segment interface), this delay can prevent black-holing traffic coming from the multi-homed server or CE.
In EVPN multi-homing scenarios, if one leaf in the ES peer group is rebooting, coming up after an upgrade or a failure, it is important for the ES interface not to become active until after the node is ready to forward traffic to the core. If the ES interface comes up too quickly and the node has not programmed its forwarding tables yet, traffic from the server is black-holed. To prevent this from happening, you can configure a reload-delay timer on the ES interface so that the interface does not become active until after network connectivity is established.
When a reload-delay timer is configured, the interface port is shut down and the laser is turned off from the time the system determines the interface state following a reboot or a reload of the XDP process, until the number of seconds specified in the reload-delay timer elapses.
The reload-delay timer is only supported on Ethernet interfaces that are not enabled for breakout mode. For a multi-homed LAG interface, the reload-delay timer should be configured on all interface members. The timer can be set from 1 to 86,400 seconds. There is no default value; if no timer is configured for an interface, no reload delay applies.
Only ES interfaces should be configured with a non-zero reload-delay timer. Single-homed interfaces and network interfaces (used to forward VXLAN traffic) should not have a reload-delay timer configured.
Example:
The following example sets the reload-delay timer for an interface to 20 seconds. The timer starts following a system reboot or when the IMM is reconnected, and the system determines the interface state. During the timer period, the interface is deactivated and the port laser is inactive.
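A minimal sketch of this configuration follows (the interface name ethernet-1/2 is a placeholder):

```
interface ethernet-1/2 {
    ethernet {
        reload-delay 20
    }
}
```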
When the reload-delay timer is running, the port-oper-down-reason for the port is shown as interface-reload-timer-active. The reload-delay-expires state indicates the amount of time remaining until the port becomes active. For example:
The following is an example of a single-active multi-homing configuration, including standby signaling power-off, AC-DF capability, and preference-based DF algorithm.
The following configures power-off signaling and the reload delay timer for an interface:
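A sketch of the interface-level configuration is shown below; the interface name ethernet-1/3 and the reload-delay value are placeholders:

```
interface ethernet-1/3 {
    ethernet {
        standby-signaling power-off
        reload-delay 100
    }
}
```

Note that standby-signaling selects the method (lacp or power-off) at the interface level; it takes effect for an ES only when the corresponding interface-standby-signaling-on-non-df option is also enabled under the Ethernet segment.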
The following configures DF election settings for the ES, including preference-based DF election and a preference value for the DF election alg. The ac-df setting is set to exclude, which disables the AC-DF capability. The non-revertive option is enabled, which prevents traffic from reverting back to a former DF node when the node reconnects to the network.
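A hedged sketch of this ES configuration follows. The ES name ES-1, the ESI value, the interface name, and the preference value 100 are illustrative placeholders; the container paths reflect the df-election commands described above:

```
system {
    network-instance {
        protocols {
            evpn {
                ethernet-segments {
                    bgp-instance 1 {
                        ethernet-segment ES-1 {
                            admin-state enable
                            esi 01:24:24:24:24:24:24:00:00:01
                            interface ethernet-1/3
                            multi-homing-mode single-active
                            df-election {
                                interface-standby-signaling-on-non-df {
                                }
                                algorithm {
                                    type preference
                                    preference-alg {
                                        preference-value 100
                                        capabilities {
                                            ac-df exclude
                                            non-revertive true
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```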
The following shows the state (and consequently the configuration) of an ES for single-active multi-homing and indicates the default settings for the algorithm/oper-type. All of the peers in the ES should be configured with the same algorithm type. However, if that is not the case, all the peers will fall back to the default algorithm.
To display information about the ES, use the show system network-instance ethernet-segments command. For example:
The detail option displays additional information about the ES. For example:
On the DF node, the info from state command displays the following:
On the non-DF node, the info from state command displays the following: