ESM multi-chassis redundancy with PXC-based PW ports and EVPN VPWS

Redundant BNGs with EVPN VPWS in the access area of the network rely on the EVPN Single-Active (SA) multihoming concept with PW ports in Ethernet Segments (ES). A PW port on one side in the ES is elected as the Designated Forwarder (DF) and the other side as the non-Designated Forwarder (NDF). The ES with the PW port as DF is operationally up, and conversely, the ES with the PW port as NDF is operationally down. The DF side is the active side while the NDF side is the standby side. SRRP, as part of the subscriber management redundancy scheme, indirectly tracks ES states to determine which BNG side is active and which is standby. With multiple EVPN VPWS instances, the load is distributed between the redundant BNGs: one BNG can be active for one set of EVPN VPWS instances while the other BNG is active for another set. The operator can influence the selection of the active side (DF side) for each EVPN VPWS by configuring a higher preference number on the preferred DF side.

config>service>system>bgp-evpn>eth-seg>service-carving$ manual preference <number>
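
The effect of the preference setting can be illustrated with a small model. The following Python sketch is not SR OS code; it assumes only what is stated above, namely that the side with the higher preference becomes the DF. The names EsPeer and elect_df, the BNG names, and the preference values are hypothetical, and tie-breaking between equal preferences is deliberately not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EsPeer:
    name: str        # hypothetical BNG name
    preference: int  # value configured with "manual preference <number>"


def elect_df(peers):
    """Return (df, ndf_list); the peer with the highest preference is the DF.

    Assumption: equal preferences are rejected because tie-breaking
    is outside the scope of this sketch.
    """
    if not peers:
        raise ValueError("at least one peer is required")
    ranked = sorted(peers, key=lambda p: p.preference, reverse=True)
    if len(ranked) > 1 and ranked[0].preference == ranked[1].preference:
        raise ValueError("equal preferences: tie-breaking is not modeled here")
    return ranked[0], ranked[1:]


# Two ESs (one per EVPN VPWS) with mirrored preferences spread the load:
# BNG1 is the DF for the first ES, BNG2 for the second.
es1 = [EsPeer("BNG1", 200), EsPeer("BNG2", 100)]
es2 = [EsPeer("BNG1", 100), EsPeer("BNG2", 200)]
df1, ndf1 = elect_df(es1)
df2, ndf2 = elect_df(es2)
print(df1.name, df2.name)  # BNG1 BNG2
```

Mirroring the preference values across ESs, as in the example, is one way to express the load distribution described above.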

In a typical ESM environment, a PW port contains thousands of PW SAPs, with each PW SAP representing a subscriber. To minimize the outage time during failures, the operator (through configuration) can optionally keep those PW SAPs operationally up even if the underlying PW port is in the NDF state. This reduces the failover time otherwise required to bring all the PW SAPs operationally up.

The SRRP state must transition into a standby state on the NDF side even if the PW port is operationally up. To achieve this, the SRRP messaging PW SAP goes through an oper-group that tracks the state of the ES, whose operational state is up on the DF side and down on the NDF side.

The basic concept of this approach, where the messaging SRRP PW SAP is tracking the state of the ES, is shown in Figure: BNG multi-chassis redundancy with EVPN VPWS. There are two key concepts introduced:

The oper-up-on-mhstandby CLI flag ensures that the PW port is operationally up even while it is the NDF.
The SRRP messaging SAP tracks the state of the corresponding ES through an oper-group ("demo-ES2"). This ensures that the SRRP follows the activity state of the EVPN VPWS, while the PW port remains operationally up on both BNGs (active and standby).

Figure: BNG multi-chassis redundancy with EVPN VPWS

Within an SR node, a collection of separate entities work together to detect a failure in the network and divert the traffic around it. Those entities are:

ESM where subscribers are synchronized between the chassis
EVPN VPWS in the access area of the network
Routing that is advertising subscriber routes into the network
Oper-groups used to interconnect operational states between the entities
Various network failure detection mechanisms, such as BFD, to quickly detect path failures between BGP peers

The following is a detailed description of the setup with a single EVPN VPWS and two BNGs (Figure: BNG multi-chassis redundancy with EVPN VPWS).

EVPN VPWS is configured as Single-Active (SA) multihoming. SA is crucial in ESM because it drives the SRRP state, which must always be active on one BNG and standby on the other BNG of the redundant pair.
BNGs are connected to the AN through an EVPN VPWS.
One BNG in the EVPN is selected as the DF (BNG1), the other BNG (BNG2) is the NDF.
BNG2 brings its ES down.
Only BNG1 advertises its AD route toward the access node.

Consequently, the AN does not send any traffic to BNG2 (NDF). Instead, the AN sends all traffic only to BNG1 (DF).
The ES is part of an oper-group (OG) which is monitored from the ESM side.
The stitching Epipe on BNG2 does not change its status. Neither does the PW port in it. The PW port stays up despite the MHStandby flag being raised. Normally, the MHStandby flag would cause the PW port to go down, but because of the oper-up-on-mhstandby configuration option, this behavior is overridden.
ESM subscribers are synchronized between the chassis through MCS and use SRRP on the access side. With EVPN in the access, SRRP does not rely on its own keepalives to check the health of the network path; instead, it follows the state of the PW port or the ES. If the PW port is operationally up, the messaging PW SAP is up, and therefore the SRRP is active. Conversely, if the PW port is operationally down, the messaging PW SAP is down and consequently SRRP is in the backup state. This is the expected behavior when the EVPN MPLS destination (network bind) goes down.

However, in the SA multihoming scenario, when the EVPN MPLS destination is not down, the PW port remains up even if the PW port is an NDF. Instead of relying on the PW port state, the SRRP messaging PW SAP monitors the state of the ES through an oper-group. When the oper-group changes its state to down, so does the SRRP messaging PW SAP, which then forces the SRRP into an INIT state (which is equivalent to a standby state).
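
The dependency chain described in this setup (DF election result, ES state, oper-group, PW port, messaging PW SAP, SRRP state) can be summarized in a short model. The following Python sketch is an illustration under simplifying assumptions, not SR OS behavior in full: the class BngState and its field names are hypothetical, the ES state is derived only from the DF election result, and the backup and INIT states are collapsed into a single "standby" value.

```python
from dataclasses import dataclass


@dataclass
class BngState:
    is_df: bool                        # DF election result for the ES on this BNG
    evpn_dest_up: bool = True          # EVPN MPLS destination (network bind) state
    oper_up_on_mhstandby: bool = True  # models the oper-up-on-mhstandby option

    @property
    def es_up(self):
        # The ES is operationally up on the DF side and down on the NDF side.
        return self.is_df

    @property
    def oper_group_up(self):
        # The oper-group follows the ES it monitors.
        return self.es_up

    @property
    def pw_port_up(self):
        # The PW port goes down with the EVPN destination. As NDF it stays up
        # only when oper-up-on-mhstandby overrides the MHStandby flag.
        if not self.evpn_dest_up:
            return False
        return self.is_df or self.oper_up_on_mhstandby

    @property
    def messaging_sap_up(self):
        # The SRRP messaging PW SAP needs its PW port and the oper-group up.
        return self.pw_port_up and self.oper_group_up

    @property
    def srrp_state(self):
        # Simplification: backup and INIT are both reported as "standby".
        return "active" if self.messaging_sap_up else "standby"


bng1 = BngState(is_df=True)
bng2 = BngState(is_df=False)
print(bng1.pw_port_up, bng1.srrp_state)  # True active
print(bng2.pw_port_up, bng2.srrp_state)  # True standby
```

In this model the PW port is up on both BNGs, yet only the DF side reports SRRP as active, which is the outcome the oper-group tracking is meant to achieve.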

On the network side, the state of the SRRP controls the advertisement of the subscriber IP routes into the network. Subscriber routes are advertised with a lower cost from the active SRRP node than from the standby SRRP node.
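
The route preference that results from this cost difference can be sketched as follows. This Python fragment is only an illustration: the metric values 10 and 100 and the function names are hypothetical, chosen to show that the network selects the BNG advertising the lower cost.

```python
ACTIVE_METRIC = 10    # hypothetical cost advertised by the active SRRP node
STANDBY_METRIC = 100  # hypothetical, higher cost advertised by the standby node


def advertised_metric(srrp_active):
    """Return the cost a BNG attaches to subscriber routes for its SRRP state."""
    return ACTIVE_METRIC if srrp_active else STANDBY_METRIC


def preferred_bng(srrp_active_by_bng):
    """Return the BNG whose subscriber route advertisement has the lowest cost."""
    metrics = {
        bng: advertised_metric(active)
        for bng, active in srrp_active_by_bng.items()
    }
    return min(metrics, key=metrics.get)


print(preferred_bng({"BNG1": True, "BNG2": False}))  # BNG1
print(preferred_bng({"BNG1": False, "BNG2": True}))  # BNG2
```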
The solution described above protects against failures in the access part of the network or BNG node failure. Optionally, network side ports can be placed in an oper-group that can be monitored from the EVPN side. This can be used to protect against network port failures.