To optimize the cost, certain operators prefer an oversubscribed model in which a single central standby BNG (protecting BNG) supports multiple other BNGs in a semi-stateful fashion.
In the Oversubscribed Multi-Chassis Redundancy (OMCR) model, many subscriber hosts are backed up by a single central standby node. Standby subscriber hosts within the protecting node are synchronized only within the control plane (CPM) in the form of a Multi-Chassis Synchronization (MCS) record. Such subscriber hosts are not instantiated in the data plane, so data plane resources are spared and used only on an as-needed basis. This trait allows the protecting node to back up many subscribers that are scattered over multiple active BNG nodes, at the expense of slower convergence.
Only a subset of the subscribers, up to the available resource capacity of the data plane in the protecting node, would be activated on the protecting node at any given time during the failure.
The failover trigger is based on SRRP (no MC-LAG support). The subscriber hosts under the corresponding group interface are switched over once the SRRP instance on the protecting node transitions into the Master SRRP state.
There are two possible models for this deployment:
In both cases, a maximum of 64K subscribers per line card can be activated on the protecting BNG during the switchover. The operator should plan around this limit and group the access nodes so that the eventual number of active subscribers per line card on the protecting node does not exceed the maximum number of supported subscribers per line card.
Note: A deployment scenario can exist in which system-wide ESM capacity is oversubscribed but the line card capacity is not. For example, on a chassis with 10 line cards, each line card can be reserved to protect a total host count of 64k. This would yield a total of 640k protected hosts distributed across the 10 cards, but only up to 256k hosts could be activated simultaneously if SRRP transitions to Master require it.
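The planning rule above can be checked with simple arithmetic. The following Python sketch is illustrative only and is not an SR OS tool: the function name and the data layout are hypothetical, and treating "64K" and "256k" as multiples of 1024 is an assumption. It sums the hosts that would be activated per line card for a given set of SRRP instances that become Master, and reports any limit that would be exceeded.

```python
from collections import defaultdict

# Limits taken from the text; interpreting "k" as 1024 is an assumption.
PER_CARD_LIMIT = 64 * 1024
SYSTEM_LIMIT = 256 * 1024  # simultaneous activation limit from the example in the note


def check_activation(srrp_instances, failed_ids,
                     per_card_limit=PER_CARD_LIMIT, system_limit=SYSTEM_LIMIT):
    """Return (per_card_totals, violations) for SRRP instances going Master.

    srrp_instances: dict mapping SRRP id -> (line_card, host_count)
                    (hypothetical layout, for illustration only)
    failed_ids:     SRRP ids that become Master on the protecting node
    """
    per_card = defaultdict(int)
    for srrp_id in failed_ids:
        card, hosts = srrp_instances[srrp_id]
        per_card[card] += hosts

    violations = []
    for card, total in sorted(per_card.items()):
        if total > per_card_limit:
            violations.append(f"line card {card}: {total} > {per_card_limit}")
    system_total = sum(per_card.values())
    if system_total > system_limit:
        violations.append(f"system: {system_total} > {system_limit}")
    return dict(per_card), violations
```

Running the check against each plausible combination of simultaneous failures shows which groupings of access nodes stay within the per-line-card limit and which would leave some subscribers unprotected.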
The protection success of the OMCR model relies on grouping protected entities (links and nodes) according to the likelihood of their failure within the time frame required for their restoration. For example, the same resource (IOM card or port) on the protecting node can be used to protect multiple entities in the network if their failures do not overlap in time. In other words, if one failure can be repaired before the next failure that contends for the same resource on the central standby node occurs, the OMCR model serves its purpose.
However, since the oversubscribed model does not offer any guarantees, the protecting node can in certain cases run out of resources and fail to offer protection. In this case, the protecting node generates an SNMP trap identifying the SRRP instance on which subscriber protection has failed. One SNMP trap is raised per SRRP instance when at least one subscriber under the corresponding group interface was not instantiated. The trap is cleared either when all subscribers become instantiated or when the SRRP instance transitions into a non-master state.
The number of subscriber hosts that failed to instantiate can also be determined using the operational show>redundancy>multi-chassis all command. This command shows the number of subscribers that failed to instantiate, along with the SRRP instances on which those subscriber hosts rely for successful connectivity.
Pre-emption of already instantiated subscriber hosts in the protecting node by other subscriber hosts is not allowed.
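The interaction between limited capacity, the no-pre-emption rule, and the per-SRRP trap can be sketched as follows. This is a simplified model, not the actual implementation: the function name and data layout are hypothetical, and it assumes that requests are served in arrival order and that an SRRP instance can be partially instantiated when capacity runs out.

```python
def instantiate(requests, free_slots):
    """Instantiate hosts in request order until data-plane capacity runs out.

    requests:   list of (srrp_id, host_count) in arrival order (hypothetical layout)
    free_slots: data-plane host slots still free on the protecting node
    Returns (instantiated, failed, traps): the first two map SRRP id -> host
    count; traps is the set of SRRP ids with at least one host not instantiated.
    """
    instantiated, failed = {}, {}
    for srrp_id, hosts in requests:
        # Already-active hosts are never evicted, so a later request can only
        # use whatever capacity is still free.
        granted = min(hosts, free_slots)
        free_slots -= granted
        instantiated[srrp_id] = granted
        failed[srrp_id] = hosts - granted
    # One trap per SRRP instance with at least one failed host.
    traps = {srrp_id for srrp_id, count in failed.items() if count > 0}
    return instantiated, failed, traps
```

For example, with 600 free slots and requests for 300 and then 500 hosts, the first SRRP instance is fully served, the second is left with 200 hosts not instantiated, and only the second instance raises a trap.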
Management and conservation of resources is of utmost importance in OMCR. The resources consumed by the subscriber host depend on the type and the size of subscriber parameters (the number of strings, length of strings, and so on).
For these reasons it is crucial that the operator has a view of the amount of memory in the CPM utilized by subscribers and the amount of free memory that can be used for additional subscribers. The MCS line is of interest in this output. In addition, the Subscriber Mgmt line shows memory utilization for active subscribers in the CPM.
The Available Memory gives an indication about how much memory remains.
For example:
Similar output is given regarding CPU utilization:
The protecting node operates in a warm-standby mode. Warm-standby mode of operation is a property of the entire node. In other words, while in the central-standby mode of operation (warm-standby command), only subscribers under the SRRP instances that are in the Master state are fully instantiated in the data plane on the central standby node (protecting node). All other subscribers (under the SRRP instances that are in the standby state) are synchronized only in the control plane. However, non-central standby nodes can have a peering connection with a protecting node (OMCR) and at the same time another peering connection with another active BNG node in the active/active model. All nodes participating in the OMCR mode of operation must run SR OS Release 12.0 or higher. This model is shown in Figure 159.
The central backup property is configured with the following CLI:
The warm-standby keyword configures the chassis to be in the central standby mode of operation. Although the configuration option is configured per peer, the warm-standby functionality is applied per chassis.
Synchronization of IPoE subscribers (config>redundancy>multi-chassis>peer>sync>sub-mgmt ipoe) on the protecting node is only possible if all peers are configured for warm-standby or none are.
To transition from one mode to another (warm to hot, or hot to warm), all peers must be administratively shut down and the warm-standby keyword must be either removed or configured on all peers, depending on the direction of the transition.
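The all-or-none rule for the warm-standby keyword lends itself to a simple consistency check. The Python sketch below is hypothetical and does not read any real SR OS configuration: it assumes the operator has already collected, per peer, whether warm-standby is configured.

```python
def ipoe_sync_allowed(peers):
    """Return True if IPoE subscriber sync is possible on the protecting node.

    peers: dict mapping peer name -> True if warm-standby is configured
           for that peer (hypothetical layout, for illustration only)
    """
    # Allowed only if every peer is warm-standby or none is.
    return len(set(peers.values())) <= 1


def inconsistent_peers(peers):
    """Return the sorted names of peers whose setting differs from the majority."""
    warm = sorted(name for name, is_warm in peers.items() if is_warm)
    hot = sorted(name for name, is_warm in peers.items() if not is_warm)
    if not warm or not hot:
        return []
    return warm if len(warm) < len(hot) else hot
```

A mixed result points at the peers that still need the keyword added or removed before the mode transition is complete.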
Single-homed subscribers are supported in the central standby node, subject to resource limitations.
OMCR is supported only for IPoEv4/v6 subscribers. PPPoEv4/v6 subscriber hosts are not supported. However, non-synchronized PPPoE hosts can be hosted on the protecting node simultaneously with the protected IPoE subscribers. PPPoE PTA (locally terminated) non-synchronized subscribers and OMCR synchronized IPoE subscribers must not be configured under the same group interface. On the other hand, non-synchronized PPPoE LAC sessions can be under the same group interface as the OMCR synchronized IPoE subscribers.
The recovery of PPPoE subscriber hosts in a non-synchronized environment is based on the timeout of ppp-keepalives.
Persistency in the multi-chassis environment must be disabled since redundant nodes are protecting each other and they maintain up-to-date lease states. Otherwise, race conditions resulting in stale lease states caused by contention between MCS data and persistency data may occur.
Support for the redundant interface is limited and can be used only in cases where subscribers are activated in the protecting node. In other words, shunting over the redundant interface cannot be used if subscriber hosts are not fully instantiated (in the data and control plane). For this reason, downstream traffic must not be attracted (via routing) to the protecting node while the subscriber hosts are in the standby mode (SRRP is in a backup state).
During the transient period while the switchover is in progress, subscriber hosts are being instantiated or withdrawn (depending on the direction of the switchover) in the data plane on the protecting node. The duration of this process depends on the number of hosts that need to be instantiated or withdrawn, and it is proportional to the regular host setup and tear-down rates. The redundant interface in this case can be used only for the hosts that are present in the data plane during the switchover transition period (from the moment that they are instantiated in the data plane when the protecting node is assuming activity, or up to the moment when they are withdrawn from the data plane when the protecting node is relinquishing activity).
The following routing models are supported:
Case 1: SRRP-aware routing where subnets can be assigned per group interface (SRRP instance). In a steady state, the redundant interface is not needed since the downstream traffic is attracted to the master node. During switchover periods (routing convergence transition periods), the redundant interface can be used only for the subscriber hosts that are instantiated in the data plane (from the moment that they are instantiated in the data plane when the protecting node is assuming activity, or up to the moment when they are withdrawn from the data plane when the protecting node is relinquishing activity). See Figure 160.
Case 2: SRRP-aware routing where subnets span (are aggregated over) multiple group interfaces (SRRP instances). In case of a switchover, /32 IPv4 addresses and /64 IPv6 addresses/prefixes are advertised from the protecting node. In a steady state, the redundant interface is not needed since the downstream traffic is attracted via more specific routes (/32s and /64s) to the master node. During switchover periods (routing convergence transition periods), the redundant interface can be used only for the subscriber hosts that are instantiated in the data plane (from the moment that they are instantiated in the data plane when the protecting node is assuming activity, or up to the moment when they are withdrawn from the data plane when the protecting node is relinquishing activity). To reduce the number of routes on the network side, /32s and /64s should only be activated on the protecting node.
A deployment case that is not supported is the one where subnets span (are aggregated over) multiple group interfaces (SRRP instances) and at the same time /32s are not allowed to be advertised from the protecting node. This scenario would require redundant interface support while subscriber hosts are not necessarily instantiated in the protecting node.
If the failure is repaired on the original active node (non-central standby node) while SRRP preemption (preempt) is configured, the corresponding active subscribers on the protecting node are withdrawn from the data plane and the activity (mastership) is switched to the original node.
This behavior ensures that the resources in the central backup are freed upon failure restoration and are available for protection of other entities in the network (other links/nodes).
In the preemption case, the upstream traffic is steered towards the newly active BNG via gratuitous ARP (GARP). In other words, the virtual MAC is advertised from the newly active node, and consequently the access and aggregation nodes update their Layer 2 forwarding entries. This action should cause no interruption in the upstream traffic.
In the downstream direction, the service interruption is equivalent to the time it takes to withdraw the routes from the network side on the standby node. In this case, there are two scenarios:
Service restoration times depend on the scale of the outage. The factors that affect the restoration times are:
When multiple SRRP instances fail at the same time, they are processed one at a time on a first-come, first-served basis. The subscriber instantiation processing during the switchover is divided into 1-second intervals. Between those intervals, the state of the SRRP instance is checked to ensure that it has not changed while the subscriber instantiation is in progress. This mechanism breaks the inertia (snowball effect) that can be caused by SRRP instance flaps. Furthermore, an SRRP flap is handled by not requesting a withdrawal followed by an instantiation request for the same SRRP instance.
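The interval-based processing described above can be modeled with a short simulation. This Python sketch is a simplification under stated assumptions, not the actual scheduler: the function name, the `hosts_per_interval` rate, and the `is_master` callback are hypothetical. It does not model the withdrawal of hosts after an SRRP instance leaves the Master state, nor the flap suppression.

```python
from collections import deque


def process_switchover(pending, hosts_per_interval, is_master):
    """Simulate first-come, first-served instantiation in fixed intervals.

    pending:            list of (srrp_id, host_count) in arrival order
    hosts_per_interval: hosts instantiated per 1-second interval
                        (hypothetical rate, chosen for illustration)
    is_master:          callable(srrp_id, interval_index) -> bool, the SRRP
                        state seen at the check before each interval
    Returns (done, intervals): done maps SRRP id -> hosts instantiated.
    """
    queue = deque(pending)
    done = {}
    interval = 0
    while queue:
        srrp_id, remaining = queue.popleft()  # one SRRP instance at a time
        done[srrp_id] = 0
        while remaining > 0:
            # State check between intervals: stop working on this instance
            # if it is no longer Master.
            if not is_master(srrp_id, interval):
                break
            batch = min(remaining, hosts_per_interval)
            done[srrp_id] += batch
            remaining -= batch
            interval += 1
    return done, interval
```

Because the state is re-checked before every interval, an SRRP instance that flaps stops consuming instantiation time after at most one interval, and the next instance in the queue is served.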
The OMCR accounting follows the active/active (1:1) redundancy model.
One difference in accounting behavior between the OMCR model and the 1:1 redundancy model is in the processing of the accounting session-time attribute, which on the protecting node denotes the time when the host was instantiated on the protecting node.
In contrast, the session-time attribute in the 1:1 redundancy model is recorded almost simultaneously on both BNG nodes at the time when the host is originally instantiated.
As a result, the session-time attribute is for the most part uninterrupted during the switchover in the 1:1 model, whereas in the OMCR model the session-time attribute is reset on the switchover to the protecting node.
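The difference in session-time can be illustrated with a small calculation. The Python sketch below is a hypothetical model with made-up parameter names. It assumes all times are in seconds on a common clock and, for simplicity, that the protecting node instantiates the host at the instant of the switchover.

```python
def session_time(now, host_created_at, switchover_at=None, model="1:1"):
    """Return the session-time (seconds) reported by the node active at `now`.

    now:             current time
    host_created_at: time the host was originally instantiated
    switchover_at:   time of the switchover to the standby/protecting node,
                     or None if no switchover happened
    model:           "1:1" or "omcr"
    """
    if model == "omcr" and switchover_at is not None:
        # The protecting node counts from when it instantiated the host,
        # so the value restarts at the switchover.
        return now - switchover_at
    # In the 1:1 model both nodes recorded the start at original instantiation.
    return now - host_created_at
```

For a host created at t=0 with a switchover at t=900, a report at t=1000 shows 1000 seconds in the 1:1 model but only 100 seconds in the OMCR model.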
Note: An SRRP switch from non-Master to Master in the protected node does not suffer the slow convergence observed when the non-Master to Master transition takes place in the protecting node. This is because the protected node always has the hosts instantiated in the data plane.
Some of the commands that can assist in troubleshooting are listed below.
Note: To get a summary view of SRRP instances and their OMCR status, use the following command (the domain concept is reserved for future use):
Note: To obtain OMCR information for a specific SRRP instance, use the show srrp x detail command, which includes OMCR information:
Note: To view the MCS synchronization, including OMCR standby records:
Note: To view the sync status in the MCS database, including OMCR status, use the following command syntax: