Health status and failure events

A health value determines the activity of a NAT group within a pair of redundant nodes. The health value of a NAT group is calculated internally. The system can automatically decrease this value in response to events that negatively affect the system’s ability to perform NAT at the required capacity.

A NAT group with a higher health value becomes active.

Table: Activity states at equal health shows the resulting activity states when the paired NAT groups have equal health values on both nodes. Preferred is a configuration parameter that influences the activity state of a pair of NAT groups with equal health values (a typical use case is load balancing per NAT group).

Table: Activity states at equal health
Node 1	Node 2	Active node	Comments
no preferred configured	no preferred configured	Whichever node becomes active first remains active	If both nodes become active simultaneously, the node with the highest system chassis MAC address becomes the controller node, which decides which node becomes active and which becomes standby based on the health and preference values. When the health and preference values are equal, the controller node does not preempt (trigger a switchover from) an already active node.
preferred configured	no preferred configured	Node 1	Node 1 always preempts Node 2 (if the health values are equal)
no preferred configured	preferred configured	Node 2	Node 2 always preempts Node 1 (if the health values are equal)
preferred configured	preferred configured	Whichever node becomes active first remains active	Same as for no preference on both nodes
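
The decision rules in the table can be sketched in a few lines of Python. This is an illustrative model only, not the system's implementation: the NatGroupView structure, the function names, and the field names are hypothetical. The source does not state which node the controller selects when health and preference are equal and neither node is active yet, so the sketch assumes the controller selects itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NatGroupView:
    """One node's view of a paired NAT group (hypothetical structure)."""
    node: str          # node name, for example "node1"
    health: int        # internally calculated health value
    preferred: bool    # True if the preferred parameter is configured
    chassis_mac: str   # system chassis MAC, for example "00:11:22:33:44:55"


def controller(a: NatGroupView, b: NatGroupView) -> str:
    """Return the node with the highest system chassis MAC address."""
    def mac_value(view: NatGroupView) -> int:
        return int(view.chassis_mac.replace(":", ""), 16)
    return a.node if mac_value(a) > mac_value(b) else b.node


def elect_active(a: NatGroupView, b: NatGroupView,
                 current_active: Optional[str] = None) -> str:
    """Return the name of the node whose NAT group should be active."""
    # A higher health value always wins, regardless of preference.
    if a.health != b.health:
        return a.node if a.health > b.health else b.node
    # Equal health: a node with preferred configured preempts one without.
    if a.preferred != b.preferred:
        return a.node if a.preferred else b.node
    # Equal health and equal preference: an already active node is not preempted.
    if current_active in (a.node, b.node):
        return current_active
    # ASSUMPTION: on a simultaneous start with a full tie, the controller
    # node selects itself. The source does not specify this outcome.
    return controller(a, b)
```

For example, with equal health values and preferred configured only on Node 1, elect_active returns Node 1 even if Node 2 is currently active, which matches the second row of the table.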

The health parameter is initially set to a value of 1000 under the following circumstances:

The number of active ISAs in a NAT group matches the configured value of the active-mda-limit configuration parameter.
There are no failures on the monitored ports.
There are no failures within the monitored operation groups.

The above circumstances imply that the system is fully operational with no failures that would affect NAT operation.

However, the health value can also be influenced by events that affect NAT operation but are not ISA-related failures, for example, unhealthy ports and paths that carry traffic in and out of the NAT node. Such events are explicitly tracked or monitored for the purpose of dynamically adjusting the health value and therefore influencing the activity of the NAT groups.

Stateful inter-chassis NAT redundancy protects against the following failures:

Nodal failure; if the active node fails, the standby node detects the failure through the absence of received keepalives.
ISA failure; a NAT group must have exactly the active-mda-limit number of ISAs that are operationally up to participate in stateful inter-chassis NAT redundancy. If the number of operational ISAs falls below the configured limit, then the health of the NAT group drops to 0.
Ports on the node can be monitored, and their operational state can trigger a change of the health value.
BFD sessions on the node can be monitored using the oper-groups and their state change can trigger a change of the health value.
VRRP instances under interface configurations can be monitored using the oper-groups and their state change can trigger a change of the health value.
SAPs on the node can be monitored using the oper-groups and their operational state can trigger a change of the health value.

Port and oper-group state changes influence the reachability of the NAT node and consequently affect network-wide NAT operation. If the port or path capacity in and out of the NAT node drops below a specific level, a switchover to a healthier NAT node may be needed.
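
The way these events combine into a health value can be illustrated with a minimal Python sketch. Only two values come from the text above: 1000 for a fully operational NAT group and 0 when the number of operational ISAs falls below active-mda-limit. The subtractive model, the per-object drop amounts passed in failed_drops, and the clamp at 0 are assumptions made for illustration, and the function and parameter names are hypothetical.

```python
from typing import Iterable

FULL_HEALTH = 1000  # health of a fully operational NAT group (from the text)


def nat_group_health(active_isas: int,
                     active_mda_limit: int,
                     failed_drops: Iterable[int] = ()) -> int:
    """Return an illustrative health value for a NAT group.

    failed_drops holds one hypothetical decrement for each monitored port
    or oper-group that is currently down (ASSUMPTION: the amounts and the
    subtractive model are not taken from the source).
    """
    # Too few operational ISAs: the group cannot take part in redundancy.
    if active_isas < active_mda_limit:
        return 0
    # ASSUMPTION: each monitored failure subtracts its drop; floor at 0.
    return max(0, FULL_HEALTH - sum(failed_drops))
```

Under these assumptions, a group with all ISAs up and two monitored failures worth 100 and 50 has a health of 850, so a peer that is still at 1000 becomes active.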

Port states can be tracked or monitored on the private side (inside) and on the public side (outside) of NAT.

Oper-groups are constructs that track the states of BFD-enabled interfaces, SAPs, and VRRP instances.

BFD sessions targeted to the next hop can traverse intermediate Layer 2 nodes and can have longer reach than port tracking.

Another benefit of monitoring ports and paths is that it can help reduce the amount of traffic on the inter-chassis communication link (ICL) if the active node loses its direct connection to the node downstream or upstream from it. The ICL must always be present for synchronization purposes. However, this link does not need to be designed for the heavy traffic loads over extended periods of time that would occur if traffic-bearing ports were not colocated with the active node. Instead, the link carries such traffic only during the shorter transient periods caused by switchovers.