The approach to handling UPDATE message errors has evolved in the past couple of years. The original BGP specification called for all UPDATE message errors to be handled the same way: send a NOTIFICATION message to the peer and immediately close the BGP session. This error handling approach was motivated by the goal of ensuring protocol ‟correctness” above all else, but it ignored several important points.
Not all UPDATE message errors have the same severity. If the NLRI cannot be extracted and parsed from an UPDATE message, this is indeed a ‟critical” error. Other errors, such as incorrect attribute flag settings, missing mandatory path attributes, and incorrect next-hop length or format, can be considered ‟non-critical” and handled differently.
Session resets are extremely costly in terms of their impact on the stability and performance of the network. For many types of UPDATE message errors, a session reset does not solve the problem because the root cause (for example, a software error, hardware error, or misconfiguration) remains. If a session reset is absolutely necessary, the operator should have some control over its timing.
Some degree of protocol ‟incorrectness” is tolerable for a short period of time as long as the network operator is fully aware of the issue. In this context, ‟incorrectness” typically means a BGP RIB inconsistency between routers in the same AS. Such inconsistency has become less and less of an issue over time as edge-to-edge tunneling of IP traffic (for example, BGP shortcuts, IP VPN) has reduced the number of deployments where IP traffic is forwarded hop-by-hop.
In recognition of these points and the general trend toward more flexibility in BGP error handling, SR OS supports a BGP configuration option called update-fault-tolerance that allows the operator to decide whether the router applies new or legacy error handling procedures to UPDATE message errors. If update-fault-tolerance is configured, non-critical errors as described above are handled using the ‟treat-as-withdraw” or ‟attribute-discard” approach to error handling; neither approach causes a session reset. If update-fault-tolerance is not configured, legacy procedures continue to apply and all errors (critical and non-critical) trigger a session reset.
If the update-fault-tolerance command was previously configured and a non-critical error was already triggered, the BGP session is still reset when the operator configures no update-fault-tolerance.
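The decision logic described above can be summarized in a short sketch. This is an illustrative Python model only, not SR OS code: the names Severity, Action, handle_update_error, and reset_on_removal are hypothetical. The sketch assumes that critical errors reset the session whether or not update-fault-tolerance is configured, and it does not model which non-critical errors map to ‟treat-as-withdraw” as opposed to ‟attribute-discard”, because the text does not specify that mapping.

```python
from enum import Enum, auto


class Severity(Enum):
    CRITICAL = auto()      # NLRI cannot be extracted and parsed
    NON_CRITICAL = auto()  # e.g. bad attribute flags, missing mandatory attribute


class Action(Enum):
    SESSION_RESET = auto()  # send NOTIFICATION and close the BGP session
    TOLERATE = auto()       # treat-as-withdraw or attribute-discard; session stays up


def handle_update_error(severity: Severity, update_fault_tolerance: bool) -> Action:
    """Pick the error handling action for one malformed UPDATE message."""
    # Assumption: critical errors always reset the session.
    if severity is Severity.CRITICAL:
        return Action.SESSION_RESET
    # Non-critical errors are tolerated only when the option is configured.
    if update_fault_tolerance:
        return Action.TOLERATE
    # Legacy procedures: every error triggers a session reset.
    return Action.SESSION_RESET


def reset_on_removal(non_critical_error_seen: bool) -> bool:
    """Model the note above: configuring 'no update-fault-tolerance' resets
    a session on which a non-critical error was already triggered."""
    return non_critical_error_seen
```

For example, handle_update_error(Severity.NON_CRITICAL, True) returns Action.TOLERATE, while the same error with the option not configured returns Action.SESSION_RESET.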