Border Gateway Protocol (BGP) is an inter-Autonomous System routing protocol. An Autonomous System (AS) is a set of routers managed and controlled by a common technical administration. BGP-speaking routers establish BGP sessions with other BGP-speaking routers and use these sessions to exchange BGP routes. A BGP route provides information about a network path that can reach an IP prefix or other type of destination. The path information in a BGP route includes the list of ASes that must be traversed to reach the route source; this allows inter-AS routing loops to be detected and avoided. Other path attributes that may be associated with a BGP route include the Local Preference, Origin, Next-Hop, Multi-Exit Discriminator (MED) and Communities. These path attributes can be used to implement complex routing policies.
The primary use of BGP was originally Internet IPv4 routing but multi-protocol extensions to BGP have greatly expanded its applicability. Now BGP is used for many purposes, including:
The next sections provide information about BGP sessions, BGP network design, BGP messages and BGP path attributes.
A BGP session is a TCP connection formed between two BGP routers over which BGP messages are exchanged. There are three types of BGP sessions: internal BGP (IBGP), external BGP (EBGP), and confederation external BGP (confed-EBGP).
An IBGP session is formed when the two BGP routers belong to the same Autonomous System. Routes received from an IBGP peer are not advertised to other IBGP peers unless the router is a route reflector. The two routers that form an IBGP session are usually not directly connected. Figure 17 shows an example of two Autonomous Systems that use BGP to exchange routes. In this example the router ALA-A forms IBGP sessions with ALA-B and ALA-C.
An EBGP session is formed when the two BGP routers belong to different Autonomous Systems. Routes received from an EBGP peer can be advertised to any other peer. The two routers that form an EBGP session are often directly connected but multi-hop EBGP sessions are also possible. When a route is advertised to an EBGP peer the Autonomous System number(s) of the advertising router are added to the AS Path attribute. In the example of Figure 17 the router ALA-A forms an EBGP session with ALA-D.
A confederation EBGP session is formed when the two BGP routers belong to different member AS of the same confederation. More details about BGP confederations are provided in the section titled BGP Confederations.
SR OS supports both statically configured and dynamic (unconfigured) BGP sessions. Dynamic sessions are supported by configuring one or more prefix commands in the dynamic-neighbor>match CLI context of a BGP group. Statically configured BGP sessions are configured using the neighbor command. This command accepts either an IPv4 or IPv6 address, which allows the session transport to be IPv4 or IPv6. By default, the router is the active side of TCP connections to statically configured remote peers, meaning that as soon as a session leaves the Idle state, the router attempts to set up an outgoing TCP connection to the remote neighbor in addition to listening on TCP port 179 for an incoming connection from the peer. If required, a statically configured BGP session can be configured for passive mode so that the router only listens for an incoming connection and does not attempt to set up the outgoing connection. The router always operates in passive mode with respect to its dynamic (unconfigured) sessions.
The source IP address used to set up the TCP connection to the statically configured or dynamic peer can be configured explicitly using the local-address command. If a local-address is not configured then the source IP address is determined as follows:
In addition, it is possible to configure the local address with the name of the router interface. To configure the BGP local address to use the router interface’s IP address information, the local-address command is used in conjunction with the router interface name (ip-int-name) parameter.
Configuring the router interface as the local address is available in both the config>router>bgp>group context and the config>router>bgp>group>neighbor context.
When the router interface is configured as the local address, BGP inherits the address from the interface as follows:
If the corresponding IPv4 or IPv6 address is not configured on the router interface, the BGP sessions that have this interface set as the local address are kept down until an interface address is configured on the router interface.
If the primary IPv4 or IPv6 address is changed on the router interface and that interface is being used as the local address for BGP, then BGP bounces the link. This removes all routes advertised using the previous address and starts advertising those routes again using the newly configured IP address.
A BGP session is in one of the following states at any given moment in time:
If a router suspects that its peer at the other end of an established session has experienced a complete failure of both its control and data planes the router should divert traffic away from the failed peer as quickly as possible in order to minimize traffic loss. There are various mechanisms that the router can use to detect such failures, including:
When any one of these mechanisms is triggered, the session immediately returns to the Idle state and a new session is attempted. Peer tracking, BFD and fast external failover are described in more detail in the following sections.
When peer tracking is enabled on a session, the neighbor IP address is tracked in the routing table. If a failure occurs and there is no longer any IP route matching the neighbor address, or else if the longest prefix match (LPM) route is rejected by the configurable peer-tracking-policy, then after a 1-second delay the session is taken down. By default, peer-tracking is disabled on all sessions. The default peer-tracking policy allows any type of route to match the neighbor IP address, except aggregate routes and LDP shortcut routes.
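The peer-tracking check described above can be sketched as a longest-prefix-match lookup followed by a policy test. This is a minimal illustration, not the SR OS implementation: the route table format, the route-type labels ("aggregate", "ldp-shortcut") and the function names are assumptions made for the example.

```python
import ipaddress

def default_policy(route):
    # Mirrors the default described in the text: any route type is acceptable
    # except aggregate routes and LDP shortcut routes (labels are hypothetical).
    return route[1] not in {"aggregate", "ldp-shortcut"}

def peer_reachable(neighbor, routes, policy=default_policy):
    """Return True if the LPM route for the neighbor exists and passes the policy.

    routes is a hypothetical route table: an iterable of (prefix, route_type).
    """
    addr = ipaddress.ip_address(neighbor)
    matches = []
    for prefix, route_type in routes:
        net = ipaddress.ip_network(prefix)
        if net.version == addr.version and addr in net:
            matches.append((net, route_type))
    if not matches:
        return False  # no route at all: the session would be taken down
    best = max(matches, key=lambda m: m[0].prefixlen)  # longest prefix wins
    return policy(best)
```

Note that only the longest match is evaluated: a less specific route that would pass the policy does not keep the session up if the most specific route is rejected.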
Peer tracking was introduced when BFD was not yet supported for peer failure detection. Now that BFD is available, peer-tracking has less value and is used less often.
Note: Peer tracking should be used with caution. Peer tracking can tear a session down even if the loss of connectivity turns out to be short-lived, for example while the IGP is re-converging. Next-hop tracking, which is always enabled, handles such temporary connectivity issues much more effectively.
SR OS also supports the option to set up an async-mode BFD session to a BGP neighbor so that failure of the BFD session can trigger immediate teardown of the BGP session. When BFD is enabled on a BGP session, a 1-hop or multi-hop BFD session is set up to the neighbor IP address and the BFD parameters come from the BFD configuration of the interface associated with the local-address; for multi-hop sessions this is typically the system interface. With a 10 ms transmit interval and a multiplier of 3, BFD can detect a peer failure in as little as 30 ms.
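The 30 ms figure is simply the transmit interval multiplied by the detection multiplier. A small worked sketch (the function name is illustrative only):

```python
def bfd_detection_time_ms(tx_interval_ms, multiplier):
    # A peer is declared down after `multiplier` consecutive intervals
    # pass without a BFD packet being received.
    return tx_interval_ms * multiplier

# The example from the text: 10 ms interval, multiplier 3.
fastest = bfd_detection_time_ms(10, 3)
```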
Fast external failover applies only to single-hop EBGP sessions. When fast external failover is enabled on a single-hop EBGP session and the interface associated with the session goes down the BGP session is immediately taken down as well, even if other mechanisms such as the hold-timer have not yet indicated a failure.
A BGP session reset can be very disruptive – each router participating in the failed session must delete the routes it received from its peer, recalculate new best paths, update forwarding tables (depending on the types of routes), and send route withdrawals and advertisements to other peers. It makes sense then that session resets should be avoided as much as possible and when a session reset cannot be avoided the disruption to the network should be minimized.
To support these objectives, the BGP implementation in SR OS supports two key features:
BGP HA refers to the capability of a router with redundant CPMs to keep established BGP sessions up whenever a planned or unplanned CPM switchover occurs. A planned CPM switchover can occur during In-Service Software Upgrade (ISSU). An unplanned CPM switchover can occur if there is an unexpected failure of the primary CPM.
BGP HA is always enabled on routers with redundant CPMs; it cannot be disabled. BGP HA keeps the standby CPM in-sync with the primary CPM, with respect to BGP and associated TCP state, so that the standby CPM is ready to take over for the primary CPM at any time. The primary CPM is responsible for building and sending the BGP messages to peers but the standby CPM reliably receives a copy of all outgoing UPDATE messages so that it has a synchronized view of the RIB-OUT.
Some BGP routers do not have redundant control plane processor modules or do not support BGP HA with the same quality or coverage as 7450 ESS, 7750 SR, or 7950 XRS routers. When dealing with such routers or certain error conditions, BGP graceful restart (GR) is a good option for minimizing the network disruption caused by a control plane reset.
BGP GR assumes that the router restarting its BGP sessions has the ability and architecture to continue packet forwarding throughout the control plane reset. If this is the case, then the peers of the restarting router act as helpers and “hide” the control plane reset from the rest of the network so that forwarding can continue uninterrupted. Forwarding based on stale routes and hiding the “staleness” from other routers is considered acceptable because the duration of the control plane outage is expected to be relatively short (a few minutes). For BGP GR to be used on a session, both routers must advertise the BGP GR capability during the OPEN message exchange; see the BGP Advertisement section for more details.
BGP GR is enabled on one or more BGP sessions by configuring the graceful-restart command in the global, group, or neighbor context. The command causes GR mode to be supported for the following active families:
Helper mode is activated when one of the following events affects an Established session:
As soon as the failure is detected, the helping 7450 ESS, 7750 SR, or 7950 XRS router marks all the routes received from the peer as stale and starts a restart timer. The stale state is not factored into the BGP decision process, and it is not made visible to other routers in the network. The restart timer derives its initial value from the Restart Time carried in the last GR capability of the peer. The default advertised Restart Time is 300 seconds, but it can be changed using the restart-time command.
When the restart timer expires, helping stops if the session is not yet re-established. If the session is re-established before the restart timer expires and the new GR capability from the restarting router indicates that the forwarding state has been preserved, then helping continues and the peers exchange routes per the normal procedure.
When each router has advertised all its routes for a specific address family, it sends an End-of-RIB marker (EOR) for the address family. The EOR is a minimal UPDATE message with no reachable or unreachable NLRI for the AFI/SAFI. When the helping router receives an EOR, it deletes all remaining stale routes of the AFI/SAFI that were not refreshed in the most recent set of UPDATE messages. The maximum amount of time that routes can remain stale (before being deleted if they are not refreshed) is configurable using the stale-routes-time command.
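The helper-side bookkeeping described above (mark routes stale on failure, refresh them as UPDATE messages arrive, purge the remainder on EOR) can be sketched as follows. The class and method names are hypothetical, and timers (restart timer, stale-routes-time) are deliberately left out.

```python
class GrHelper:
    """Toy model of GR helper state for one peer (timers omitted)."""

    def __init__(self, routes):
        # routes: dict mapping (afi_safi, prefix) -> path attributes
        self.routes = dict(routes)
        self.stale = set()

    def session_failed(self):
        # Every route learned from the peer is retained but marked stale.
        self.stale = set(self.routes)

    def update_received(self, afi_safi, prefix, attrs):
        # A re-advertised route replaces the stale copy and is no longer stale.
        key = (afi_safi, prefix)
        self.routes[key] = attrs
        self.stale.discard(key)

    def eor_received(self, afi_safi):
        # End-of-RIB: delete stale routes of this family that were not refreshed.
        for key in [k for k in self.stale if k[0] == afi_safi]:
            del self.routes[key]
            self.stale.discard(key)
```

Because the purge is keyed on the address family, stale routes of a family whose EOR has not yet arrived are left in place.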
Note: If a second reset occurs before GR has successfully completed, the router always aborts the GR helper process, regardless of the failure trigger.
SR OS supports Long-Lived Graceful Restart (LLGR). LLGR is supported for the same address families as normal GR, as described in BGP Graceful Restart.
The LLGR procedures adhere to draft-uttaro-idr-bgp-persistence-03. LLGR is intended to handle more serious and longer-term outages than ordinary GR.
SR OS routers support LLGR in the context of both the restarting router (which experienced a restart or failure) and the helper or receiving router (which is a peer of the failed router). Both functionalities are enabled and disabled at the same time by adding the long-lived command under a graceful-restart configuration context.
When long-lived is applied to a session (and capability negotiation is not disabled), the OPEN message sent to the peer includes both the GR capability and the LLGR capability. Both capabilities list the same set of AFI/SAFI.
If a BGP session protected by LLGR goes down due to a restart or failure of the peer, then the SR OS router activates GR+LLGR helper mode for all the protected AFI/SAFI. In GR+LLGR helper mode, the received routes of a particular AFI/SAFI are retained as stale routes for a maximum duration of:
restart-time + LLGR-stale-time
where:
While the restart-timer is running, the SR OS router acts in the normal GR helper role. When the restart-timer elapses, the LLGR phase begins. When LLGR starts, the following occur.
LLGR ends for a particular AFI/SAFI when the LLGR-stale-time reaches zero. At that time, all remaining stale routes of the AFI/SAFI are deleted. The LLGR-stale-time is not stopped by re-establishment of the session with the failed peer; it continues until the EoR marker is received for the AFI/SAFI.
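The two helper phases and the maximum retention period can be illustrated with a small sketch. It assumes the session never comes back up (so nothing is refreshed or deleted early), and the function names and the example timer values are illustrative only.

```python
def max_stale_retention_s(restart_time_s, llgr_stale_time_s):
    # Upper bound on how long a stale route of one AFI/SAFI can be kept.
    return restart_time_s + llgr_stale_time_s

def helper_phase(elapsed_s, restart_time_s, llgr_stale_time_s):
    """Phase of GR+LLGR helper mode `elapsed_s` seconds after the failure."""
    if elapsed_s < restart_time_s:
        return "gr"        # normal GR helper role while the restart timer runs
    if elapsed_s < max_stale_retention_s(restart_time_s, llgr_stale_time_s):
        return "llgr"      # LLGR phase while the LLGR-stale-time counts down
    return "expired"       # remaining stale routes are deleted
```

For example, with a restart-time of 300 seconds and a hypothetical LLGR-stale-time of 3600 seconds, stale routes are retained for at most 3900 seconds.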
Stale routes may be deleted prior to expiration of the LLGR-stale-time. If the session with the failed peer comes back up and one of the following is true, then the stale routes should be deleted immediately:
When a router running SR OS Release 15.0.R4 or later receives a BGP route of any AFI/SAFI, with the LLGR_STALE community, the decision process considers the route less preferred than any valid, non-stale LLGR route for that NLRI. This logic applies even if the router is not configured as long-lived. If a route with an LLGR_STALE community is selected as the best path, then it is advertised to peers according to the configuration of the advertise-stale-to-all-neighbors command; if this command is absent (or the long-lived context is absent), then the route is advertised only to peers that advertised the LLGR capability.
The operation of a network can be compromised if an unauthorized system is able to form or hijack a BGP session and inject control packets by falsely representing itself as a valid neighbor. This risk can be mitigated by enabling TCP MD5 authentication on one or more of the sessions. When TCP MD5 authentication is enabled on a session every TCP segment exchanged with the peer includes a TCP option (19) containing a 16-byte MD5 digest of the segment (more specifically the TCP/IP pseudo-header, TCP header and TCP data). The MD5 digest is generated and validated using an authentication key that must be known to both sides. If the received digest value is different from the locally computed one then the TCP segment is dropped, thereby protecting the router from spoofed TCP segments.
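As an illustration of what the digest covers, the sketch below follows the input order given in RFC 2385: TCP pseudo-header, the base TCP header without options and with the checksum field zeroed, the segment data, and finally the key. It handles IPv4 only, and the function names are hypothetical; in practice the digest is computed by the TCP stack, not by application code.

```python
import hashlib
import hmac
import socket
import struct

def tcp_md5_digest(src_ip, dst_ip, tcp_header, payload, key):
    """Compute the 16-byte TCP MD5 digest for an IPv4 segment.

    tcp_header is the raw TCP header (options may follow the first 20 bytes).
    """
    header_len = (tcp_header[12] >> 4) * 4          # data offset, in bytes
    segment_len = header_len + len(payload)
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 6, segment_len))  # zero, protocol=TCP, length
    # Base header only (no options), with the checksum field treated as zero.
    base = tcp_header[:16] + b"\x00\x00" + tcp_header[18:20]
    return hashlib.md5(pseudo + base + payload + key).digest()

def segment_is_authentic(received_digest, src_ip, dst_ip, tcp_header, payload, key):
    # A mismatch means the segment is dropped.
    expected = tcp_md5_digest(src_ip, dst_ip, tcp_header, payload, key)
    return hmac.compare_digest(received_digest, expected)
```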
The TTL security mechanism (GTSM) relies on a simple concept to protect BGP infrastructure from spoofed IP packets. It recognizes the fact that the vast majority of EBGP sessions are established between directly-connected routers and therefore the IP TTL values in packets belonging to these sessions should have predictable values. If an incoming packet does not have the expected IP TTL value it is possible that it is coming from an unauthorized and potentially harmful source.
TTL security is enabled using the ttl-security command. This command requires a minimum TTL value to be specified. When TTL security is enabled on a BGP session the IP TTL values in packets that are supposedly coming from the peer are compared (in hardware) to the configured minimum value and if there is a discrepancy the packet is discarded and a log is generated. TTL security is used most often on single-hop EBGP sessions but it can be used on multihop EBGP and IBGP sessions as well.
To enable TTL security on a single-hop EBGP session, configure ttl-security and multihop to a value of 255. To enable TTL security on a multihop EBGP session, configure ttl-security and multihop to match the expected TTL of (255 - hop count). The TTL value for both EBGP peers must be manually configured to the same value, as there is no TTL negotiation.
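The TTL arithmetic can be sketched as below. It assumes the peer originates packets with a TTL of 255 and that "hop count" means the number of routers that forward (and therefore decrement) the packet, so a directly connected peer has a hop count of 0. The function names are illustrative.

```python
def min_ttl_for(hop_count):
    # Expected TTL on arrival when the peer sends with TTL 255.
    return 255 - hop_count

def accept_packet(received_ttl, min_ttl):
    # Packets arriving with a TTL below the configured minimum are discarded.
    return received_ttl >= min_ttl
```

A spoofed packet injected from several hops away cannot arrive with a TTL of 255, because every forwarding router decrements the value; this is why the single-hop case is the strongest form of the check.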
Note: IP packets sent to an IBGP peer are originated with an IP TTL value of 64. IP packets to an EBGP peer are originated with an IP TTL value of 1, except if multihop is configured; in that case, the TTL value is taken from the multihop command.
When the base router has a neighbor identified by an IPv4 address, and therefore the transport of the BGP session uses IPv4 TCP, all MP-BGP address families available in SR OS are supported by that session.
When the base router has a neighbor identified by an IPv6 address, and therefore the transport of the BGP session uses IPv6 TCP, the following MP-BGP address families are supported:
When a VPRN has a neighbor identified by an IPv4 address, and therefore the transport is IPv4 TCP, the following MP-BGP address families are supported:
When a VPRN has a neighbor identified by an IPv6 address, and therefore the transport is IPv6 TCP, the following MP-BGP address families are supported:
In SR OS, every neighbor (and hence BGP session) is configured under a group. A group is a CLI construct that saves configuration effort when multiple peers have a similar configuration; in this situation the common configuration commands can be configured once at the group level and need not be repeated for every neighbor. A single BGP instance can support many groups and each group can support many peers. Most SR OS commands that are available at the neighbor level are also available at the group level.
BGP assumes that all routers within an Autonomous System can reach destinations external to the Autonomous System using efficient, loop-free intra-AS forwarding paths. This generally requires that all the routers within the AS have a consistent view of the best path to every external destination. This is especially true when each BGP router in the AS makes its own forwarding decisions based on its own BGP routing table. The basic BGP specification does not store any intra-AS path information in the AS Path attribute so basic BGP has no way to detect routing loops within an AS that arise from inconsistent best path selections.
There are 3 solutions for dealing with the issues outlined above.
Create a confederation of autonomous systems. BGP confederations are described in the section titled BGP Confederations.
In a standard BGP configuration a BGP route learned from one IBGP peer is not re-advertised to another IBGP peer. This rule exists because of the assumption of a full IBGP mesh within the AS. As discussed in the previous section a full IBGP mesh imposes certain scaling challenges. BGP route reflection eliminates the need for a full IBGP mesh by allowing routers configured as route reflectors to re-advertise routes from one IBGP peer to another IBGP peer.
A route reflector provides route reflection service to IBGP peers called clients. Other IBGP peers of the RR are called non-clients. An RR and its client peers form a cluster. A large AS can be sub-divided into multiple clusters, each identified by a unique 32-bit cluster ID. Each cluster contains at least one route reflector which is responsible for redistributing routes to its clients. The clients within a cluster do not need to maintain a full IBGP mesh between each other; they only require IBGP sessions to the route reflector(s) in their cluster. (If the clients within a cluster are fully meshed consider using the disable-client-reflect functionality.) The non-clients in an AS must be fully meshed with each other.
Figure 19 depicts the same network as Figure 18 but with route reflectors deployed to eliminate the IBGP mesh between SR-B, SR-C, and SR-D. SR-A, configured as the route reflector, is responsible for reflecting routes to its clients SR-B, SR-C, and SR-D. SR-E and SR-F are non-clients of the route reflector. As a result, a full mesh of IBGP sessions must be maintained between SR-A, SR-E and SR-F.
A router becomes a route reflector whenever it has one or more client IBGP sessions. A client IBGP session is created with the cluster command, which also indicates the cluster ID of the client. Typical practice is to use the router ID as the cluster ID, but this is not necessary.
Basic route reflection operation (without Add-Path configured) can be summarized as follows:
The ORIGINATOR_ID and CLUSTER_LIST attributes allow BGP to detect the looping of a route within the AS. If any router receives a BGP route with an ORIGINATOR_ID attribute containing its own BGP identifier, the route is considered invalid. In addition, if a route reflector receives a BGP route with a CLUSTER_LIST attribute containing a locally configured cluster ID, the route is considered invalid. Invalid routes are not installed in the route table and not advertised to other BGP peers.
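The two loop checks can be sketched as a single validity test. The route is modeled as a plain dictionary with hypothetical keys; a router that is not a route reflector would pass an empty collection of cluster IDs.

```python
def reflected_route_is_valid(route, local_bgp_id, local_cluster_ids=()):
    """Apply the ORIGINATOR_ID and CLUSTER_LIST loop checks to a received route."""
    # Check 1: the route was originated by this router and has looped back.
    if route.get("originator_id") == local_bgp_id:
        return False
    # Check 2 (route reflectors): the route already passed through this cluster.
    if set(route.get("cluster_list", ())) & set(local_cluster_ids):
        return False
    return True
```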
BGP confederations are another alternative for avoiding a full mesh of BGP sessions inside an Autonomous System. A BGP confederation is a group of Autonomous Systems managed by a single technical administration that appear as a single AS to BGP routers outside the confederation; the single externally visible AS is called the confederation ID. Each AS in the group is called a member AS and the ASN of each member AS is visible only within the confederation. For this reason, member ASNs are often private ASNs.
Within a confederation, EBGP-type sessions can be set up between BGP routers in different member ASes. These confederation-EBGP sessions avoid the need for a full mesh between routers in different member ASes. Within each member AS the BGP routers must be fully meshed with IBGP sessions, or route reflectors must be used to ensure routing consistency.
In SR OS, a confederation EBGP session is formed when the ASN of the peer is different from the local ASN and the peer ASN appears as a member AS in the confederation command. The confederation command specifies the confederation ID and up to 15 member AS that are part of the confederation.
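The rule for choosing the session type can be expressed as a short decision function. This is a sketch of the logic as described in the text; the function name and the return labels are illustrative.

```python
def session_type(local_as, peer_as, confed_members=frozenset()):
    """Classify a BGP session from the local ASN, peer ASN and confederation members."""
    if peer_as == local_as:
        return "ibgp"
    if peer_as in confed_members:
        # Different ASN, but listed as a member AS in the confederation command.
        return "confed-ebgp"
    return "ebgp"
```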
When a route is advertised to a confederation-EBGP peer the advertising router prepends its local ASN, which is its member ASN, to a confederation-specific sub-element in the AS_PATH that is created if it does not already exist. The extensions to the AS_PATH are used for loop detection but they do not influence best path selection (that is, they do not increase the AS Path length used in the BGP decision process). The MED, NEXT_HOP and LOCAL_PREF attributes in the received route are propagated unchanged by default. The ORIGINATOR_ID and CLUSTER_LIST attributes are not included in routes to confed-EBGP peers.
When a route is advertised to an EBGP peer outside the confederation the advertising router removes all member AS elements from the AS_PATH and prepends its confederation ID rather than its local/member ASN.
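The AS_PATH handling in the two cases above can be sketched as follows. The AS_PATH is modeled as a list of (segment type, ASN list) pairs, which is an assumption made for the example; the path-length helper additionally counts an AS_SET as one hop, as in the base BGP specification.

```python
CONFED_TYPES = {"AS_CONFED_SEQUENCE", "AS_CONFED_SET"}

def _prepend(as_path, seg_type, asn):
    # Copy the path, then add `asn` to a leading segment of the given type,
    # creating that segment if it does not already exist.
    path = [(t, list(asns)) for t, asns in as_path]
    if path and path[0][0] == seg_type:
        path[0][1].insert(0, asn)
    else:
        path.insert(0, (seg_type, [asn]))
    return path

def advertise_to_confed_ebgp(as_path, member_asn):
    return _prepend(as_path, "AS_CONFED_SEQUENCE", member_asn)

def advertise_to_external_ebgp(as_path, confed_id):
    # Member AS elements are removed before the confederation ID is prepended.
    stripped = [(t, asns) for t, asns in as_path if t not in CONFED_TYPES]
    return _prepend(stripped, "AS_SEQUENCE", confed_id)

def path_length(as_path):
    # Confederation segments do not count toward the decision-process length.
    total = 0
    for seg_type, asns in as_path:
        if seg_type == "AS_SEQUENCE":
            total += len(asns)
        elif seg_type == "AS_SET":
            total += 1
    return total
```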
BGP protocol operation relies on the exchange of BGP messages between peers. 7450 ESS, 7750 SR, 7950 XRS, and most other routers, support the following message types: Open Message, Update Message, Keepalive Message, Notification Message, and Route Refresh Message.
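Every BGP message starts with the same fixed header defined in RFC 4271: a 16-byte marker of all ones, a 2-byte total length and a 1-byte type code. The sketch below uses that header to split a received TCP byte stream into messages; the function name is hypothetical and error handling is reduced to a single exception.

```python
import struct

MESSAGE_TYPES = {1: "OPEN", 2: "UPDATE", 3: "NOTIFICATION",
                 4: "KEEPALIVE", 5: "ROUTE-REFRESH"}
HEADER_LEN = 19      # marker (16) + length (2) + type (1)
MAX_LEN = 4096

def split_messages(stream):
    """Return (messages, leftover) where messages is a list of (type, body)."""
    messages, offset = [], 0
    while len(stream) - offset >= HEADER_LEN:
        marker, length, mtype = struct.unpack_from("!16sHB", stream, offset)
        if marker != b"\xff" * 16 or not HEADER_LEN <= length <= MAX_LEN:
            raise ValueError("malformed BGP message header")
        if len(stream) - offset < length:
            break    # message is incomplete; wait for more TCP data
        body = stream[offset + HEADER_LEN:offset + length]
        messages.append((MESSAGE_TYPES.get(mtype, "UNKNOWN"), body))
        offset += length
    return messages, stream[offset:]
```

The leftover bytes are returned because a read from the TCP socket can end in the middle of a message; the caller would prepend them to the next chunk of data.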
The minimum BGP message length is 19 bytes and the maximum is 4096 bytes. BGP messages appear as a stream of bytes to the underlying TCP transport layer, and so there is no direct association between a BGP message and a TCP segment. One TCP segment can include parts of one or more BGP messages. Immediately after session setup, the initial value for the maximum TCP segment size that can be sent toward a specific peer is the minimum of the following:
As time elapses, the maximum sending segment size can fall below the initial value if path MTU discovery (PMTUD) is active on the session. PMTUD lowers the segment size when ICMP unreachable or packet-too-big messages are received. These messages indicate that the IP MTU of a link on the path was too small to forward the unfragmentable packet; this IP MTU minus 40 (IPv4) or minus 60 (IPv6) bytes sets the new maximum segment size value.
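The adjustment can be worked through with a short sketch. The 40 and 60 byte figures correspond to an IP header without options or extension headers plus a 20-byte TCP header; the function name is illustrative, and the sketch assumes PMTUD only ever lowers the value.

```python
def mss_after_pmtud(current_mss, reported_ip_mtu, ipv6=False):
    # IPv4: 20-byte IP header + 20-byte TCP header; IPv6: 40 + 20.
    overhead = 60 if ipv6 else 40
    return min(current_mss, reported_ip_mtu - overhead)
```

For example, an IPv4 session with a segment size of 1460 bytes that learns of a 1400-byte link MTU drops to 1360 bytes.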
After a TCP connection is established between two BGP routers the first message sent by each one is an Open message. If the received Open message is acceptable a Keepalive message confirming the Open is sent back. (See BGP Session States for more details.) An Open message contains the following information:
Note: Changes to the configured hold-time trigger a session reset.
Note: A change of the router ID in the config>router>bgp context causes all BGP sessions to be reset immediately, while other changes resulting in a new BGP identifier only take effect after BGP is shut down and re-enabled.
If the AS number is changed at the router level (config>router) the new AS number is not used until the BGP instance is restarted either by administratively disabling and enabling the BGP instance or by rebooting the system with the new configuration.
On the other hand, if the AS number is changed in the BGP configuration (config>router>bgp), the effects are as follows:
Changing the confederation value on an active BGP instance does not restart the protocol. The change takes effect when the BGP protocol is (re)initialized.
BGP advertisement allows a BGP router to indicate to a peer, using the Capabilities optional parameter of the Open message, the features that it supports so that they can coordinate and use only the features that both support. Each capability in the optional parameter is TLV-encoded with a unique type code. SR OS supports the following capability codes:
Update messages are used to advertise and withdraw routes. An Update message provides the following information:
For fast routing convergence, as many NLRI as possible are packed into a single Update message. This requires identifying all the routes that share the same path attribute values.
After a session is established, each router sends periodic Keepalive messages to its peer to test that the peer is still alive and reachable. If no Keepalive or Update message is received from the peer for the negotiated hold-time duration, the session is terminated. The period between one Keepalive message and the next is 1/3 of the negotiated hold-time duration or the value configured with the keepalive command, whichever is less. If the active hold-time or keepalive interval is zero, Keepalive messages are not sent. The default hold-time is 90 seconds and the default keepalive interval is 30 seconds.
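The timer arithmetic can be sketched as below. The negotiated hold time is taken as the smaller of the two values proposed in the Open messages, as specified in RFC 4271; the function names are illustrative.

```python
def negotiated_hold_time(local_hold_s, peer_hold_s):
    # Each side proposes a hold time in its Open message; the smaller one is used.
    return min(local_hold_s, peer_hold_s)

def keepalive_interval(hold_time_s, configured_keepalive_s):
    """Return the Keepalive period in seconds, or None if Keepalives are not sent."""
    if hold_time_s == 0 or configured_keepalive_s == 0:
        return None
    return min(hold_time_s / 3, configured_keepalive_s)
```

With the defaults (hold-time 90, keepalive 30) both terms give 30 seconds. If the peer proposes a 30-second hold time, the interval falls to 10 seconds even though the keepalive command is unchanged.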
A peer (reachability) failure is often detected through faster mechanisms than hold-timer expiry, as explained in Detecting BGP Session Failures.
When a non-recoverable error related to a particular session occurs a Notification message is sent to the peer and the session is terminated (or restarted if GR is enabled for this scenario; see BGP Graceful Restart for more details). The Notification message provides the following information:
The approach to handling Update message errors has evolved in the past couple of years. The original BGP protocol specification called for all UPDATE message errors to be handled the same way — send a NOTIFICATION to the peer and immediately close the BGP session. This error handling approach was motivated by the goal to ensure protocol “correctness” above all else. But it ignored several important points:
In recognition of these points and the general trend towards more flexibility in BGP error handling, SR OS supports a BGP configuration option called update-fault-tolerance that allows the operator to decide whether the router should apply new or legacy error handling procedures to UPDATE message errors. If update-fault-tolerance is configured, then non-critical errors as described above are handled using the “treat-as-withdraw” or “attribute-discard” approaches to error handling; these approaches do not cause a session reset. If update-fault-tolerance is not configured then legacy procedures continue to apply and all errors (critical and non-critical) trigger a session reset.
If the update-fault-tolerance command was previously configured and a non-critical error was already triggered, the BGP session is still reset when the operator configures no update-fault-tolerance.
A BGP router can send a Route Refresh message to its peer only if both have advertised the route refresh capability (code 2). The Route Refresh message is a request for the peer to re-send all or some of its routes associated with a particular pair of AFI/SAFI values. AFI/SAFI values are the same ones used in the MP-BGP capability (see the section titled Multi-Protocol BGP Attributes).
7450, 7750, and 7950 routers only send Route Refresh messages for AFI/SAFI associated with VPN routes that carry Route Target extended communities, such as VPN-IPv4, VPN-IPv6, L2-VPN, MVPN-IPv4 and MVPN-IPv6 routes. By default, routes of these types are discarded if, at the time they are received, there is no VPN that imports any of the route targets they carry. If at a later time a VPN is added or reconfigured (in terms of the route targets that it imports), a Route Refresh message is sent to all relevant peers, so that previously discarded routes can be relearned.
Note: Route Refresh messages are not sent for VPN-IPv4 and VPN-IPv6 routes if mp-bgp-keep is configured; in this situation received VPN-IP routes are kept in the RIB-IN regardless of whether or not they match a VRF import policy.
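The relationship between route-target filtering and Route Refresh can be sketched as two small predicates. This is a simplification: route targets are modeled as plain strings, the function names are hypothetical, and the mp_bgp_keep flag is only meaningful for VPN-IPv4 and VPN-IPv6 routes.

```python
def keep_vpn_route(route_targets, imported_targets, mp_bgp_keep=False):
    # A received VPN route is kept if some VPN imports one of its route
    # targets, or unconditionally when mp-bgp-keep is configured.
    return mp_bgp_keep or not set(route_targets).isdisjoint(imported_targets)

def refresh_needed(old_imports, new_imports, mp_bgp_keep=False):
    # A change in the set of imported route targets means previously
    # discarded routes may now be wanted, so they must be requested again.
    return not mp_bgp_keep and set(old_imports) != set(new_imports)
```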
Path attributes are fundamental to BGP. A BGP route for a particular NLRI is distinguished from other BGP routes for the same NLRI by its set of path attributes. Each path attribute describes some property of the path and is encoded as a TLV in the Path Attributes field of the Update message. The type field of the TLV identifies the path attribute and the value field carries data specific to the attribute type. There are 4 different categories of path attributes:
SR OS supports the following path attributes, which are described in detail in upcoming sections:
The ORIGIN path attribute indicates the origin of the path information. There are three supported values:
When a router originates a VPN-IP prefix (from a non-BGP route), it sets the value of the Origin attribute to IGP. When a router originates a BGP route for an IP prefix by exporting a non-BGP route from the routing table, it sets the value of the Origin attribute to Incomplete. Route policies (BGP import and export) can be used to change the Origin value.
The AS_PATH attribute provides the list of Autonomous Systems through which the routing information has passed. The AS_PATH attribute is composed of segments. There can be up to 4 different types of segments in an AS_PATH attribute: AS_SET, AS_SEQUENCE, AS_CONFED_SET and AS_CONFED_SEQUENCE. The AS_SET and AS_CONFED_SET segment types result from route aggregation. AS_CONFED_SEQUENCE contains an ordered list of member AS through which the route has passed inside a confederation. AS_SEQUENCE contains an ordered list of AS (including confederation IDs) through which the route has passed on its way to the local AS/confederation.
The AS numbers in the AS_PATH attribute are all 2-byte values or all 4-byte values (if the 4-octet ASN capability was announced by both peers).
A BGP router always prepends its AS number to the AS_PATH attribute when advertising a route to an EBGP peer. The specific details for a 7450, 7750, or 7950 router are described below.
BGP import policies can be used to prepend an AS number multiple times to the AS_PATH, whether the route is received from an IBGP, EBGP or confederation EBGP peer. The AS path prepend action is also supported in BGP export policies applied to these types of peers, regardless of whether the route is locally originated or not. AS path prepending in export policies occurs before the global and/or local ASes (if applicable) are added to the AS_PATH.
When a BGP router receives a route containing one of its own Autonomous System numbers (local or global or confederation ID) in the AS_PATH, the route is normally considered invalid for reason of an AS path loop. However, SR OS provides a loop-detect command that allows this check to be bypassed. If it is known that advertising certain routes to an EBGP peer results in an AS path loop condition and yet there is no loop (assured by other mechanisms, such as the Site of Origin (SOO) extended community), then as-override can be configured on the advertising router instead of disabling loop detection on the receiving router. The as-override command replaces all occurrences of the peer AS in the AS_PATH with the advertising router’s local AS.
The AS Override feature can be used in VPRN scenarios where a customer is running BGP as the PE-CE protocol and some or all of the CE locations are in the same Autonomous System (AS). With normal BGP, two sites in the same AS would not be able to reach each other directly since there is an apparent loop in the AS Path.
When as-override is configured on a PE-CE EBGP session the PE rewrites the customer ASN in the AS Path with the VPRN AS number as the route is advertised to the CE.
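The interaction between AS path loop detection and as-override can be sketched as follows. This is a simplified model with hypothetical function names, using a flat list for the AS_PATH; the AS numbers (customer AS 64512, VPRN AS 64500) are illustrative assumptions.

```python
def has_as_loop(as_path, own_asns):
    """Receiver-side check: True if any of the router's own ASNs is in the path."""
    return any(asn in own_asns for asn in as_path)


def as_override(as_path, peer_as, local_as):
    """Replace every occurrence of the peer AS with the advertising router's AS."""
    return [local_as if asn == peer_as else asn for asn in as_path]


CUSTOMER_AS = 64512   # both CE sites use this AS (assumed value)
VPRN_AS = 64500       # AS of the VPRN on the PE (assumed value)

# Route originated by CE site 1, as seen on the PE before advertising to site 2.
path_from_site1 = [CUSTOMER_AS]

# Without as-override, the CE at site 2 sees its own AS and rejects the route.
plain = [VPRN_AS] + path_from_site1
print(plain, has_as_loop(plain, {CUSTOMER_AS}))

# With as-override, the customer AS is rewritten before the PE adds its own AS.
rewritten = [VPRN_AS] + as_override(path_from_site1, CUSTOMER_AS, VPRN_AS)
print(rewritten, has_as_loop(rewritten, {CUSTOMER_AS}))
```

Because every customer AS entry is replaced rather than removed, the AS path length is the same with or without the override.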
The description in the previous section does not fully explain the reasons for using local-as. This BGP feature facilitates the process of changing the ASN of all the routers in a network from one number to another. This may be necessary if one network operator merges with or acquires another network operator and the two BGP networks must be consolidated into one Autonomous System.
For example, suppose the operator of the ASN 64500 network merges with the operator of the ASN 64501 network and the new merged entity decides to renumber ASN 64501 routers as ASN 64500 routers, so that the entire network can be managed as one Autonomous System. The migration can be carried out using the following sequence of steps:
This migration procedure has several advantages. First, customers, settlement-free peers and transit providers of the previous ASN 64501 network still perceive that they are peering with ASN 64501 and can delay switching to ASN 64500 until the time is convenient for them. Second, the AS path lengths of the routes exchanged with the EBGP peers are unchanged from before so that best path selections are preserved.
When BGP was developed, it was assumed that 16-bit (2-octet) ASNs would be sufficient for global Internet routing. In theory, a 16-bit ASN allows for 65,536 unique autonomous systems, but some of the values are reserved (0 and 64000-65535). Of the assignable space, less than 10% remains available. When a new AS number is needed, it is now simpler to obtain a 4-octet AS number. 4-octet AS numbers have been available since 2006. A 32-bit (4-octet) ASN allows for 4,294,967,296 unique values (some of which are, again, reserved).
When 4-octet AS numbers became available it was recognized that not all routers would immediately support the ability to parse 4-octet AS numbers in BGP messages so two optional transitive attributes called AS4_PATH and AS4_AGGREGATOR were introduced to allow a gradual migration.
A BGP router that supports 4-octet AS numbers advertises this capability in its OPEN message; the capability information includes the AS number of the sending BGP router, encoded using 4 bytes (recall the ASN field in the OPEN message is limited to 2 bytes). By default, OPEN messages sent by 7450, 7750, or 7950 routers always include the 4-octet ASN capability, but this can be changed using the disable-4byte-asn command.
If a BGP router and its peer have both announced the 4-octet ASN capability, then the AS numbers in the AS_PATH and AGGREGATOR attributes are always encoded as 4-byte values in the UPDATE messages they send to each other. These UPDATE messages should not contain the AS4_PATH and AS4_AGGREGATOR path attributes.
If one of the routers involved in a session announces the 4-octet ASN capability and the other one does not, then the AS numbers in the AS_PATH and AGGREGATOR attributes are encoded as 2-byte values in the UPDATE messages they send to each other.
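The encoding rule in the two preceding paragraphs can be sketched as a small helper. The function names are hypothetical. The substitution of the reserved value 23456 (AS_TRANS) for AS numbers that do not fit in two bytes comes from RFC 6793 and is an assumption added here for completeness, not something stated in the text above.

```python
import struct

AS_TRANS = 23456  # RFC 6793 placeholder for ASNs that need more than 2 bytes


def asn_width(local_announced, peer_announced):
    """Return the per-ASN width in bytes used in AS_PATH and AGGREGATOR."""
    return 4 if (local_announced and peer_announced) else 2


def encode_asns(asns, width):
    """Pack a list of AS numbers in network byte order at the given width."""
    if width == 4:
        return struct.pack(f"!{len(asns)}I", *asns)
    # 2-byte encoding: large ASNs are represented by AS_TRANS (assumption
    # taken from RFC 6793).
    mapped = [asn if asn <= 0xFFFF else AS_TRANS for asn in asns]
    return struct.pack(f"!{len(mapped)}H", *mapped)


path = [64500, 4200000000]
print(len(encode_asns(path, asn_width(True, True))))    # 4 bytes per ASN
print(len(encode_asns(path, asn_width(True, False))))   # 2 bytes per ASN
```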
When a 7450, 7750, or 7950 router advertises a route to a peer that did not announce the 4-octet ASN capability:
When a 7450, 7750, or 7950 router receives a route with an AS4_PATH attribute it attempts to reconstruct the full AS path from the AS4_PATH and AS_PATH attributes, regardless of whether disable-4byte-asn is configured or not. The reconstructed path is the AS path displayed in BGP show commands. If the length of the received AS4_PATH is N and the length of the received AS_PATH is N+t, then the reconstructed AS path contains the t leading elements of the AS_PATH followed by all the elements in the AS4_PATH.
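The reconstruction rule can be written out as a short sketch. Both attributes are modeled as flat lists of AS numbers and the function name is hypothetical. The fallback for an AS_PATH that is shorter than the AS4_PATH follows RFC 6793 and is an assumption beyond what the paragraph above states.

```python
def reconstruct_as_path(as_path, as4_path):
    """Rebuild the full AS path from received AS_PATH and AS4_PATH lists.

    With len(as4_path) == N and len(as_path) == N + t, the result is the
    t leading elements of as_path followed by all elements of as4_path.
    """
    t = len(as_path) - len(as4_path)
    if t < 0:
        # Assumption from RFC 6793: an AS4_PATH longer than the AS_PATH
        # is ignored and the AS_PATH is used as received.
        return list(as_path)
    return list(as_path[:t]) + list(as4_path)


# Two 4-octet ASNs appear as 23456 placeholders in the 2-byte AS_PATH.
received_as_path = [64500, 23456, 23456]
received_as4_path = [4200000001, 4200000002]
print(reconstruct_as_path(received_as_path, received_as4_path))
```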
The NEXT_HOP attribute indicates the IPv4 address of the BGP router that is the next-hop to reach the IPv4 prefixes in the NLRI field. If the Update message is advertising routes other than IPv4 unicast routes the next-hop of these routes is encoded in the MP_REACH_NLRI attribute; see Multi-Protocol BGP Attributes for more details.
The rules for deciding what next-hop address types to accept in a received BGP route and what next-hop address types to advertise as a BGP next-hop are address family dependent. The following sections summarize the key details.
By default, IPv4 routes are advertised with IPv4 next-hops but on IPv6-TCP transport sessions they can be advertised with IPv6 next-hops if the advertise-ipv6-next-hops command (with the IPv4 option) applies to the session. In order to receive IPv4 routes with IPv6 next-hop addresses from a peer, the extended-nh-encoding command (with the IPv4 option) must be applied to the session. This advertises the corresponding RFC 5549, Advertising IPv4 Network Layer Reachability Information with an IPv6 Next Hop, capability to the peer.
Whenever next-hop-self applies to an IPv4 route, the next hop is set as follows:
When an IPv4 BGP route is advertised to an EBGP peer, next-hop-self always applies except if the third-party-nexthop command is applied. Configuring third-party-nexthop allows an IPv4 route received from one EBGP peer to be advertised to another EBGP peer that is in the same IP subnet with an unchanged BGP next-hop.
When an IPv4 BGP route is re-advertised to an IBGP or confederation EBGP peer, the advertising router does not modify the BGP next-hop unless one of the following applies:
When an IPv4 BGP route is locally originated and advertised to an IBGP or confederation EBGP peer, the BGP next-hop is, by default, copied from the next hop of the route that was imported into BGP, with certain exceptions (for example, black-hole next-hop). When a static route with indirect next hop is re-advertised as a BGP route, the BGP next-hop is a copy of the indirect address. However, with route table import policies, BGP can be instructed to take the resolved next hop of the static route as the BGP next-hop address.
SR OS routers never send or receive IPv6 routes with 32-bit IPv4 next-hop addresses.
When an IPv6 BGP route is advertised to an EBGP peer, next-hop-self always applies (except if the third-party-nexthop command is applied, as described in the following note). Next-hop-self results in one of the following outcomes:
Note: Configuring third-party-nexthop allows an IPv6 route received from one EBGP peer to be advertised to another EBGP peer that is in the same IP subnet with an unchanged BGP next-hop.
When an IPv6 BGP route is re-advertised to an IBGP or confederation-EBGP peer, the advertising router does not modify the BGP next-hop by default; however, this can be changed as follows:
When an IPv6 BGP route is locally originated and advertised to an IBGP or confederation-EBGP peer, the BGP next-hop is, by default, copied from the next-hop of the route that was imported into BGP, with certain exceptions (for example, black-hole next-hop). When a static route with indirect next-hop is re-advertised as a BGP route, the BGP next-hop is a copy of the indirect address. However, with route table import policies, BGP can be instructed to take the resolved next-hop of the static route as the BGP next-hop address.
SR OS routers can always send and receive VPN-IPv4 routes with IPv4 next-hops. They can also be configured (using the extended-nh-encoding command) to receive VPN-IPv4 routes with IPv6 next-hop addresses from selected BGP peers by signaling the corresponding Extended NH Encoding BGP capability to those peers during session setup. If the capability is not advertised to a peer, then such routes are not accepted from that peer. Also, if the SR OS router does not receive an Extended NH Encoding capability advertisement for [NLRI AFI=1, NLRI SAFI=128, next-hop AFI=2] from a peer, then it does not advertise VPN-IPv4 routes with IPv6 next-hops to that peer.
When a VPN-IPv4 BGP route is advertised to an EBGP peer, next-hop-self applies if enable-inter-as-vpn is configured; otherwise there is no change to the next-hop. Next-hop-self results in one of the following outcomes:
If enable-inter-as-vpn is configured and a VPN-IPv4 BGP route is received from an EBGP peer and re-advertised to an IBGP or confederation-EBGP peer, next-hop-self applies. (If enable-inter-as-vpn is not configured, the next-hop may be changed with the next-hop-self command; however, this is strongly discouraged because it can result in a change of the next-hop without a change in the VPN label.) Next-hop-self results in one of the following outcomes:
When a VPN-IPv4 BGP route is reflected from one IBGP peer to another IBGP peer, the RR does not modify the next-hop by default. However, if the next-hop-self command is applied to the IBGP peer receiving the route and enable-rr-vpn-forwarding is configured then this combination of commands has one of the following outcomes:
When a VPN-IPv4 BGP route is reflected from one IBGP peer to another IBGP peer and enable-rr-vpn-forwarding is configured and the VPN-IPv4 route is matched and accepted by an export policy entry with a next-hop <ip-address> action, this changes the BGP next-hop of the matched routes to <ip-address>, except if <ip-address> is an IPv6 address and the receiving IBGP peer did not advertise an extended NH encoding capability with (NLRI AFI=1, NLRI SAFI=128, next-hop AFI=2) or, in the configuration of the local router, the session is not associated with an advertise-ipv6-next-hops vpn-ipv4 command. In this case, the route is treated as though it was rejected by the policy entry.
SR OS routers never send or receive VPN-IPv6 routes with 32-bit IPv4 next-hop addresses.
When a VPN-IPv6 BGP route is advertised to an EBGP peer, next-hop-self applies if enable-inter-as-vpn is configured. Otherwise, there is no change to the next-hop. Next-hop-self results in one of the following outcomes:
If enable-inter-as-vpn is configured and a VPN-IPv6 BGP route is received from an EBGP peer and re-advertised to an IBGP or confederation-EBGP peer, next-hop-self applies. (If enable-inter-as-vpn is not configured, the next-hop may be changed with the next-hop-self command; however, this is strongly discouraged because it can result in a change of the next-hop without a change in the VPN label.) Next-hop-self results in one of the following outcomes:
When a VPN-IPv6 BGP route is reflected from one IBGP peer to another IBGP peer, the RR does not modify the next-hop by default. However, if the next-hop-self command is applied to the IBGP peer receiving the route and enable-rr-vpn-forwarding is configured, then this combination of commands has one of the following outcomes:
When a VPN-IPv6 BGP route is reflected from one IBGP peer to another IBGP peer and enable-rr-vpn-forwarding is configured and the VPN-IPv6 route is matched and accepted by an export policy entry with a next-hop <ip-address> action, this changes the BGP next-hop of the matched routes to <ip-address> if it is specified as a 128-bit IPv6 address or to an IPv4-mapped IPv6 address encoding <ip-address> if it is specified as a 32-bit IPv4 address.
SR OS routers can always send and receive label-IPv4 routes with IPv4 next-hops. They can also be configured (using the extended-nh-encoding command) to receive label-IPv4 routes with IPv6 next-hop addresses from selected BGP peers by signaling the corresponding Extended NH Encoding BGP capability to those peers during session setup. If the capability is not advertised to a peer, then such routes are not accepted from that peer. Also, if the SR OS router does not receive an Extended NH Encoding capability advertisement for [NLRI AFI=1, NLRI SAFI=4, next-hop AFI=2] from a peer, then it does not advertise label-IPv4 routes with IPv6 next-hops to that peer.
When a label-IPv4 BGP route is advertised to an EBGP peer, next-hop-self applies unless the EBGP session has next-hop-unchanged enabled for the label-ipv4 address family. Next-hop-self results in one of the following outcomes:
When a label-IPv4 BGP route is received from an EBGP peer and re-advertised to an IBGP or confederation-EBGP peer, next-hop-self applies unless the IBGP or confederation-EBGP session has next-hop-unchanged enabled for the label-ipv4 address family. Next-hop-self results in one of the following outcomes:
When a label-IPv4 BGP route is reflected from one IBGP peer to another IBGP peer, the RR does not modify the next-hop by default. However, if the next-hop-self command is applied to the IBGP peer receiving the route, then this results in one of the following outcomes:
When a label-IPv4 BGP route is reflected from one IBGP peer to another IBGP peer and the label-IPv4 route is matched and accepted by an export policy entry with a next-hop <ip-address> action, this changes the BGP next-hop of the matched routes to <ip-address>, except if <ip-address> is an IPv6 address and the receiving IBGP peer did not advertise an extended NH encoding capability with (NLRI AFI=1, NLRI SAFI=4, next-hop AFI=2) or, in the configuration of the local router, the session is not associated with an advertise-ipv6-next-hops label-ipv4 command. In this case, the route is treated as though it was rejected by the policy entry.
SR OS routers never send or receive label-IPv6 routes with 32-bit IPv4 next-hop addresses.
When a label-IPv6 BGP route is advertised to an EBGP peer, next-hop-self applies unless the EBGP session has next-hop-unchanged enabled for the label-ipv6 address family. Next-hop-self results in one of the following outcomes:
When a label-IPv6 BGP route is received from an EBGP peer and re-advertised to an IBGP or confederation-EBGP peer, next-hop-self applies unless the IBGP or confederation-EBGP session has next-hop-unchanged enabled for the label-ipv6 address family. Next-hop-self results in one of the following outcomes:
To use a BGP route for forwarding, a BGP router must know how to reach the BGP next-hop of the route. The process of determining the local interface or tunnel used to reach the BGP next-hop is called next-hop resolution. The BGP next-hop resolution process depends on the type of route (the AFI/SAFI) and various configuration settings.
To enable the next-hop resolution of unlabeled IPv4 routes using tunnels in the tunnel-table of the router, it is necessary to add family ipv4 under the config>router>bgp>next-hop-resolution>shortcut-tunnel context. In the family ipv4 context, there are commands to specify the resolution mode (any, disabled or filter) and the set of tunnel types that are eligible to be used if filter mode (resolution filter) is selected.
If the resolution mode is set to disabled, the next-hops of unlabeled IPv4 routes can only be resolved by route table lookup.
If there are multiple tunnels in tunnel-table that are allowed by resolution any or resolution filter mode and that can resolve the BGP next-hop, the selection of the resolving tunnel comes down to factors such as route color, admin-tag-policy, tunnel-table preference, and LDP FEC prefix length.
If disallow-igp is enabled, then there is no attempt to resolve the IPv4 BGP route using route table lookup if no resolving tunnel can be found in the tunnel-table.
The available tunneling options for IPv4 shortcuts are:
To enable the next-hop resolution of unlabeled IPv6 routes using tunnels in the tunnel-table of the router, it is necessary to add family ipv6 under the config>router>bgp>next-hop-resolution>shortcut-tunnel context. If the next-hop of the IPv6 BGP route contains an IPv4-mapped IPv6 address, the shortcut-tunnel configuration applies to the use of IPv4 tunnels and IPv4 routes that match the embedded IPv4 address in the BGP next-hop. If the BGP next-hop is any other IPv6 address the shortcut-tunnel configuration applies to the use of IPv6 tunnels and IPv6 routes that match the full address of the BGP next-hop.
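The distinction between IPv4-mapped and other IPv6 next-hops can be sketched with the Python standard library. The function name and the returned tuple format are hypothetical; the sketch only classifies the next-hop address and does not model the actual tunnel or route lookup.

```python
import ipaddress


def resolution_target(bgp_next_hop):
    """Return which address space is used to resolve an IPv6 BGP next-hop.

    For an IPv4-mapped IPv6 address (::ffff:a.b.c.d) the embedded IPv4
    address is matched against IPv4 tunnels and routes; for any other
    IPv6 address the full address is matched against IPv6 tunnels and routes.
    """
    addr = ipaddress.IPv6Address(bgp_next_hop)
    embedded = addr.ipv4_mapped  # None unless the address is IPv4-mapped
    if embedded is not None:
        return ("ipv4", str(embedded))
    return ("ipv6", str(addr))


print(resolution_target("::ffff:192.0.2.1"))
print(resolution_target("2001:db8::1"))
```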
The family ipv6 context has commands to specify the resolution mode (any, disabled or filter) and the set of tunnel types that are eligible to be used if filter mode (resolution filter) is selected.
If there are multiple tunnels in tunnel-table that are allowed by resolution any or resolution filter mode and that can resolve the BGP next-hop, the selection of the resolving tunnel comes down to factors such as route color, admin-tag-policy, tunnel-table preference, and LDP FEC prefix length.
If the resolution mode is set to disabled, the next-hops of unlabeled IPv6 routes can only be resolved by route table lookup.
If disallow-igp is enabled, then there is no attempt to resolve the IPv6 BGP route using route table lookup if no resolving tunnel can be found in the tunnel-table.
The available tunneling options for IPv6 BGP routes with IPv4-mapped IPv6 next-hops are:
Use the following CLI syntax to configure next-hop resolution of BGP labeled routes.
The transport-tunnel context provides separate control for the different types of BGP label routes: label-IPv4, label-IPv6, and VPN routes (which includes both VPN-IPv4 and VPN-IPv6 routes). By default, all labeled routes resolve to LDP (even if the preceding CLI commands are not configured in the system).
If the resolution option is set to disabled, the default binding to LDP tunnels resumes. If resolution is set to any, the supported tunnel type selection is based on TTM preference. The order of preference of TTM tunnels is: RSVP, SR-TE, LDP, segment routing OSPF, segment routing IS-IS, and UDP.
The rsvp option instructs BGP to search for the best metric RSVP LSP to the address of the BGP next-hop. The address can correspond to the system interface or to another loopback used by the BGP instance on the remote node. The LSP metric is provided by MPLS in the tunnel table. In the case of multiple RSVP LSPs with the same lowest metric, BGP selects the LSP with the lowest tunnel ID.
The ldp option instructs BGP to search for an LDP LSP with a FEC prefix corresponding to the address of the BGP next-hop.
The bgp option instructs BGP to search for a BGP tunnel in TTM with a prefix matching the address of the BGP next-hop. A label-unicast IPv4 route cannot be resolved by another label-unicast IPv4 or IPv6 route. A label-unicast IPv6 route cannot be resolved by another label-unicast IPv6 route, but it can be resolved by a label-unicast IPv4 route.
When the sr-isis or sr-ospf option is enabled, an SR tunnel to the BGP next-hop is selected in the TTM from the lowest preference IS-IS or OSPF instance. If many instances have the same lowest preference, the lowest numbered IS-IS or OSPF instance is chosen.
The sr-te value launches a search for the best metric SR-TE LSP to the address of the BGP next-hop. The LSP metric is provided by MPLS in the tunnel table. In the case of multiple SR-TE LSPs with the same lowest metric, BGP selects the LSP with the lowest tunnel-id.
The udp value instructs BGP to look for an MPLSoUDP tunnel to the address of the BGP next-hop.
If one or more explicit tunnel types are specified using the resolution-filter option, then only these tunnel types are selected, again following the TTM preference. The resolution command must be set to filter to activate the list of tunnel types configured in resolution-filter.
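The interaction between the resolution mode, the resolution-filter list and the TTM preference order can be sketched as follows. This is a simplified model with hypothetical names: it only chooses a tunnel type, using the preference order listed earlier in this section, and it ignores the per-type tie-breaks (LSP metric, tunnel ID, IGP instance) as well as the bgp tunnel type.

```python
# TTM preference order as listed in the text, most preferred first.
TTM_PREFERENCE = ["rsvp", "sr-te", "ldp", "sr-ospf", "sr-isis", "udp"]


def eligible_tunnel_types(resolution, resolution_filter=()):
    """Return the tunnel types BGP may use, in order of preference."""
    if resolution == "disabled":
        return ["ldp"]  # default binding to LDP tunnels
    if resolution == "any":
        return list(TTM_PREFERENCE)
    if resolution == "filter":
        return [t for t in TTM_PREFERENCE if t in resolution_filter]
    raise ValueError(f"unknown resolution mode: {resolution}")


def select_tunnel_type(available, resolution, resolution_filter=()):
    """Pick the most preferred eligible type that has a tunnel to the next-hop."""
    for tunnel_type in eligible_tunnel_types(resolution, resolution_filter):
        if tunnel_type in available:
            return tunnel_type
    return None  # the next-hop cannot be resolved by a tunnel


available = {"ldp", "sr-isis"}  # tunnel types with a tunnel to the next-hop
print(select_tunnel_type(available, "any"))
print(select_tunnel_type(available, "filter", {"sr-isis", "udp"}))
```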
In SR OS, next-hop resolution is not a one-time event. If the IP route or tunnel that was used to resolve a BGP next-hop is withdrawn due to a failure or configuration change, an attempt is made to re-resolve the BGP next-hop using the next-best route or tunnel. If there are no more eligible routes or tunnels to resolve the BGP next-hop, then the BGP next-hop becomes unresolved. The continual process of monitoring and reacting to resolving route/tunnel changes is called next-hop tracking. In SR OS, next-hop tracking is completely event driven as opposed to timer driven; this provides the best possible convergence performance.
SR OS supports next-hop indirection for most types of BGP routes. Next-hop indirection means BGP next-hops are logically separated from resolved next-hops in the forwarding plane (IOMs). This separation allows routes that share the same BGP next-hops to be grouped so that when there is a change to the way a BGP next-hop is resolved, only one forwarding plane update is needed, as opposed to one update for every route in the group. The convergence time after the next-hop resolution change is uniform and not linear with the number of prefixes; in other words, next-hop indirection is a technology that supports prefix independent convergence (PIC). SR OS uses next-hop indirection whenever possible; there is no option to disable the functionality.
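The idea behind next-hop indirection can be shown with a toy data structure. This is a conceptual sketch only, not a description of how the forwarding plane is actually programmed; the class, its methods and the update counter are hypothetical.

```python
class IndirectFib:
    """Toy forwarding table with one level of next-hop indirection."""

    def __init__(self):
        self.route_to_bgp_nh = {}   # prefix -> BGP next-hop
        self.bgp_nh_to_resolved = {}  # BGP next-hop -> resolved next-hop
        self.updates = 0            # number of table writes performed

    def add_route(self, prefix, bgp_next_hop):
        self.route_to_bgp_nh[prefix] = bgp_next_hop
        self.updates += 1

    def resolve(self, bgp_next_hop, resolved_next_hop):
        # One write covers every route that shares this BGP next-hop.
        self.bgp_nh_to_resolved[bgp_next_hop] = resolved_next_hop
        self.updates += 1

    def lookup(self, prefix):
        return self.bgp_nh_to_resolved.get(self.route_to_bgp_nh[prefix])


fib = IndirectFib()
fib.resolve("192.0.2.1", "tunnel-A")
for i in range(1000):
    fib.add_route(f"10.{i // 256}.{i % 256}.0/24", "192.0.2.1")

before = fib.updates
fib.resolve("192.0.2.1", "tunnel-B")  # resolution change after a failure
print(fib.updates - before, fib.lookup("10.0.0.0/24"))
```

The cost of the resolution change is a single write regardless of how many prefixes point at the shared BGP next-hop, which is the property the text refers to as prefix independent convergence.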
The router supports the MPLS entropy label, as specified in RFC 6790, on RFC 3107 BGP labeled routes. LSR nodes in a network can load-balance labeled packets in a more granular way than by hashing on the standard label stack. Refer to the MPLS Guide for more information.
Entropy Label Capability (ELC) signaling is not supported for labeled routes representing BGP tunnels. Instead, ELC is configured at the head end LER using the configure router bgp override-tunnel-elc command. This command causes the router to ignore any advertisements for ELC that may or may not be received from the network, and instead to assume that the whole domain supports entropy labels.
The Multi-Exit Discriminator (MED) attribute is an optional attribute that can be added to routes advertised to an EBGP peer to influence the flow of inbound traffic to the AS. The MED attribute carries a 32-bit metric value. A lower metric is better than a higher metric when MED is compared by the BGP decision process. Unless the always-compare-med command is configured, MED is compared only if the routes come from the same neighbor AS. By default, if a route is received without a MED attribute, it is evaluated by the BGP decision process as though it had a MED containing the value 0, but this can be changed so that a missing MED attribute is handled the same as a MED with the maximum value. SR OS always removes the received MED attribute when advertising the route to an EBGP peer.
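The MED comparison rules can be condensed into a short sketch. Paths are modeled as (neighbor_as, med) tuples with None standing for a missing MED; the function and parameter names are hypothetical and only mirror the behaviors described above.

```python
MAX_MED = 2**32 - 1  # largest value of the 32-bit MED metric


def effective_med(med, missing_is_worst=False):
    """Return the MED used for comparison; None models a missing attribute."""
    if med is None:
        return MAX_MED if missing_is_worst else 0
    return med


def prefer_by_med(path_a, path_b, always_compare_med=False, missing_is_worst=False):
    """Return the path preferred by MED, or None when MED does not decide."""
    (as_a, med_a), (as_b, med_b) = path_a, path_b
    if not always_compare_med and as_a != as_b:
        return None  # different neighbor AS: MED is not compared
    med_a = effective_med(med_a, missing_is_worst)
    med_b = effective_med(med_b, missing_is_worst)
    if med_a == med_b:
        return None  # tie: later decision steps apply
    return path_a if med_a < med_b else path_b


print(prefer_by_med((64501, 50), (64501, 10)))
print(prefer_by_med((64501, 50), (64502, 10)))
print(prefer_by_med((64501, None), (64501, 10), missing_is_worst=True))
```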
Deterministic MED is an optional enhancement to the BGP decision process that causes BGP to group paths that are equal up to the MED comparison step based on the neighbor AS. BGP compares the best path from each group to arrive at the overall best path. This change to the BGP decision process makes best path selection completely deterministic in all cases. Without deterministic-med, the overall best path selection is sometimes dependent on the order of route arrival because of the rule that MED cannot be compared in routes from different neighbor AS.
Note: When BGP routes are leaked into a target BGP RIB, they are not grouped (in a deterministic MED context) with routes learned by that target RIB, even if the neighbor AS happens to be the same.
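The grouping step of deterministic MED can be sketched as follows. Paths are modeled as dictionaries, and the igp_cost field is a hypothetical stand-in for whatever decision steps follow the MED comparison; the paths passed in are assumed to be equal up to the MED step.

```python
from itertools import permutations


def deterministic_best(paths, tie_break):
    """Select the best path using neighbor-AS grouping (illustrative).

    MED is compared only inside each group; the group winners are then
    compared using the later decision steps, modeled here by tie_break.
    """
    groups = {}
    for path in paths:
        groups.setdefault(path["neighbor_as"], []).append(path)
    winners = [
        min(group, key=lambda p: (p["med"], tie_break(p)))
        for group in groups.values()
    ]
    return min(winners, key=tie_break)


paths = [
    {"name": "a", "neighbor_as": 64501, "med": 200, "igp_cost": 5},
    {"name": "b", "neighbor_as": 64502, "med": 100, "igp_cost": 10},
    {"name": "c", "neighbor_as": 64501, "med": 100, "igp_cost": 20},
]

# The same path is selected for every arrival order.
results = {
    deterministic_best(list(order), lambda p: p["igp_cost"])["name"]
    for order in permutations(paths)
}
print(results)
```

In this example path a loses to path c inside the AS 64501 group because of its higher MED, so it never competes on IGP cost with path b, and the outcome does not depend on the order in which the routes arrived.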
The LOCAL_PREF attribute is a well-known attribute that should be included in every route advertised to an IBGP or confederation-EBGP peer. It is used to influence the flow of outbound traffic from the AS. The local preference is a 32-bit value and higher values are more preferred by the BGP decision process. The LOCAL_PREF attribute is not included in routes advertised to EBGP peers (if the attribute is received from an EBGP peer it is ignored).
In SR OS the default local preference is 100 but this can be changed with the local-preference command or using route policies. When a LOCAL_PREF attribute needs to be added to a route because it does not have one (for example, because it was received from an EBGP peer) the value is the configured or default local-preference unless overridden by policy.
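The way the local preference value is determined can be summarized in a small sketch. The function and its parameters are hypothetical; the sketch assumes that a value set by route policy always takes precedence, which is a simplification of the behavior described above.

```python
DEFAULT_LOCAL_PREF = 100  # SR OS default stated in the text


def effective_local_pref(received, from_ebgp, configured=None, policy_value=None):
    """Return the local preference associated with a received route.

    received     -- LOCAL_PREF carried in the route, or None if absent
    from_ebgp    -- True if the route came from an EBGP peer
    configured   -- value of the local-preference command, if any
    policy_value -- value set by a route policy, if any (assumed to win)
    """
    if policy_value is not None:
        return policy_value
    if from_ebgp:
        received = None  # LOCAL_PREF from an EBGP peer is ignored
    if received is not None:
        return received
    return configured if configured is not None else DEFAULT_LOCAL_PREF


print(effective_local_pref(200, from_ebgp=True))
print(effective_local_pref(200, from_ebgp=False))
print(effective_local_pref(None, from_ebgp=True, configured=150))
```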
An aggregate route is a configured IP route that is activated and installed in the routing table when it has at least one contributing route. A route (R) contributes to an aggregate route (S1) if all of the following conditions are true:
When an aggregate route is activated by a router, it is not installed in the forwarding table by default. In general though, it is advisable to specify the black-hole next-hop option for an aggregate route, so that when it is activated it is installed in the forwarding table with a black-hole next-hop; this avoids the possibility of creating a routing loop. SR OS also supports the option to program an aggregate route into the forwarding table with an indirect next-hop; in this case, packets matching the aggregate route but not a more-specific contributing route are forwarded towards the indirect next-hop rather than discarded.
An active aggregate route can be advertised to a BGP peer (by exporting it into BGP) and this can avoid the need to advertise the more-specific contributing routes to the peer, reducing the number of routes in the peer AS and improving overall scalability. When a router advertises an aggregate route to a BGP peer the attributes in the route are set as follows:
Note: SR OS does not require all the contributing routes to have the same MED value.
A BGP route can be associated with one or more communities. There are three kinds of BGP communities:
In a standard community, the first two bytes usually encode the AS number of the administrative entity that assigned the value in the last two bytes. In SR OS, a standard community value is configured using the format <asnum:comm-value>; the colon is a required separator character. In route policy applications, multiple standard community values can be matched with a regular expression in the format <regex1>:<regex2>, where regex1 and regex2 are two regular expressions that are evaluated one numerical digit at a time.
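The two-byte AS number and two-byte value layout of a standard community can be sketched as a pair of conversion helpers between the asnum:comm-value text form and the 32-bit value. The function names are hypothetical.

```python
def encode_standard_community(text):
    """Convert 'asnum:comm-value' into the 32-bit community value."""
    asnum_str, value_str = text.split(":")
    asnum, value = int(asnum_str), int(value_str)
    if not (0 <= asnum <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError("each part of a standard community is 2 bytes")
    # First two bytes: AS number; last two bytes: assigned value.
    return (asnum << 16) | value


def decode_standard_community(raw):
    """Convert a 32-bit community value back to 'asnum:comm-value'."""
    return f"{raw >> 16}:{raw & 0xFFFF}"


raw = encode_standard_community("64500:100")
print(hex(raw), decode_standard_community(raw))
```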
The following well-known standard communities are understood and acted upon accordingly by SR OS routers.
Standard communities can be added to or removed from BGP routes using BGP import and export policies. When a BGP route is locally originated by exporting a static or aggregate route into BGP, and the static or aggregate route has one or more standard communities, these community values are automatically added to the BGP route. This may affect the advertisement of the locally originated route if one of the well-known communities is associated with the static or aggregate route.
To remove all the standard communities from all routes advertised to a BGP peer, use the disable-communities standard command.
Extended communities serve specialized roles. Each extended community is eight bytes. The first one or two bytes identify the type or sub-type and the remaining six or seven bytes identify a value. Some of the more common extended communities supported by SR OS include:
Extended communities can be added to or removed from BGP routes using BGP import and export policies. When a BGP route is locally originated by exporting a static or aggregate route into BGP, and the static or aggregate route has one or more extended communities, these community values are automatically added to the BGP route.
Note: While it may not make sense to add certain types of extended communities to routes of certain address families, SR OS allows such actions.
To remove all the extended communities from all routes advertised to a BGP peer, use the disable-communities extended command.
Each large community is a 12-byte value, formed from the logical concatenation of three 4-octet values: a Global Administrator part, a Local Data part 1, and Local Data part 2. The Global Administrator is a four-octet namespace identifier, which should be an Autonomous System Number assigned by IANA. The Global Administrator field is intended to allow different Autonomous Systems to define large communities without collision. Local Data Part 1 is a four-octet operator-defined value and Local Data Part 2 is another four-octet operator-defined value.
In SR OS, a large community value is configured using the format <ext-asnum>:<ext-comm-val>:<ext-comm-val>; the colon is a required separator character between each of the 4-byte values. In route policy applications, it is possible to match multiple large community values with a regular expression in the format <regex1>&<regex2>&<regex3>, where regex1, regex2 and regex3 are three regular expressions, each evaluated one numerical (decimal) digit at a time.
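The 12-byte layout of a large community can be sketched as a conversion between the colon-separated text form and three 4-octet fields in network byte order. The function names are hypothetical.

```python
import struct


def encode_large_community(text):
    """Convert 'global-admin:local-1:local-2' into its 12-byte encoding."""
    parts = [int(p) for p in text.split(":")]
    if len(parts) != 3 or any(not 0 <= p <= 0xFFFFFFFF for p in parts):
        raise ValueError("a large community is three 4-octet values")
    # Global Administrator, Local Data part 1, Local Data part 2.
    return struct.pack("!3I", *parts)


def decode_large_community(raw):
    """Convert a 12-byte large community back to its text form."""
    return ":".join(str(p) for p in struct.unpack("!3I", raw))


raw = encode_large_community("4200000000:1:65536")
print(len(raw), decode_large_community(raw))
```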
Large communities can be added to or removed from BGP routes using BGP import and export policies. When a BGP route is locally originated by exporting a static or aggregate route into BGP, and the static or aggregate route has one or more large communities, these community values are automatically added to the BGP route.
To remove all the large communities from all routes advertised to a BGP peer, use the disable-communities large command.
The ORIGINATOR_ID and CLUSTER_LIST are optional non-transitive attributes that play a role in route reflection, as described in the section titled Route Reflection.
As discussed in the BGP chapter overview the uses of BGP have increased well beyond Internet IPv4 routing due to its support for multi-protocol extensions, or more simply MP-BGP. MP-BGP allows BGP peers to exchange routes for NLRI other than IPv4 prefixes - for example IPv6 prefixes, Layer 3 VPN routes, Layer 2 VPN routes, FlowSpec rules, and so on. A BGP router that supports MP-BGP indicates the types of routes it wants to exchange with a peer by including the corresponding AFI (Address Family Identifier) and SAFI (Subsequent Address Family Identifier) values in the MP-BGP capability of its OPEN message. The two peers forming a session do not need to indicate support for the same address families. As long as there is one AFI/SAFI in common the session establishes and routes associated with all the common AFI/SAFI can be exchanged between the peers.
The list of AFI/SAFI advertised in the MP-BGP capability is controlled entirely by the family commands. The AFI/SAFI supported by SR OS and the method of configuring the AFI/SAFI support are summarized in Table 5.
Name | AFI | SAFI | Configuration Commands |
IPv4 unicast | 1 | 1 | family ipv4 |
IPv4 multicast | 1 | 2 | family mcast-ipv4 |
IPv4 labeled unicast | 1 | 4 | family label-ipv4 |
NG-MVPN IPv4 | 1 | 5 | family mvpn-ipv4 |
MDT-SAFI | 1 | 66 | family mdt-safi |
VPN-IPv4 | 1 | 128 | family vpn-ipv4 |
VPN-IPv4 multicast | 1 | 129 | family mcast-vpn-ipv4 |
RT constrain | 1 | 132 | family route-target |
IPv4 flow-spec | 1 | 133 | family flow-ipv4 |
IPv6 unicast | 2 | 1 | family ipv6 |
IPv6 multicast | 2 | 2 | family mcast-ipv6 |
IPv6 labeled unicast | 2 | 4 | family label-ipv6 |
NG-MVPN IPv6 | 2 | 5 | family mvpn-ipv6 |
VPN-IPv6 | 2 | 128 | family vpn-ipv6 |
IPv6 flow-spec | 2 | 133 | family flow-ipv6 |
Multi-segment PW | 25 | 6 | family ms-pw |
L2 VPN | 25 | 65 | family l2-vpn |
EVPN | 25 | 70 | family evpn |
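The address family negotiation described before Table 5 amounts to a set intersection. The sketch below uses a subset of the AFI/SAFI values from the table, keyed by the family keyword; the function name is hypothetical.

```python
# Subset of Table 5: family keyword -> (AFI, SAFI).
FAMILIES = {
    "ipv4": (1, 1),
    "label-ipv4": (1, 4),
    "vpn-ipv4": (1, 128),
    "ipv6": (2, 1),
    "vpn-ipv6": (2, 128),
    "evpn": (25, 70),
}


def negotiated_families(local_families, peer_families):
    """Return the AFI/SAFI pairs that both peers announced."""
    local = {FAMILIES[name] for name in local_families}
    peer = {FAMILIES[name] for name in peer_families}
    return local & peer


common = negotiated_families(["ipv4", "vpn-ipv4"], ["vpn-ipv4", "evpn"])
print(sorted(common))

# An empty intersection means there is no AFI/SAFI in common.
print(negotiated_families(["ipv4"], ["ipv6"]))
```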
To advertise reachable routes of a particular AFI/SAFI a BGP router includes a single MP_REACH_NLRI attribute in the UPDATE message. The MP_REACH_NLRI attribute encodes the AFI, the SAFI, the BGP next-hop and all the reachable NLRI. To withdraw routes of a particular AFI/SAFI a BGP router includes a single MP_UNREACH_NLRI attribute in the UPDATE message. The MP_UNREACH_NLRI attribute encodes the AFI, the SAFI and all the withdrawn NLRI. While it is valid to advertise and withdraw IPv4 unicast routes using the MP_REACH_NLRI and MP_UNREACH_NLRI attributes, SR OS always uses the IPv4 fields of the UPDATE message to convey reachable and unreachable IPv4 unicast routes.
The AS4_PATH and AS4_AGGREGATOR path attributes are optional transitive attributes that support the gradual migration to routers that can understand and parse 4-octet AS numbers. The use of these attributes is discussed in the section titled 4-Octet Autonomous System Numbers.
The accumulated IGP (AIGP) metric is an optional non-transitive attribute that can be attached to selected routes (using route policies) to influence the BGP decision process to prefer BGP paths with a lower end-to-end IGP cost, even when the compared paths span more than one AS or IGP instance. AIGP is different from MED in several important ways:
In the SR OS implementation, AIGP is supported only in the base router BGP instance and only for the following types of routes: IPv4, label-IPv4, IPv6 and label-IPv6. The AIGP attribute is only sent to peers configured with the aigp command. If the attribute is received from a peer that is not configured for aigp or if the attribute is received in a non-supported route type the attribute is discarded and not propagated to other peers (but it is still displayed in BGP show commands).
When a 7450, 7750, or 7950 router receives a route with an AIGP attribute and it re-advertises the route to an AIGP-enabled peer without any change to the BGP next-hop the AIGP metric value is unchanged by the advertisement (RIB-OUT) process. But if the route is re-advertised with a new BGP next-hop the AIGP metric value is automatically incremented by the route table (or tunnel table) cost to reach the received BGP next-hop and/or by a statically configured value (using route policies).
The entire set of BGP routes learned and advertised by a BGP router makes up its BGP Routing Information Base (RIB). Conceptually the BGP RIB can be divided into three parts:
The RIB-IN (or Adj-RIBs-In as defined in RFC 4271) holds the BGP routes that were received from peers and that the router decided to keep (store in its memory).
The LOC-RIB contains modified versions of the BGP routes in the RIB-IN. The path attributes of a RIB-IN route can be modified using BGP import policies. All of the LOC-RIB routes for the same NLRI are compared in a procedure called the BGP decision process that results in the selection of the best path for each NLRI. The best paths in the LOC-RIB are the ones that are actually ‘usable’ by the local router for forwarding, filtering, auto-discovery, and so on.
The RIB-OUT (or Adj-RIBs-Out as defined in RFC 4271) holds the BGP routes that were advertised to peers. Normally a BGP route is not advertised to a peer (in the RIB-OUT) unless it is ‘used’ locally but there are exceptions. BGP export policies modify the path attributes of a LOC-RIB route to create the path attributes of the RIB-OUT route. A particular LOC-RIB route can be advertised with different path attribute values to different peers so there can exist a 1:N relationship between LOC-RIB and RIB-OUT routes.
The following sections describe many important BGP features in the context of the RIB architecture outlined above.
SR OS implements the following features related to RIB-IN processing:
The import command is used to apply one or more policies (up to 15) to a neighbor, group or to the entire BGP context. The import command that is most-specific to a peer is the one that is applied. An import policy command applied at the neighbor level takes precedence over the same command applied at the group or global level. An import policy command applied at the group level takes precedence over the same command specified on the global level. The import policies applied at different levels are not cumulative. The policies listed in an import command are evaluated in the order in which they are specified.
Note: The import command can reference a policy before it has been created (as a policy-statement).
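The precedence rule for import policies can be pictured as "take the list from the most specific level that has one, and ignore the rest". The sketch below is a hypothetical model of that rule only; the function name and arguments are assumptions and do not correspond to any SR OS API.

```python
MAX_POLICIES = 15  # limit stated for a single import command


def effective_import_policies(neighbor=None, group=None, global_=None):
    """Return the policy list from the most specific level that has one.

    Lists configured at less specific levels are ignored, not appended,
    because policies applied at different levels are not cumulative.
    """
    for policies in (neighbor, group, global_):
        if policies:
            if len(policies) > MAX_POLICIES:
                raise ValueError("at most 15 policies per import command")
            # Order is preserved: policies are evaluated as listed.
            return list(policies)
    return []


print(effective_import_policies(group=["from-customers", "set-lp"],
                                global_=["default-in"]))
# ['from-customers', 'set-lp']
```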
When an IP route is rejected by an import policy it is still maintained in the RIB-IN so that a policy change can be made later on without requiring the peer to re-send all its RIB-OUT routes. This is sometimes called soft reconfiguration inbound and requires no special configuration in SR OS.
When a VPN route is rejected by an import policy or not imported by any services it is deleted from the RIB-IN. For VPN-IPv4 and VPN-IPv6 routes this behavior can be changed by configuring the mp-bgp-keep command; this option maintains rejected VPN-IP routes in the RIB-IN so that a Route Refresh message does not have to be issued when there is an import policy change.
SR OS implements the following features related to LOC-RIB processing.
These features are discussed in the following sections.
When a BGP router has multiple paths in its RIB for the same NLRI, its BGP decision process is responsible for deciding which path is the best. The best path can be used by the local router and advertised to other BGP peers.
On 7450 ESS, 7750 SR, and 7950 XRS routers, the BGP decision process orders received paths based on the following sequence of comparisons. If there is a tie between paths at any step, BGP proceeds to the next step.
Each BGP RIB with IP routes (unlabeled IPv4, labeled-unicast IPv4, unlabeled IPv6, and labeled-unicast IPv6) submits its best path for each prefix to the common IP route table, unless the disable-route-table-install command is configured or the selective-label-ipv4-install command has prevented the installation. The best path is selected by the BGP decision process. The default preference for BGP routes submitted by the label-IPv4 and label-IPv6 RIBs (these appear in the route table and FIB as having a BGP-LABEL protocol type) can be modified by using the label-preference command. The default preference for BGP routes submitted by the unlabeled IPv4 and IPv6 RIBs can be modified by using the preference command.
Note: The BGP instance level disable-route-table-install command can be configured on control-plane route reflectors that are not involved in packet forwarding (that is, those that do not modify the BGP next hop). This command improves performance and scalability. The disable-route-table-install policy action can be applied to BGP routes matching a peer import policy to conserve FIB space on a router that is in the datapath, for example, a router that should advertise BGP routes with itself as next hop even though it has not installed those routes into its own forwarding table.
If a BGP RIB has multiple BGP paths for the same IPv4 or IPv6 prefix that qualify as the best path up to a certain point in the comparison process, then a certain number of these multi-paths can be submitted to the common IP route table. This is called BGP multi-path and must be explicitly enabled using one or more commands in the multi-path context. These commands specify the maximum number of BGP paths, including the overall best path, that each BGP RIB can submit to the route table for any particular IPv4 or IPv6 prefix. If ECMP, with a limit of n, is enabled in the base router instance, then up to n paths are selected for installation in the IP FIB. In the data-path, traffic matching the IP route is load-shared across the ECMP next hops based on a per-packet hash calculation.
By default, the hashing is not sticky, meaning that when one or more of the ECMP BGP next hops fail, all traffic flows matching the route are potentially moved to new BGP next hops. If required, a BGP route can be marked (using the sticky-ecmp action in route policies) for sticky ECMP behavior so that BGP next hop failures are handled by moving only the affected traffic flows to the remaining next hops as evenly as possible. If new ECMP BGP next hops become available for a marked BGP route, then flows are moved as evenly as possible onto the resultant set of next hops.
Each sticky ECMP route utilizes 64 distribution buckets in order to apportion flows onto the available next hops. Figure 20, Figure 21, and Figure 22 provide an example of the distribution of flows over multiple BGP next hops as next hops are removed.
Table 6 lists the sticky ECMP flow distribution as next hops are removed for 1.1.1.1/32.
Initial Sticky ECMP Distribution for 1.1.1.1/32 in Figure 20 | ECMP Distribution for 1.1.1.1/32 if Next Hop 3 Fails in Figure 21 | ECMP Distribution for 1.1.1.1/32 if Next Hop 2 Subsequently Fails in Figure 22 | |||
Bucket | NH | Bucket | NH | Bucket | NH |
00 | 1 | 00 | 1 | 00 | 1 |
01 | 2 | 01 | 2 | 01 | 1 |
02 | 3 | 02 | 1 | 02 | 1 |
03 | 1 | 03 | 1 | 03 | 1 |
04 | 2 | 04 | 2 | 04 | 1 |
05 | 3 | 05 | 2 | 05 | 1 |
06 | 1 | 06 | 1 | 06 | 1 |
07 | 2 | 07 | 2 | 07 | 1 |
08 | 3 | 08 | 1 | 08 | 1 |
09 | 1 | 09 | 1 | 09 | 1 |
10 | 2 | 10 | 2 | 10 | 1 |
11 | 3 | 11 | 2 | 11 | 1 |
12 | 1 | 12 | 1 | 12 | 1 |
13 | 2 | 13 | 2 | 13 | 1 |
14 | 3 | 14 | 1 | 14 | 1 |
15 | 1 | 15 | 1 | 15 | 1 |
16 | 2 | 16 | 2 | 16 | 1 |
17 | 3 | 17 | 2 | 17 | 1 |
18 | 1 | 18 | 1 | 18 | 1 |
19 | 2 | 19 | 2 | 19 | 1 |
20 | 3 | 20 | 1 | 20 | 1 |
21 | 1 | 21 | 1 | 21 | 1 |
22 | 2 | 22 | 2 | 22 | 1 |
23 | 3 | 23 | 2 | 23 | 1 |
24 | 1 | 24 | 1 | 24 | 1 |
25 | 2 | 25 | 2 | 25 | 1 |
26 | 3 | 26 | 1 | 26 | 1 |
27 | 1 | 27 | 1 | 27 | 1 |
28 | 2 | 28 | 2 | 28 | 1 |
29 | 3 | 29 | 2 | 29 | 1 |
30 | 1 | 30 | 1 | 30 | 1 |
31 | 2 | 31 | 2 | 31 | 1 |
32 | 3 | 32 | 1 | 32 | 1 |
33 | 1 | 33 | 1 | 33 | 1 |
34 | 2 | 34 | 2 | 34 | 1 |
35 | 3 | 35 | 2 | 35 | 1 |
36 | 1 | 36 | 1 | 36 | 1 |
37 | 2 | 37 | 2 | 37 | 1 |
38 | 3 | 38 | 1 | 38 | 1 |
39 | 1 | 39 | 1 | 39 | 1 |
40 | 2 | 40 | 2 | 40 | 1 |
41 | 3 | 41 | 2 | 41 | 1 |
42 | 1 | 42 | 1 | 42 | 1 |
43 | 2 | 43 | 2 | 43 | 1 |
44 | 3 | 44 | 1 | 44 | 1 |
45 | 1 | 45 | 1 | 45 | 1 |
46 | 2 | 46 | 2 | 46 | 1 |
47 | 3 | 47 | 2 | 47 | 1 |
48 | 1 | 48 | 1 | 48 | 1 |
49 | 2 | 49 | 2 | 49 | 1 |
50 | 3 | 50 | 1 | 50 | 1 |
51 | 1 | 51 | 1 | 51 | 1 |
52 | 2 | 52 | 2 | 52 | 1 |
53 | 3 | 53 | 2 | 53 | 1 |
54 | 1 | 54 | 1 | 54 | 1 |
55 | 2 | 55 | 2 | 55 | 1 |
56 | 3 | 56 | 1 | 56 | 1 |
57 | 1 | 57 | 1 | 57 | 1 |
58 | 2 | 58 | 2 | 58 | 1 |
59 | 3 | 59 | 2 | 59 | 1 |
60 | 1 | 60 | 1 | 60 | 1 |
61 | 2 | 61 | 2 | 61 | 1 |
62 | 3 | 62 | 1 | 62 | 1 |
63 | 1 | 63 | 1 | 63 | 1 |
Figure 23, Figure 24, and Figure 25 provide an example of the distribution of flows over multiple BGP next hops as next hops are added.
Table 7 lists the sticky ECMP flow distribution as next hops are added for 1.1.1.1/32.
Initial Sticky ECMP Distribution for 1.1.1.1/32 in Figure 23 | ECMP Distribution for 1.1.1.1/32 if Next Hop 2 Becomes Available in Figure 24 | ECMP Distribution for 1.1.1.1/32 if Next Hop 3 Additionally Becomes Available in Figure 25 | |||
Bucket | NH | Bucket | NH | Bucket | NH |
00 | 1 | 00 | 1 | 00 | 1 |
01 | 1 | 01 | 2 | 01 | 2 |
02 | 1 | 02 | 1 | 02 | 3 |
03 | 1 | 03 | 2 | 03 | 2 |
04 | 1 | 04 | 1 | 04 | 1 |
05 | 1 | 05 | 2 | 05 | 3 |
06 | 1 | 06 | 1 | 06 | 1 |
07 | 1 | 07 | 2 | 07 | 2 |
08 | 1 | 08 | 1 | 08 | 3 |
09 | 1 | 09 | 2 | 09 | 2 |
10 | 1 | 10 | 1 | 10 | 1 |
11 | 1 | 11 | 2 | 11 | 3 |
12 | 1 | 12 | 1 | 12 | 1 |
13 | 1 | 13 | 2 | 13 | 2 |
14 | 1 | 14 | 1 | 14 | 3 |
15 | 1 | 15 | 2 | 15 | 2 |
16 | 1 | 16 | 1 | 16 | 1 |
17 | 1 | 17 | 2 | 17 | 3 |
18 | 1 | 18 | 1 | 18 | 1 |
19 | 1 | 19 | 2 | 19 | 2 |
20 | 1 | 20 | 1 | 20 | 3 |
21 | 1 | 21 | 2 | 21 | 2 |
22 | 1 | 22 | 1 | 22 | 1 |
23 | 1 | 23 | 2 | 23 | 3 |
24 | 1 | 24 | 1 | 24 | 1 |
25 | 1 | 25 | 2 | 25 | 2 |
26 | 1 | 26 | 1 | 26 | 3 |
27 | 1 | 27 | 2 | 27 | 2 |
28 | 1 | 28 | 1 | 28 | 1 |
29 | 1 | 29 | 2 | 29 | 3 |
30 | 1 | 30 | 1 | 30 | 1 |
31 | 1 | 31 | 2 | 31 | 2 |
32 | 1 | 32 | 1 | 32 | 3 |
33 | 1 | 33 | 2 | 33 | 2 |
34 | 1 | 34 | 1 | 34 | 1 |
35 | 1 | 35 | 2 | 35 | 3 |
36 | 1 | 36 | 1 | 36 | 1 |
37 | 1 | 37 | 2 | 37 | 2 |
38 | 1 | 38 | 1 | 38 | 3 |
39 | 1 | 39 | 2 | 39 | 2 |
40 | 1 | 40 | 1 | 40 | 1 |
41 | 1 | 41 | 2 | 41 | 3 |
42 | 1 | 42 | 1 | 42 | 1 |
43 | 1 | 43 | 2 | 43 | 2 |
44 | 1 | 44 | 1 | 44 | 3 |
45 | 1 | 45 | 2 | 45 | 2 |
46 | 1 | 46 | 1 | 46 | 1 |
47 | 1 | 47 | 2 | 47 | 3 |
48 | 1 | 48 | 1 | 48 | 1 |
49 | 1 | 49 | 2 | 49 | 2 |
50 | 1 | 50 | 1 | 50 | 3 |
51 | 1 | 51 | 2 | 51 | 2 |
52 | 1 | 52 | 1 | 52 | 1 |
53 | 1 | 53 | 2 | 53 | 3 |
54 | 1 | 54 | 1 | 54 | 1 |
55 | 1 | 55 | 2 | 55 | 2 |
56 | 1 | 56 | 1 | 56 | 3 |
57 | 1 | 57 | 2 | 57 | 2 |
58 | 1 | 58 | 1 | 58 | 1 |
59 | 1 | 59 | 2 | 59 | 3 |
60 | 1 | 60 | 1 | 60 | 1 |
61 | 1 | 61 | 2 | 61 | 2 |
62 | 1 | 62 | 1 | 62 | 3 |
63 | 1 | 63 | 2 | 63 | 2 |
A BGP route to an IPv4 or IPv6 prefix is a candidate for installation as an ECMP next hop only if it meets all of the following criteria:
SR OS also supports IBGP multi-path. In some topologies, a BGP next hop is resolved by an IP route that has multiple ECMP next hops. When ibgp-multipath is not configured, only one of the ECMP next hops is programmed as the next hop of the BGP route in the IOM. When ibgp-multipath is configured, the IOM attempts to use all the ECMP next hops of the resolving route in the forwarding state. Although the name of the ibgp-multipath command implies that it is specific to IBGP-learned routes, this is not the case. It also applies to routes learned from any multi-hop BGP session including routes learned from multi-hop EBGP peers.
Be aware that multi-path and ibgp-multipath are not mutually exclusive and work together. The multi-path commands enable ECMP load-sharing across different BGP next hops (corresponding to different BGP routes), while the ibgp-multipath command enables ECMP load-sharing across the next hops of the IP routes that resolve the BGP next hops.
Finally, ibgp-multipath does not control traffic load sharing toward a BGP next hop that is resolved by a tunnel, as when dealing with BGP shortcuts or labeled routes (VPN-IP, label-IPv4, or label-IPv6). When a BGP next hop is resolved by a tunnel that supports ECMP, the load-sharing of traffic across the ECMP next hops of the tunnel is automatic.
SR OS supports direct resolution of a BGP next hop to multiple RSVP-TE or SR-TE tunnels. In addition, a BGP next hop can be resolved by multiple LDP ECMP next hops that each correspond to a separate LDP-over-RSVP or LDP-over-SRTE tunnel. It is also possible for a BGP next hop to be resolved by an IGP shortcut route that has multiple RSVP-TE or SR-TE tunnels as its ECMP next hops.
In some cases, the ECMP BGP next-hops of an IP route correspond to paths with very different bandwidths and it makes sense for the ECMP load-balancing algorithm to distribute traffic across the BGP next-hops in proportion to their relative bandwidths. The bandwidth associated with a path can be signaled to other BGP routers by including a link-bandwidth extended community in the BGP route. The link-bandwidth extended community is optional and non-transitive and encodes an autonomous system (AS) number and a bandwidth.
The SR OS implementation supports the link-bandwidth extended community in routes associated with the following address families: IPv4, IPv6, label-IPv4, label-IPv6, VPN-IPv4, and VPN-IPv6. The router automatically performs weighted ECMP for an IP BGP route when all of the ECMP BGP next-hops of the route include a link-bandwidth extended community. The relative weight of traffic sent to each BGP next-hop is visible in the output of the show router route-table extensive and show router fib extensive commands.
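As a rough illustration of the proportional split, the sketch below computes a traffic share per BGP next hop from its link-bandwidth value. The function name and data layout are hypothetical. Falling back to equal shares when any next hop lacks a bandwidth (or has a non-positive one) is an assumption made here to model ordinary ECMP.

```python
def ecmp_weights(next_hop_bandwidths):
    """Map each BGP next hop to its share of traffic (shares sum to 1).

    next_hop_bandwidths maps a next hop to its link-bandwidth value, or to
    None when the route carries no link-bandwidth extended community.
    """
    if not next_hop_bandwidths:
        return {}
    values = list(next_hop_bandwidths.values())
    # Weighted ECMP applies only when every next hop carries a bandwidth.
    weighted = all(bw is not None and bw > 0 for bw in values)
    if not weighted:
        share = 1.0 / len(values)
        return {nh: share for nh in next_hop_bandwidths}
    total = sum(values)
    return {nh: bw / total for nh, bw in next_hop_bandwidths.items()}


print(ecmp_weights({"10.0.0.1": 1000, "10.0.0.2": 3000}))
# {'10.0.0.1': 0.25, '10.0.0.2': 0.75}
```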
A route with a link-bandwidth extended community can be received from any IBGP peer. If such a route is received from an EBGP peer, the link-bandwidth extended community is stripped from the route unless an accept-from-ebgp command applies to that EBGP peer. However, a link-bandwidth extended community can be added to routes received from a directly connected (single hop) EBGP peer, potentially replacing the received Extended Community. This is accomplished using the add-to-received-ebgp command, which is available in group and neighbor configuration contexts.
When a route with a link-bandwidth extended community is advertised to an EBGP peer, the link-bandwidth extended community is removed by default. However, transitivity across an AS boundary can be allowed by configuring the send-to-ebgp command.
When a route with a link-bandwidth extended community is advertised to a peer using next-hop-self, the Extended Community is usually removed if it was not added locally (that is, by policy or add-to-received-ebgp command). However, in the special case that a route is readvertised (with next-hop-self) toward a peer covered by the scope of an aggregate-used-paths command, and the re-advertising router has installed multiple ECMP paths toward the destination each associated with a link-bandwidth extended community, the route is readvertised with a link-bandwidth extended community encoding the total bandwidth of all the used multi-paths.
The link-bandwidth extended community associated with a BGP route can be displayed using the show router bgp routes command. For the bandwidth value, the system automatically converts the binary value in the extended community to a decimal number in units of Mb/s (1 000 000 b/s).
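The conversion to Mb/s can be sketched as follows. This example assumes the encoding described in the IETF link-bandwidth draft (draft-ietf-idr-link-bandwidth): type 0x40, sub-type 0x04, a 2-octet AS number, and a 4-octet IEEE floating-point bandwidth in bytes per second. The function name is hypothetical and the code is not taken from SR OS.

```python
import struct

LINK_BANDWIDTH_SUBTYPE = 0x04  # assumed sub-type, per the IETF draft


def link_bandwidth_mbps(community: bytes):
    """Decode an 8-byte link-bandwidth extended community.

    Returns (as_number, bandwidth in Mb/s, where 1 Mb/s = 1 000 000 b/s).
    """
    if len(community) != 8:
        raise ValueError("extended community must be 8 bytes")
    # Network byte order: type, sub-type, 2-octet AS, 4-octet IEEE float.
    _type, subtype, asn, bytes_per_sec = struct.unpack("!BBHf", community)
    if subtype != LINK_BANDWIDTH_SUBTYPE:
        raise ValueError("not a link-bandwidth extended community")
    # The wire value is in bytes per second; convert to bits, then to Mb/s.
    return asn, bytes_per_sec * 8 / 1_000_000


raw = struct.pack("!BBHf", 0x40, 0x04, 65000, 125_000_000.0)
print(link_bandwidth_mbps(raw))  # (65000, 1000.0)
```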
Weighted ECMP across the BGP next-hops of an IP BGP route is supported in combination with ECMP at the level of the route or tunnel that resolves one or more of the ECMP BGP next-hops. This ECMP at the resolving level can also be weighted ECMP when the following conditions all apply:
Received label-unicast routes can be installed by BGP as tunnels in the tunnel table. In SR OS, the tunnel table is used to resolve a BGP next-hop to a tunnel when required by the configuration or the type of route (see Next-Hop Resolution). BGP tunnels play a key role in the following solutions:
BGP tunnels have a preference of 10 in the tunnel table, compared to 9 for LDP tunnels and 7 for RSVP tunnels. If the router configuration allows all types of tunnels to resolve a BGP next-hop, an RSVP LSP is preferred over an LDP tunnel, and an LDP tunnel is preferred over a BGP tunnel.
Further details about BGP-LU tunnels, which depend on the address family, are described below.
A label-IPv4 route is automatically added as a BGP tunnel entry to the tunnel table if all of the following conditions are met:
If multipath and ECMP are configured so that they apply to label IPv4 routes, then a BGP tunnel can be installed in the tunnel table with multiple ECMP next-hops, each one corresponding to a path through a different BGP next-hop. The multipath selection process outlined in BGP Route Installation in the Route Table also applies to this case.
A label-IPv6 route is automatically added as a BGP tunnel entry to the tunnel table if all of the following conditions are met:
If multipath and ECMP are configured so that they apply to label IPv6 routes, a BGP tunnel can be installed in the tunnel table with multiple ECMP next-hops, each one corresponding to a path through a different BGP next-hop. However, when disable-explicit-null is configured, the label-IPv6 routes used for ECMP toward an IPv6 destination cannot be a mix of routes with regular label values and routes with special (IPv6 explicit null) label values.
BGP fast reroute is a feature that brings together indirection techniques in the forwarding plane and pre-computation of BGP backup paths in the control plane to support fast reroute of BGP traffic around unreachable/failed BGP next-hops. BGP fast reroute is supported with IPv4, label-IPv4, IPv6, label-IPv6, VPN-IPv4 and VPN-IPv6 routes. The scenarios supported by the base router BGP context are outlined in Table 8.
Refer to the VPRN section of the 7450 ESS, 7750 SR, 7950 XRS, and VSR Layer 3 Services Guide: IES and VPRN for BGP fast reroute information specific to IP VPNs.
Ingress Packet | Primary Route | Backup Route | Prefix Independent Convergence |
IPv4 | IPv4 route with next-hop A resolved by an IPv4 route or any shortcut tunnel | IPv4 route with next-hop B resolved by an IPv4 route or any shortcut tunnel | Yes |
IPv4 | Label-IPv4 route with next-hop A resolved by any transport tunnel | Label-IPv4 route with next-hop B resolved by any transport tunnel | Yes, but if the label-IPv4 routes are label-per-prefix, the ingress card must be IOM3 or better for PIC |
IPv4 | Label-IPv4 route with next-hop A resolved by a local route | Label-IPv4 route with next-hop B resolved by a local route | Yes, but if the label-IPv4 routes are label-per-prefix, the ingress card must be IOM3 or better for PIC |
IPv4 | Label-IPv4 route with next-hop A resolved by a static route | Label-IPv4 route with next-hop B resolved by a static route | Yes, but if the label-IPv4 routes are label-per-prefix, the ingress card must be IOM3 or better for PIC |
IPv6 | IPv6 route with next-hop A resolved by an IPv6 route | IPv6 route with next-hop B resolved by an IPv6 route | Yes |
IPv6 | Label-IPv6 route with next-hop A resolved by any transport tunnel | Label-IPv6 route with next-hop B resolved by any transport tunnel | Yes, but if the label-IPv6 routes are label-per-prefix, the ingress card must be IOM3 or better for PIC |
IPv6 | Label-IPv6 route with next-hop A resolved by a local route | Label-IPv6 route with next-hop B resolved by a local route | Yes, but if the label-IPv6 routes are label-per-prefix, the ingress card must be IOM3 or better for PIC |
IPv6 | Label-IPv6 route with next-hop A resolved by a static route | Label-IPv6 route with next-hop B resolved by a static route | Yes, but if the label-IPv6 routes are label-per-prefix, the ingress card must be IOM3 or better for PIC |
In SR OS, fast reroute is optional and must be enabled by using either the BGP backup-path command or the route-policy install-backup-path command. Typically, only one approach is used.
The backup-path command in the base router context is used to control fast reroute on a per-RIB basis (IPv4, label-IPv4, IPv6, and label-IPv6). When the command specifies a particular family, BGP attempts to find a backup path for every prefix learned by the associated BGP RIB.
The install-backup-path command, available in route-policy-action contexts, marks a BGP route as requesting a backup path. It only takes effect in BGP import and VRF import policies. If only some prefixes should have backup paths, then the backup-path command should not be used, and instead the install-backup-path command should be used to mark only those prefixes that require extra protection.
In general, a prefix supports ECMP paths or a backup path, but not both. The backup path is the best path after the primary path and any paths with the same BGP next-hop as the primary path have been removed.
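The backup path rule above can be expressed as a short selection routine. This is a minimal sketch under stated assumptions: the input list is assumed to be already ordered best-first by the BGP decision process, and the (next_hop, path_id) tuple layout is hypothetical.

```python
def select_backup_path(ranked_paths):
    """Pick the backup path for a prefix.

    ranked_paths is a list of (next_hop, path_id) tuples ordered best-first.
    The first entry is the primary path. The backup is the best remaining
    path whose BGP next hop differs from the primary's, or None if there
    is no such path.
    """
    if not ranked_paths:
        return None
    primary_next_hop = ranked_paths[0][0]
    for path in ranked_paths[1:]:
        if path[0] != primary_next_hop:
            return path
    return None


paths = [
    ("10.0.0.1", "p1"),  # primary
    ("10.0.0.1", "p2"),  # same next hop as the primary: not eligible
    ("10.0.0.2", "p3"),  # best path through a different next hop
    ("10.0.0.3", "p4"),
]
print(select_backup_path(paths))  # ('10.0.0.2', 'p3')
```

Excluding paths that share the primary's next hop matters because a failure of that next hop would make the backup unusable at the same moment as the primary.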
When BGP fast reroute is enabled the IOM reroutes traffic onto a backup path based on input from BGP. When BGP decides that a primary path is no longer usable it notifies the IOM and affected traffic is immediately switched to the backup path.
The following events trigger failure notifications to the IOM and reroute of traffic to backup paths:
QPPB is a feature that allows a QoS treatment (forwarding class and optionally priority) to be associated with a BGP IPv4, label-IPv4, IPv6, or label-IPv6 route that is installed in the routing table. This is done so that when traffic arrives on a QPPB-enabled IP interface and its source or destination IP address matches a BGP route with QoS information, the packet is handled according to the QoS of the matching route. SR OS supports QPPB on the following types of interfaces:
QPPB is enabled on an interface using the qos-route-lookup command. There are separate commands for IPv4 and IPv6, so QPPB can be enabled in one mode (source, destination, or none) for IPv4 packets arriving on the interface and in a different mode (source, destination, or none) for IPv6 packets arriving on the interface.
Note: Source-based QPPB is not supported on subscriber interfaces.
Different BGP routes for the same IP prefix can be associated with different QPPB information. If these LOC-RIB routes are combined in support of ECMP or BGP fast reroute then the QPPB information becomes next-hop specific. This means that in destination QPPB mode the QoS assigned to a packet depends on the BGP next-hop that is selected for that particular packet by the ECMP hash or fast reroute algorithm. In source QPPB mode the QoS assigned to a packet comes from the first BGP next-hop of the IP route matching the source address.
Policy accounting is a feature that allows “classes” to be associated with certain IPv4 or IPv6 routes, static or BGP learned, when they are installed in the routing table. This is done for the following reasons:
For both applications the following IP interface types are supported:
Policy accounting, and policing (if needed and supported), is enabled on an interface using the policy-accounting command. The name of a policy accounting template must be specified as an argument of this command. SR OS supports up to 1024 different templates. Each policy accounting template can have a list of source classes (up to 255), a list of destination classes (up to 255), and a list of policers (up to 63). Each source class, destination class, and policer, in their respective list, has an index number. Source class indices and destination class indices have a global meaning. In other words, destination-class index 5 in one template refers to the same set of routes as destination-class index 5 in another policy accounting template. Policer indices have a local scope to the enclosing template. In one template, destination-class index 5 could use policer index 2 and in another template destination-class index 5 could use policer index 62. If a destination class has an associated policer then incoming traffic on each IP interface on which the template is applied is rate-limited based on that policer if the destination IP address matches a route with that destination class.
Policy accounting templates containing one or more source class identifiers cannot be applied to subscriber interfaces.
The policy accounting template tells the IOM the number of statistics and policer resources to use for each interface. These resources are derived from two pools that are sized per-FP. The first pool consists of policer statistics indices. Every policy-accounting interface on a card or FP uses one of these resources for every source and destination class index listed in the template referenced by the interface. These are basic resources needed for statistics collection. The total reservation at the FP level is set using the configure card slot-number fp fp-number policy-accounting command.
The second pool (FP4 cards only) consists of policer index resources. Every policy-accounting interface on a complex uses one of these resources for every destination class associated with a policer in the template referenced by the interface. The total reservation of this second resource at the FP level is set using the configure card slot-number fp fp-number ingress policy-accounting policers command.
The total number of the above two resources, per FP, must be less than or equal to 128000. In addition, the second resource pool size must be less than or equal to the size of the first resource pool.
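The sizing rules for the two per-FP pools can be checked with a few lines of arithmetic. The sketch below is illustrative only: it reads "total number of the above two resources" as the sum of the two pool sizes, which is an interpretation, and all function names are hypothetical.

```python
MAX_TOTAL_PER_FP = 128_000  # stated upper bound for both pools combined


def interface_demand(source_classes, dest_classes, policed_dest_classes):
    """Return (statistics indices, policer indices) one interface consumes.

    One statistics index is used per source and destination class in the
    template; one policer index per destination class that has a policer.
    """
    stats = len(source_classes) + len(dest_classes)
    policers = len(policed_dest_classes)
    return stats, policers


def validate_fp_reservation(stats_pool, policer_pool):
    """Return a list of rule violations (empty when the sizes are valid)."""
    errors = []
    if stats_pool + policer_pool > MAX_TOTAL_PER_FP:
        errors.append("combined pool size exceeds 128000")
    if policer_pool > stats_pool:
        errors.append("policer pool is larger than statistics pool")
    return errors


print(interface_demand([1, 2], [5, 6, 7], [5]))  # (5, 1)
print(validate_fp_reservation(100_000, 28_000))  # []
```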
It is possible to increase or decrease the size of either resource sub-pool at any time. A decrease can cause some interfaces (randomly selected) to immediately lose their resources and stop counting or policing some traffic that was previously being counted or policed.
If policy accounting is enabled on a spoke SDP or R-VPLS interface, all FPs in the system should have a reservation for each of the above resources; otherwise the show router interface policy-accounting command output reports that the statistics are possibly incomplete.
Through route policy or configuration mechanisms, a BGP or static route for an IP prefix can have a source class index (1 to 255), a destination class index (1 to 255) or both. When an ingress packet on a policy accounting-enabled interface [I1] is forwarded by the IOM and its destination address matches a BGP or static route with a destination class index [D], and [D] is listed in the relevant policy accounting template, then the packets-forwarded and IP-bytes-forwarded counters for [D] on interface [I1] are incremented accordingly. If [D] is also associated with a policer (FP4 only) the packet is also subjected to rate limiting as discussed above. The policer statistics displayed by the show router interface policy-accounting command include Layer 2 encapsulation and therefore differ from the destination-class byte-level statistics.
When an ingress packet on a policy accounting-enabled interface [I2] is forwarded by the IOM and its source address matches a BGP or static route with a source class index [S], and [S] is listed in the relevant policy accounting template, the packets-forwarded and IP-bytes-forwarded counters for [S] on interface [I2] are incremented accordingly. Policing based on the source class is unsupported.
It is possible that different BGP or static routes for the same IP prefix (through different next hops) are associated with different class information. If these routes are combined in support of ECMP or fast reroute then the destination class of a packet depends on the next hop that is selected for that particular packet by the ECMP hash or fast reroute algorithm. If the source address of a packet matches a route with multiple next hops its source class is derived from the first next hop of the matching route.
Route flap damping (RFD) is a mechanism supported by 7450, 7750, and 7950 routers, as well as other BGP routers, that was designed to help improve the stability of Internet routing by mitigating the impact of route flaps. Route flaps describe a situation where a router alternately advertises a route as reachable and then unreachable or as reachable through one path and then another path in rapid succession. Route flaps can result from hardware errors, software errors, configuration errors, unreliable links, and so on. However, not all perceived route flaps represent a true problem; when a best path is withdrawn the next-best path may not be immediately known and may trigger a number of intermediate best path selections (and corresponding advertisements) before it is found. These intermediate best path selections may travel at different speeds through different routers due to the effect of the min-route-advertisement interval (MRAI) and other factors. RFD does not handle this type of situation particularly well and for this and other reasons many Internet service providers do not use RFD.
In SR OS route flap damping is configurable; by default, it is disabled. It can be enabled on EBGP and confed-EBGP sessions by including the damping command in their group or neighbor configuration. The damping command has no effect on IBGP sessions. When a route of any type (any AFI/SAFI) is received on a non-IBGP session that has damping enabled:
SR OS implements the following features related to RIB-OUT processing:
These features are discussed in the following sections.
The export command is used to apply one or more policies (up to 15) to a neighbor, group or to the entire BGP context. The export command that is most-specific to a peer is the one that is applied. An export policy command applied at the neighbor level takes precedence over the same command applied at the group or global level. An export policy command applied at the group level takes precedence over the same command specified on the global level. The export policies applied at different levels are not cumulative. The policies listed in an export command are evaluated in the order in which they are specified.
Note: The export command can reference a policy before it has been created (as a policy-statement).
The most common uses for BGP export policies are as follows:
Outbound route filtering (ORF) is a mechanism that allows one router, the ORF-sending router to signal to a peer, the ORF-receiving router, a set of route filtering rules (ORF entries) that the ORF-receiving router should apply to its route advertisements towards the ORF-sending router. The ORF entries are encoded in Route Refresh messages.
The use of ORF on a session must be negotiated; that is, both routers must advertise the ORF capability in their Open messages. The ORF capability describes the address families that support ORF, and for each address family, the ORF types that are supported and the ability to send/receive each type. 7450, 7750, and 7950 routers support ORF type 3, which is ORF based on Extended Communities. It is supported for only the following address families:
In SR OS the send/receive capability for ORF type 3 is configurable (with the send-orf and accept-orf commands) but the setting applies to all supported address families.
SR OS support for ORF type 3 allows a PE router that imports VPN routes with a particular set of Route Target Extended Communities to indicate to a peer (for example a route reflector) that it only wants to receive VPN routes that contain one or more of these Extended Communities. When the PE router wants to inform its peer about a new RT Extended Community it sends a Route Refresh message to the peer containing an ORF type 3 entry instructing the peer to add a permit entry for the 8-byte extended community value. When the PE router wants to inform its peer about a RT Extended Community that is no longer needed it sends a Route Refresh message to the peer containing an ORF type 3 entry instructing the peer to remove the permit entry for the 8-byte extended community value.
In SR OS the type-3 ORF entries that are sent to a peer can be generated dynamically (if no Route Target Extended Communities are specified with the send-orf command) or else specified statically. Dynamically generated ORF entries are based on the route targets that are imported by all locally-configured VPRNs.
A router that has installed ORF entries received from a peer can still apply BGP export policies to the session. If the evaluation of a BGP export policy results in a reject action for a VPN route that matches a permit ORF entry the route is not advertised — i.e. the export policy has the final word.
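The interaction between installed ORF entries and export policies can be sketched as follows. This is an illustrative Python model, not SR OS code: the class OrfPeerState, the "add"/"remove" action strings, and the route-target string format are all hypothetical, and the model only covers the case where the peer has installed at least one entry.

```python
from typing import Callable, Set


class OrfPeerState:
    """ORF type 3 permit entries received from one peer (hypothetical model)."""

    def __init__(self) -> None:
        self.permitted_rts: Set[str] = set()

    def apply_refresh(self, action: str, route_target: str) -> None:
        # A Route Refresh message carries an entry that adds or removes a permit.
        if action == "add":
            self.permitted_rts.add(route_target)
        elif action == "remove":
            self.permitted_rts.discard(route_target)
        else:
            raise ValueError(f"unknown ORF action: {action}")

    def should_advertise(self, route_rts: Set[str],
                         export_policy_accepts: Callable[[Set[str]], bool]) -> bool:
        # The VPN route must match at least one permit entry AND be accepted
        # by the export policy; a policy reject overrides an ORF permit.
        matches_orf = bool(route_rts & self.permitted_rts)
        return matches_orf and export_policy_accepts(route_rts)
```

A route whose route target matches a permit entry is still withheld when the export policy callback returns False, which mirrors the "export policy has the final word" behavior described above.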
Note: The SR OS implementation of ORF filtering is very efficient. It takes less time to filter a large number of VPN routes with ORF than it does to reject non-matching VPN routes using a conventional BGP export policy.
Despite the advantages of ORF over manually configured BGP export policies, RT Constraint is a better technology for dynamic filtering based on Route Target Extended Communities. RT Constraint is discussed further in the next section.
RT constrained route distribution, or RT-constrain for short, is a mechanism that allows a router to advertise to certain peers a special type of MP-BGP route called an RTC route; the associated AFI is 1 and the SAFI is 132. The NLRI of an RTC route encodes an Origin AS and a Route Target Extended Community with prefix-type encoding (that is, there is a prefix-length and “host” bits after the prefix-length are set to zero). A peer receiving RTC routes does not advertise VPN routes to the RTC-sending router unless they contain a Route Target Extended Community that matches one of the received RTC routes. As with any other type of BGP route, RTC routes are propagated loop-free throughout and between Autonomous Systems. If there are multiple RTC routes for the same NLRI the BGP decision process selects one as the best path. The propagation of the best path installs RIB-OUT filter rules as it travels from one router to the next and this process creates an optimal VPN route distribution tree rooted at the source of the RTC route.
Note: RT-constrain and Extended Community-based ORF are similar to the extent that they both allow a router to signal to a peer the Route Target Extended Communities they want to receive in VPN routes from that peer. But RT-constrain has distinct advantages over Extended Community-based ORF: it is more widely supported, it is simpler to configure, and its distribution scope is not limited to a direct peer.
In SR OS the capability to exchange RTC routes is advertised when the route-target keyword is added to the relevant family command. RT-constrain is supported on EBGP and IBGP sessions of the base router instance. On any particular session either ORF or RT-constrain may be used but not both; if RT-constrain is configured the ORF capability is not announced to the peer.
When RT-constrain has been negotiated with one or more peers SR OS automatically originates and advertises to these peers one /96 RTC route (the origin AS and Route Target Extended Community are fully specified) for every route target imported by a locally-configured VPRN or BGP-based L2 VPN; this includes MVPN-specific route targets.
SR OS also supports a group/neighbor level default-route-target command that causes routers to generate and send a 0:0:0/0 default RTC route to one or more peers. Sending the default RTC route to a peer conveys a request to receive all VPN routes from that peer. The default-route-target command is typically configured on sessions that a route reflector has with its PE clients. A received default RTC route is never propagated to other routers.
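The /96 and /0 prefix lengths mentioned above follow from the NLRI layout: a 4-byte origin AS (32 bits) followed by an 8-byte Route Target Extended Community (64 bits). The Python sketch below shows one way to model prefix-style matching of a VPN route's route target against an RTC route. The function names are hypothetical, and the choice to compare only the route-target bits (a VPN route's route target carries no origin AS) is an assumption of this sketch, not a statement about the SR OS implementation.

```python
import struct

ORIGIN_AS_BITS = 32  # 4-byte origin AS
RT_BITS = 64         # 8-byte Route Target Extended Community


def rtc_nlri_value(origin_as: int, route_target: bytes) -> int:
    """Pack origin AS and route target into one 96-bit integer."""
    if len(route_target) != 8:
        raise ValueError("a Route Target Extended Community is 8 bytes")
    return int.from_bytes(struct.pack("!I", origin_as) + route_target, "big")


def rtc_permits(rtc_value: int, prefix_len: int, route_target: bytes) -> bool:
    """Check a VPN route's route target against one received RTC route."""
    if not 0 <= prefix_len <= ORIGIN_AS_BITS + RT_BITS:
        raise ValueError("prefix length must be between 0 and 96")
    significant = max(0, prefix_len - ORIGIN_AS_BITS)  # RT bits to compare
    if significant == 0:
        return True  # for example, the default RTC route 0:0:0/0
    shift = RT_BITS - significant
    rtc_rt = rtc_value & ((1 << RT_BITS) - 1)  # drop the origin AS bits
    vpn_rt = int.from_bytes(route_target, "big")
    return (rtc_rt >> shift) == (vpn_rt >> shift)
```

With a fully specified /96 RTC route only an identical route target matches, whereas the default /0 RTC route matches every route target, which is why it conveys a request for all VPN routes.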
The advertisement of RTC routes by a route reflector follows special rules that are described in RFC 4684. These rules are needed to ensure that RTC routes for the same NLRI that are originated by different PE routers in the same Autonomous System are properly distributed within the AS.
When a BGP session comes up, and RT-constrain is enabled on the session (both peers advertised the MP-BGP capability), routers delay sending any VPN-IPv4 and VPN-IPv6 routes until either the session has been up for 60 seconds or the End-of-RIB marker is received for the RT-constrain address family. When the VPN-IPv4 and VPN-IPv6 routes are sent they are filtered to include only those with a Route Target Extended Community that matches an RTC route from the peer. VPN-IP routes matching an RTC route originated in the local AS are advertised to any IBGP peer that advertises a valid path for the RTC NLRI — in other words, route distribution is not constrained to only the IBGP peer advertising the best path. On the other hand, VPN-IP routes matching an RTC route originated outside the local AS are only advertised to the EBGP or IBGP peer that advertises the best path.
Note: SR OS does not support an equivalent of BGP-Multipath for RT-Constrain routes. There is no way to distribute VPN routes across more than one ‘almost’ equal set of inter-AS paths.
According to the BGP standard (RFC 4271), a BGP router should not send updated reachability information for an NLRI to a BGP peer until a certain period of time (Min Route Advertisement Interval) has elapsed since the last update. The RFC suggests the MRAI should be configurable per peer but does not propose a specific algorithm, and therefore, MRAI implementation details vary from one router operating system to another.
In SR OS, the MRAI is configurable, on a per-session basis, using the min-route-advertisement command. The min-route-advertisement command can be configured with any value between 1 and 255 seconds and the setting applies to all address families. The default value is 30 seconds, regardless of the session type (EBGP or IBGP). The MRAI timer is started at the configured value when the session is established and counts down continuously, resetting to the configured value whenever it reaches zero. Every time it reaches zero, all pending RIB-OUT routes are sent to the peer.
To send UPDATE messages that advertise new NLRI reachability information more frequently for some address families than others, SR OS offers a rapid-update command that overrides the remaining time on a peer's MRAI timer and immediately sends routes belonging to specified address families (and all other pending updates) to the peers receiving these routes. The address families that can be configured with rapid-update support are:
In many cases, the default MRAI is appropriate for all address families (or at least those not included in the preceding list) when it applies to UPDATE messages that advertise reachable NLRI, but it is not the best option for UPDATE messages that advertise unreachable NLRI (route withdrawals). Fast re-convergence after some types of failures requires route withdrawals to propagate to other routers as quickly as possible so that they can calculate and start using new best paths, which would be impeded by the effect of the MRAI timer at each router hop. This is facilitated by the rapid-withdrawal configuration command.
When rapid-withdrawal is configured, UPDATE messages containing withdrawn NLRI are sent immediately to a peer without waiting for the MRAI timer to expire. UPDATE messages containing reachable NLRI continue to wait for the MRAI timer to expire, or for a rapid-update trigger, if it applies. When rapid-withdrawal is enabled, it applies to all address families.
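The timer behavior described above can be approximated with a small Python sketch. It assumes time is measured in seconds from session establishment, that the timer therefore expires at multiples of the configured interval, and it ignores rapid-update triggers; the function name send_time and its parameters are hypothetical.

```python
import math


def send_time(queued_at: float, mrai: int,
              is_withdrawal: bool = False,
              rapid_withdrawal: bool = False) -> float:
    """Seconds after session establishment at which a queued update is sent."""
    if not 1 <= mrai <= 255:
        raise ValueError("min-route-advertisement accepts 1 to 255 seconds")
    if is_withdrawal and rapid_withdrawal:
        return float(queued_at)  # withdrawals bypass the MRAI timer
    # The timer free-runs from session establishment and reaches zero at
    # mrai, 2 * mrai, 3 * mrai, ...; pending routes are sent at the next expiry.
    expiries = max(1, math.ceil(queued_at / mrai))
    return float(expiries * mrai)
```

With the default interval of 30 seconds, a reachability change queued 31 seconds into the session waits until the 60-second mark, while a withdrawal queued at the same moment is sent immediately if rapid-withdrawal is enabled.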
When there is a change to a labeled-unicast route that requires reprogramming of the label operations in the dataplane, these IOM updates are not made until the changed route is advertised to a peer, which depends on MRAI. Lowering the MRAI value or using rapid-update improves the speed of this operation.
BGP does not allow a route to be advertised unless it is the best path in the RIB and an export policy allows the advertisement.
In some cases, it may be useful to advertise the best BGP path to peers even though it is inactive, for example because there are one or more preferred non-BGP routes to the same destination and one of these other routes is the active route. One way SR OS supports this flexibility is using the advertise-inactive command; other methods include Best-External and Add-Paths.
When the BGP advertise-inactive command is configured so that it applies to a BGP session it has the following effect on the IPv4, IPv6, mcast-ipv4, mcast-ipv6, label-IPv4 and label-IPv6 routes advertised to that peer:
Best-External is a BGP enhancement that allows a BGP speaker to advertise to its IBGP peers its best “external” route for a prefix/NLRI when its best overall route for the prefix/NLRI is an “internal” route. This is not possible in a normal BGP configuration because the base BGP specification prevents a BGP speaker from advertising a non-best route for a destination.
In certain topologies Best-External can improve convergence times, reduce route oscillation and allow better loadsharing. This is achieved because routers internal to the AS have knowledge of more exit paths from the AS. Enabling Add-Paths on border routers of the AS can achieve a similar result but Add-Paths introduces NLRI format changes that must be supported by BGP peers of the border router and therefore has more interoperability constraints than Best-External (which requires no messaging changes).
Best-External is supported in the base router BGP context. (A related feature is also supported in VPRNs; consult the Services Guide for more details.) It is configured using the advertise-external command, which provides IPv4, label-IPv4, IPv6, and label-IPv6 as options.
The advertisement rules when advertise-external is enabled can be summarized as follows:
Note: A route reflector with advertise-external enabled does not include IBGP routes learned from other clusters in its definition of ‘external’.
Note: If the best-external route is not the best overall route it is not installed in the forwarding table and in some cases this can lead to a short-duration traffic loop after failure of the overall best path.
Add-Paths is a BGP enhancement that allows a BGP router to advertise multiple distinct paths for the same prefix/NLRI. Add-Paths provides a number of potential benefits, including reduced routing churn, faster convergence, and better loadsharing.
For a router to receive multiple paths per NLRI from a peer, for a particular address family, the peer must announce the BGP capability to send multiple paths for the address family and the local router must announce the BGP capability to receive multiple paths for the address family. When the Add-Path capability has been negotiated this way, all advertisements and withdrawals of NLRI by the peer must include a path identifier. The path identifier has no significance to the receiving router. If the combination of NLRI and path identifier in an advertisement from a peer is unique (does not match an existing route in the RIB-IN from that peer) then the route is added to the RIB-IN. If the combination of NLRI and path identifier in a received advertisement is the same as an existing route in the RIB-IN from the peer then the new route replaces the existing one. If the combination of NLRI and path identifier in a received withdrawal matches an existing route in the RIB-IN from the peer, then that route is removed from the RIB-IN.
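The RIB-IN handling rules above amount to keying each route from a peer on the combination of NLRI and path identifier. A minimal Python sketch of that idea follows; the class AddPathRibIn and its method names are hypothetical and the path attributes are reduced to a plain dictionary.

```python
from typing import Dict, List, Tuple


class AddPathRibIn:
    """Per-peer RIB-IN keyed by (NLRI, path identifier); illustrative only."""

    def __init__(self) -> None:
        self.routes: Dict[Tuple[str, int], dict] = {}

    def advertise(self, nlri: str, path_id: int, attrs: dict) -> str:
        key = (nlri, path_id)
        # A new combination is added; an existing combination is replaced.
        outcome = "replaced" if key in self.routes else "added"
        self.routes[key] = attrs
        return outcome

    def withdraw(self, nlri: str, path_id: int) -> bool:
        # Only the route with the matching NLRI and path identifier is removed.
        return self.routes.pop((nlri, path_id), None) is not None

    def paths_for(self, nlri: str) -> List[dict]:
        return [attrs for (n, _), attrs in self.routes.items() if n == nlri]
```

Two advertisements for the same prefix with different path identifiers therefore coexist in the RIB-IN, and withdrawing one of them leaves the other in place.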
An UPDATE message carrying an IPv4 NLRI with a path identifier is shown in Figure 26.
Add-Paths is only supported by the base router BGP instance and the EBGP and IBGP sessions it forms with other peers capable of Add-Paths. The ability to send and receive multiple paths per prefix is configurable per family, with the supported options being:
The local RIB may have multiple paths for a prefix. The path selection mode refers to the algorithm used to decide which paths to advertise to an add-paths peer. SR OS supports a send N path selection algorithm (refer to draft-ietf-idr-add-paths-guidelines) and a send multipaths selection algorithm.
The send N algorithm selects the N best advertisable paths that meet these constraints:
The send multipaths algorithm selects the N best advertisable paths that meet these constraints:
Split-horizon refers to the action taken by a router to avoid advertising a route back to the peer from which it was received. By default, SR OS applies split-horizon behavior only to routes received from IBGP non-client peers, and only to non-imported routes within a RIB. This split-horizon functionality, which can never be disabled, prevents a route learned from a non-client IBGP peer from being advertised to the sending peer or any other non-client peer.
To apply split-horizon behavior to routes learned from RR clients, confed-EBGP peers or (non-confed) EBGP peers the split-horizon command must be configured in the appropriate contexts; it is supported at the global BGP, group and neighbor levels. When split-horizon is enabled on these types of sessions, it only prevents the advertisement of a route back to its originating peer; for example, SR OS does not prevent the advertisement of a route learned from one EBGP peer back to a different EBGP peer in the same neighbor AS.
The BGP Monitoring Protocol (BMP) allows a monitoring station to obtain route updates and statistics from a BGP router. The BMP protocol is described in detail in RFC 7854, BGP Monitoring Protocol (BMP). A router communicates information about one or more BGP sessions to a BMP station. Specifically, BMP allows a BGP router to advertise the pre-policy or post-policy BGP RIB-In from specific BGP peers to a monitoring station. This allows the monitoring station to monitor the routing table size, identify issues, and monitor trends in the table size and in the update or withdrawal frequency. The BMP station is also sometimes called a BMP collector. A router sends information in BMP messages to a BMP station.
BMP is a unidirectional protocol. A BMP station never sends back any messages to a router.
BMP allows a router to report different types of information.
BMP on an SR OS router reports information about routes that were received from a neighbor. SR OS cannot report routes that were sent to a neighbor.
When periodic statistics are enabled, the router sends all the statistics as described in RFC 7854, section 4.8, with the exception of statistic number 13, “Number of duplicate update messages received”. The supported statistics are listed in Table 9.
Statistic | Type |
Number of Prefixes rejected by inbound policy | 0 |
Number of duplicate prefix advertisements received | 1 |
Number of duplicate withdraws received | 2 |
Number of invalidated prefixes due to Cluster_List loop detection | 3 |
Number of invalidated prefixes due to AS_PATH loop detection | 4 |
Number of invalidated prefixes due to Originator ID validation | 5 |
Number of invalidated prefixes due to AS-Confed loop detection | 6 |
Total number of routes in adj-rib-in (all families) | 7 |
Total number of routes in Local-RIB (all families) | 8 |
Number of routes per address-family in adj-rib-in | 9 |
Number of routes per address-family in loc-rib | 10 |
Number of updates subjected to treat-as-withdraw | 11 |
Number of prefixes subjected to treat-as-withdraw | 12 |
Note: Statistics 9 and 10 are per address family. The address family is specified as an AFI/SAFI pair. Regardless of which families are configured for route-monitoring, a router reports the statistics of all address families that were negotiated with the neighbor. The values in these counters are the same values that can be seen in the show>router>bgp>neighbor ip-address [detail] command in the CLI.
SR OS implements the following BGP applications:
FlowSpec is a standardized method for using BGP to distribute traffic flow specifications (flow routes) throughout a network. A flow route carries a description of a flow in terms of packet header fields such as source IP address, destination IP address, or TCP/UDP port number and indicates (through a community attribute) an action to take on packets matching the flow. The primary application for FlowSpec is DDoS mitigation.
FlowSpec is supported for both IPv4 and IPv6. To exchange IPv4 FlowSpec routes with a BGP peer the flow-ipv4 keyword must be part of the family command that applies to the session and to exchange IPv6 FlowSpec routes with a BGP peer flow-ipv6 must be present in the family configuration.
The NLRI of an IPv4 flow route can contain one or more of the subcomponents shown in Table 10.
Subcomponent Name [Type] | Value Encoding | SR OS Support |
Destination IPv4 Prefix [1] | Prefix length, prefix | Yes |
Source IPv4 Prefix [2] | Prefix length, prefix | Yes |
IP Protocol [3] | One or more (operator, value) pairs | Partial. No support for multiple values other than “TCP or UDP”. |
Port [4] 1 | One or more (operator, value) pairs | Yes |
Destination Port [5] | One or more (operator, value) pairs | Yes |
Source Port [6] | One or more (operator, value) pairs | Yes |
ICMP Type [7] | One or more (operator, value) pairs | Partial. Only a single value is supported. |
ICMP Code [8] | One or more (operator, value) pairs | Partial. Only a single value is supported. |
TCP Flags [9] 2 | One or more (operator, bitmask) pairs | Yes |
Packet Length [10] | One or more (operator, value) pairs | Yes |
DSCP [11] | One or more (operator, value) pairs | Partial. Only a single value is supported. |
Fragment [12] | One or more (operator, bitmask) pairs | Partial. No support for matching DF bit, first-fragment or last-fragment. |
Notes:
The NLRI of an IPv6 flow route can contain one or more of the subcomponents shown in Table 11.
Subcomponent Name [Type] | Value Encoding | SR OS Support |
Destination IPv6 Prefix [1] | Prefix length, prefix offset, prefix | Partial. No support for prefix offset. |
Source IPv6 Prefix [2] | Prefix length, prefix offset, prefix | Partial. No support for prefix offset. |
Next Header [3] | One or more (operator, value) pairs | Partial. Only a single value supported. |
Port [4] 1 | One or more (operator, value) pairs | Yes |
Destination Port [5] | One or more (operator, value) pairs | Yes |
Source Port [6] | One or more (operator, value) pairs | Yes |
ICMP Type [7] | One or more (operator, value) pairs | Partial. Only a single value is supported. |
ICMP Code [8] | One or more (operator, value) pairs | Partial. Only a single value is supported. |
TCP Flags [9] | One or more (operator, bitmask) pairs | Partial. Only SYN and ACK flags can be matched. |
Packet Length [10] | One or more (operator, value) pairs | Yes |
Traffic Class [11] | One or more (operator, value) pairs | Partial. Only a single value is supported. |
Fragment [12] | One or more (operator, bitmask) pairs | Partial. No support for matching Last Fragment. |
Flow Label [13] | One or more (operator, value) pairs | Partial. Only a single value is supported. |
Note:
Table 12 summarizes the actions that may be associated with IPv4 flow-spec routes. Table 13 summarizes the actions that may be associated with IPv6 flow-spec routes.
Action | Encoding | SR OS Support |
rate limit | Extended community type 0x8006 | Yes |
sample/log | Extended community type 0x8007 S-bit | Yes |
next entry | Extended community type 0x8007 T-bit | — |
Redirect to VRF | Extended community type 0x8008 | Yes |
Mark traffic class | Extended community type 0x8009 | Yes |
Redirect to IPv4 | Extended community type 0x010c | Yes |
Redirect to IPv6 | Extended community type 0x000c | — |
Redirect to LSP | Extended community type 0x0900 | Partial, only support for ID-type 0x00 (localized ID) |
Action | Encoding | SR OS Support |
rate limit | Extended community type 0x8006 | Yes |
sample/log | Extended community type 0x8007 S-bit | Yes |
next entry | Extended community type 0x8007 T-bit | — |
Redirect to VRF | Extended community type 0x8008 | Yes |
Mark traffic class | Extended community type 0x8009 | Yes |
Redirect to IPv4 | Extended community type 0x010c | — |
Redirect to IPv6 | Extended community type 0x000c | Yes |
Redirect to LSP | Extended community type 0x0900 | Partial, only support for ID-type 0x00 (localized ID) |
Received FlowSpec-IPv4 and FlowSpec-IPv6 routes are validated following the procedures documented in RFC 5575 and draft-ietf-idr-bgp-flowspec-oid-03, Revised Validation Procedure for BGP Flow Specifications. Configure the validate-dest-prefix command in a routing instance for the validation checks based on destination prefix to be applied. By default, no checking is done. When the command is enabled, BGP determines whether a FlowSpec route is valid or invalid based on the following logic:
FlowSpec-IPv4 routes that are received with a redirect-to-IPv4 extended community action are also subject to a further set of validation checks. If the validate-redirect-ip command is enabled in the receiving BGP instance, then a FlowSpec-IPv4 route is considered invalid if it is deemed to have originated in a different AS than the IP route that resolves the redirection IPv4 address. The originating AS of a FlowSpec route is determined from its AS path.
A FlowSpec route that is determined to be invalid by any of the validation rules described earlier is retained in the BGP RIB, but not used for traffic filtering and not propagated to other BGP speakers.
FlowSpec routes received with a redirect-to-IPv4 or redirect-to-IPv6 extended community action are also subject to a further set of validation checks. If the config>router>bgp>flowspec>validate-redirect-ip command is enabled in the receiving BGP instance, then a FlowSpec route is considered invalid if it is deemed to have originated in a different AS than the IP route that resolves the redirection address. The originating AS of a FlowSpec route is determined from its AS path.
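The redirect validation check compares two originating ASes. The Python sketch below illustrates the comparison under stated assumptions: the AS path is modeled as a simple list with the originating AS last (the first AS added to the path), and two routes with empty AS paths are treated as originating in the same (local) AS. Both function names are hypothetical.

```python
from typing import List, Optional


def originating_as(as_path: List[int]) -> Optional[int]:
    """Return the originating AS, the first AS that was added to the path."""
    # In this model the path is stored left to right as received, so the
    # originating AS is the last element; an empty path yields None.
    return as_path[-1] if as_path else None


def redirect_ip_valid(flowspec_as_path: List[int],
                      resolving_route_as_path: List[int]) -> bool:
    """True when the FlowSpec route and the IP route that resolves the
    redirection address are deemed to originate in the same AS."""
    return originating_as(flowspec_as_path) == originating_as(resolving_route_as_path)
```

A FlowSpec route with AS path 64500 64510 passes the check when the resolving IP route has AS path 64501 64510, because both originate in AS 64510.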
When the base router BGP instance receives an IPv4 or IPv6 flow route and that route is valid/best, the system attempts to construct an IPv4 or IPv6 filter entry from the NLRI contents and the actions encoded in the UPDATE message. If successful, the filter entry is added to the system-created “fSpec-0” IPv4 embedded filter or to the “fSpec-0” IPv6 embedded filter. These embedded filters can be inserted into configured IPv4 and IPv6 filter policies that are applied to ingress traffic on a selected set of the base router IP interfaces. These interfaces can include network interfaces, IES SAP interfaces, and IES spoke SDP interfaces.
Similarly, filter entries can be added to system-created “fSpec-$vprnId” embedded filters for use with VPRN interfaces.
When FlowSpec rules are embedded into a user-defined filter policy, the insertion point of the rules is configurable through the offset parameter of the embed-filter command. The sum of the ip-filter-max-size and offset must not exceed the maximum filter entry-id range.
This feature allows the separate configuration of TTL propagation for in-transit and CPM-generated IP packets at the ingress LER within a BGP label route context.
For IPv4 and IPv6 packets forwarded using an RFC 3107 label route in the global routing instance, including label-IPv6, the following command specified with the all value enables TTL propagation from the IP header into all labels in the transport label stack:
The none value reverts to the default mode which disables TTL propagation from the IP header to the labels in the transport label stack.
These commands do not have a no version.
This feature does not impact packets forwarded over BGP shortcuts. The ingress LER operates in uniform mode by default and can be changed into pipe mode using the configuration of TTL propagation for RSVP or LDP LSP shortcut.
This feature configures the TTL propagation for transit packets at a router acting as an LSR for a BGP label route.
When an LSR swaps the BGP label for an IPv4 prefix packet, thus acting as an ABR, ASBR, or data-path Route-Reflector (RR) in the base routing instance, or swaps the BGP label for a vpn-IPv4 or vpn-IPv6 prefix packet, thus acting as an inter-AS Option B VPRN ASBR or VPRN data path Route-Reflector (RR), the all value of the following command enables TTL propagation of the decremented TTL of the swapped BGP label into all LDP or RSVP transport labels.
When an LSR swaps a label or stitches a label, it always writes the decremented TTL value into the outgoing swapped or stitched label. What the above CLI controls is whether this decremented TTL value is also propagated to the transport label stack pushed on top of the swapped or stitched label.
The none value reverts to the default mode which disables TTL propagation. This changes the existing default behavior which propagates the TTL to the transport label stack. When a customer upgrades, the new default takes effect. The above commands do not have a no version.
The following describes the behavior of LSR TTL propagation in a number of other use cases and indicates if the above CLI command applies or not:
BGP prefix origin validation is a solution developed by the IETF SIDR working group for reducing the vulnerability of BGP networks to prefix mis-announcements and certain man-in-the-middle attacks. BGP has traditionally relied on a trust model where it is assumed that when an AS originates a route it has the right to announce the associated prefix. BGP prefix origin validation takes extra steps to ensure that the origin AS of a route is valid for the advertised prefix.
7450, 7750, and 7950 routers support BGP prefix origin validation for IPv4 and IPv6 routes received from selected peers. When prefix origin validation is enabled on a base router BGP or VPRN BGP session using the enable-origin-validation command, every IPv4 and/or IPv6 route received from the peer is checked to determine whether the origin AS is valid for the received prefix. The origin AS is the first AS that was added to the AS_PATH attribute and indicates the autonomous system that originated the route.
For purposes of determining the origin validation state of received BGP routes, the router maintains an Origin Validation database consisting of static and dynamic entries. Each entry is called a VRP (Validated ROA Payload) and associates a prefix (range) with an origin AS.
Static VRP entries are configured using the static-entry command available in the config>router>origin-validation context of the base router. In SR OS, a static entry can express that a specific prefix and origin AS combination is either valid or invalid.
Dynamic VRP entries are learned from RPKI local cache servers and express valid origin AS and prefix combinations. The router communicates with RPKI local cache servers using the RPKI-RTR protocol. SR OS supports the RPKI-RTR protocol over TCP/IPv4 or TCP/IPv6 transport; TCP-MD5 and other forms of session security are not supported. 7450, 7750, and 7950 routers can set up an RPKI-RTR session using the base routing table (in-band) or the management router (out-of-band). For more information, refer to the origin-validation configuration command and show commands in the Router Configuration Guide.
An RPKI local cache server is one element of the larger RPKI system. The RPKI is a distributed database containing cryptographic objects relating to Internet Number resources. Local cache servers are deployed in the service provider network and retrieve digitally signed Route Origin Authorization (ROA) objects from Global RPKI servers. The local cache servers cryptographically validate the ROAs before passing the information along to the routers.
The algorithm used to determine the origin validation states of routes received over a session with enable-origin-validation configured uses the following definitions:
Using the above definitions, the origin validation state of a route is based on the following rules.
Consider the following example. Suppose the Origin Validation database has the following entries:
10.1.0.0/16-32, origin AS=5, dynamic
10.1.1.0/24-32, origin AS=4, dynamic
10.0.0.0/8-32, origin AS=5, static invalid
10.1.1.0/24-32, origin AS=4, static invalid
In this case, the origin validation states of the following routes are as indicated:
10.1.0.0/16 with AS_PATH {…5}: Valid
10.1.1.0/24 with AS_PATH {…4}: Invalid
10.2.0.0/16 with AS_PATH {…5}: Invalid
10.2.0.0/16 with AS_PATH {…6}: Not-Found
The origin validation state of a route can affect its ranking in the BGP decision process. When origin-invalid-unusable is configured, all routes that have an origin validation state of ‘Invalid’ are considered unusable by the best path selection algorithm, that is, they cannot be used for forwarding and cannot be advertised to peers.
If origin-invalid-unusable is not configured then routes with an origin validation state of ‘Invalid’ are compared to other “usable” routes for the same prefix according to the BGP decision process.
When compare-origin-validation-state is configured a new step is added to the BGP decision process after removal of invalid routes and before the comparison of local preference. The new step compares the origin validation state, so that a route with a ‘Valid’ state is preferred over a route with a ‘Not-Found’ state, and a route with a ‘Not-Found’ state is preferred over a route with an ‘Invalid’ state assuming that these routes are considered ‘usable’. The new step is skipped if the compare-origin-validation-state command is not configured.
Route policies can be used to attach an Origin Validation State extended community to a route received from an EBGP peer in order to convey its origin validation state to IBGP peers and save them the effort of repeating the Origin Validation database lookup. To add an Origin Validation State extended community encoding the ‘Valid’ result, the route policy should add a community list that contains a member in the format ext:4300:0. To add an Origin Validation State extended community encoding the ‘Not-Found’ result, the route policy should add a community list that contains a member in the format ext:4300:1. To add an Origin Validation State extended community encoding the ‘Invalid’ result, the route policy should add a community list that contains a member in the format ext:4300:2.
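The effect of the origin validation state on route ranking and the community member formats above can be summarized in a short Python sketch. It is a simplification under stated assumptions: the rest of the BGP decision process is reduced to a local preference comparison, and the names OriginValidationState, ovs_community, and rank_routes are hypothetical.

```python
from enum import IntEnum
from typing import List, Tuple


class OriginValidationState(IntEnum):
    # The numeric values follow the ext:4300:<n> members described above.
    VALID = 0
    NOT_FOUND = 1
    INVALID = 2


def ovs_community(state: OriginValidationState) -> str:
    """Community list member that encodes the given validation state."""
    return f"ext:4300:{int(state)}"


def rank_routes(routes: List[Tuple[str, OriginValidationState, int]],
                origin_invalid_unusable: bool = False,
                compare_origin_validation_state: bool = False) -> List[str]:
    """Rank (name, state, local_pref) routes for one prefix, best first."""
    usable = [r for r in routes
              if not (origin_invalid_unusable
                      and r[1] is OriginValidationState.INVALID)]

    def key(route: Tuple[str, OriginValidationState, int]) -> Tuple[int, int]:
        _, state, local_pref = route
        # The state comparison precedes local preference only when enabled.
        state_rank = int(state) if compare_origin_validation_state else 0
        return (state_rank, -local_pref)

    return [name for name, _, _ in sorted(usable, key=key)]
```

With neither option configured, an Invalid route with the highest local preference still ranks first; enabling compare-origin-validation-state moves it behind Valid and Not-Found routes, and enabling origin-invalid-unusable removes it from consideration.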
It is possible to leak a copy of a BGP route (including all its path attributes) from one routing instance RIB to another routing instance RIB of the same type (labeled or unlabeled) in the same router. Leaking is supported from the GRT to a VPRN, from one VPRN to another VPRN, and from a VPRN to the GRT. Any valid BGP route for an IPv4, IPv6, or label-IPv4 prefix can be leaked. A BGP route does not have to be the best path or used for forwarding in the source instance in order to be leaked.
An IPv4, IPv6, or label-IPv4 BGP route becomes a candidate for leaking to another instance when it is specially marked by a BGP import policy. This special marking is achieved by accepting the route with a bgp-leak action in the route policy. Routes that are candidates for leaking to other instances show a leakable flag in the output of various show router BGP commands. In order to copy a leakable BGP route received in a source instance S into the BGP RIB of a target instance T, the target instance must be configured with a leak-import policy that matches and accepts the leakable route. Different leak-import policies can be specified for each of the following RIBs: IPv4, label-IPv4, and IPv6. Up to 15 leak-import policies can be chained together for more complex use cases. The leak-import policies are configured in the config>router>bgp>rib-management>ipv4 context.
Note: Using a leak-import policy to change the BGP attributes of leaked routes (compared to the original source copy) is not supported. The only attribute that can be changed is the RTM preference.
In the target instance, leaked BGP routes are compared to other (leaked and non-leaked) BGP routes for the same prefix based on the complete BGP decision process. Leaked routes do not have information about the router ID and peer IP address of the original peer and use all-zero values for these properties.
BGP always tries to resolve the BGP next hop of a leaked route using the route and tunnel table of the original (source) routing instance and this resolution information is carried with the leaked route, avoiding the need to leak the resolving routes as well. If there is no resolving route or tunnel in the source instance, then the unresolved route cannot be leaked unless allow-unresolved-leaking is configured and the source routing instance is the GRT. In this case, the importing VPRN tries to resolve the BGP next hop of the leaked route by using its own route table (and according to its own BGP next-hop-resolution configuration options).
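The leak eligibility rules described so far can be sketched as a small decision function. This is an illustrative model only: the class, field, and function names are assumptions, and the function reduces policy evaluation to a single boolean input.

```python
from dataclasses import dataclass


@dataclass
class LeakCandidate:
    """Hypothetical view of a BGP route in the source instance."""
    leakable: bool            # accepted with the bgp-leak action on import
    resolved_in_source: bool  # next hop resolved by a source route or tunnel
    source_is_grt: bool       # source instance is the GRT


def can_leak(route: LeakCandidate,
             accepted_by_leak_import: bool,
             allow_unresolved_leaking: bool = False) -> bool:
    """Return True if the route may be copied into the target instance RIB."""
    if not route.leakable or not accepted_by_leak_import:
        return False
    if route.resolved_in_source:
        return True
    # Unresolved routes leak only from the GRT and only when
    # allow-unresolved-leaking is configured.
    return allow_unresolved_leaking and route.source_is_grt
```

Note that the sketch does not require the route to be the best path in the source instance, which matches the behavior described above.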
If a target instance has BGP multipath and ECMP enabled and some of the equal-cost best paths for a prefix are leaked routes, they can be used along with non-leaked best paths as ECMP next hops of the route.
When BGP fast reroute is enabled in a target instance T (for a particular IP prefix), BGP attempts to find a qualifying backup path by considering both leaked and non-leaked BGP routes. The backup path criteria are unchanged by this feature, that is, the backup path is the best remaining path after the primary paths and all paths with the same BGP next hops as the primary paths have been removed.
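The backup path criteria can be illustrated with a minimal Python sketch. It assumes the candidate paths have already been ranked by the BGP decision process (best first); the Path type, the function name, and the sample addresses are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Path(NamedTuple):
    name: str
    next_hop: str
    leaked: bool  # leaked and non-leaked paths are treated alike


def select_backup(ranked: Sequence[Path],
                  primaries: Sequence[Path]) -> Optional[Path]:
    """Return the best remaining path after removing the primary paths
    and every path that shares a BGP next hop with a primary path."""
    primary_next_hops = {p.next_hop for p in primaries}
    for path in ranked:  # ranked best first (assumption)
        if path.next_hop in primary_next_hops:
            continue  # covers the primaries themselves as well
        return path
    return None


ranked = [
    Path("A", "192.0.2.1", leaked=False),  # primary
    Path("B", "192.0.2.1", leaked=True),   # same next hop as the primary
    Path("C", "192.0.2.9", leaked=True),
    Path("D", "192.0.2.7", leaked=False),
]
backup = select_backup(ranked, primaries=ranked[:1])
```

In this example the leaked path C is chosen as the backup, because B shares its next hop with the primary path A.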
A leaked BGP route can be advertised to direct BGP neighbors of the target routing instance. The BGP next hop of a leaked route is automatically reset to the advertising router itself (next-hop-self behavior) whenever it is advertised to a peer of the target instance. Normal route advertisement rules apply, meaning that by default, the leaked route is advertised only if (in the target instance) it is the overall best path and is used as the active route to the destination and is not blocked by the IBGP-to-IBGP split-horizon rule.
A BGP route resolved in the source routing instance and leaked into a VPRN can be exported from the VPRN as a VPN-IPv4 or VPN-IPv6 route if it matches the VRF export policy. In this case, normal VPN export rules apply, meaning that by default, the leaked route is exported only if (in the VPRN) it is the overall best path and is used as the active route to the destination.
A BGP route that is unresolved in the GRT, leaked into a VPRN, and resolved by a BGP-VPN route in the VPRN cannot be exported from the VPRN as a VPN-IPv4 or VPN-IPv6 route unless it matches the VRF export policy and the VPRN is configured with the allow-bgp-vpn-export command.
Note: A leaked route cannot be exported as a VPN-IP route and then re-imported into another local VPRN.
BGP Route Reflectors (RRs) are used in networks to improve network scalability by eliminating or reducing the need for a full-mesh of IBGP sessions. When a BGP RR receives multiple paths for the same IP prefix, it typically selects a single best path to send to all clients. If the RR has multiple nearly-equal best paths and the tie-break is determined by the next-hop cost, the RR advertises the path based on its view of next-hop costs. The advertised route may differ from the path that a client would select if it had visibility of the same set of candidate paths and used its own view of next-hop costs.
Non-optimal advertisements by the RR can be a problem in hot-potato routing designs. Hot-potato routing aims to hand off traffic to the next AS using the closest possible exit point from the local AS. In this context, the closest exit point implies minimum IGP cost to reach the BGP next-hop. SR OS implements the hot-potato routing solution described in draft-ietf-idr-bgp-optimal-route-reflection.
Optimal Route Reflection (ORR) is supported in the base router BGP instance only. It applies to routes in the following address families: IPv4 unicast, label-IPv4, label-IPv6 (6PE), VPN-IPv4, and VPN-IPv6.
Note: For the RR to compare two VPN routes (and therefore for ORR to apply), the routes must contain the same RD and IP prefix information.
ORR locations are created when config>router>bgp>orr>location is configured. The RR can maintain information for a maximum of 255 ORR locations. A primary IPv4 or primary IPv6 address is required for each location; optionally, specify secondary and tertiary IPv4 and IPv6 addresses for the location. The IP addresses are used to find a node in the network topology that can serve as the root for SPF calculations. The IP addresses must correspond to loopback or system IP addresses of routers that participate in IGP protocols. The secondary and tertiary IP address parameters provide redundancy in case the node selected to be root for the SPF calculations disappears.
The route reflector's TE database, populated with information from local IGP instances or BGP-LS NLRI, is used to compute the SPF cost from each ORR location to IPv4 and IPv6 BGP next-hops in the candidate set of best paths. The use of BGP-LS allows the route reflector to learn IGP topology information for OSPF areas, IS-IS levels, and others in which the route reflector is not a direct participant.
To configure an ORR client, configure the cluster command for the BGP session to reference one of the defined ORR locations. The association of a client with an ORR location is not automatic. Choose an ORR location as close as possible to the client that is being configured. The allow-local-fallback option of the cluster command affects RR behavior when no BGP routes are reachable from the ORR location of the client. When allow-local-fallback is configured, the RR is allowed, in this circumstance only, to advertise the best reachable BGP path from its own topology location. If allow-local-fallback is not configured and this situation applies, then no route is advertised to the client.
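The effect of the ORR location and the allow-local-fallback option on the next-hop cost tie-break can be sketched as follows. This is a simplified model, not the SR OS implementation: the function name and the cost dictionaries are assumptions, the candidates are taken to be tied on all earlier decision steps, and a next hop that is missing from a dictionary is treated as unreachable from that vantage point.

```python
import math
from typing import Dict, Optional, Sequence


def orr_pick(candidates: Sequence[str],
             cost_from_location: Dict[str, float],
             cost_from_rr: Dict[str, float],
             allow_local_fallback: bool) -> Optional[str]:
    """Pick the BGP next hop to advertise to a client at an ORR location."""

    def best(costs: Dict[str, float]) -> Optional[str]:
        reachable = [(costs[nh], nh) for nh in candidates
                     if costs.get(nh, math.inf) < math.inf]
        # Equal costs fall back to address order; an arbitrary simplification.
        return min(reachable)[1] if reachable else None

    choice = best(cost_from_location)
    if choice is None and allow_local_fallback:
        # Nothing is reachable from the client's ORR location, so use
        # the route reflector's own view of next-hop costs.
        choice = best(cost_from_rr)
    return choice


next_hops = ["10.0.0.1", "10.0.0.2"]
rr_view = {"10.0.0.1": 5, "10.0.0.2": 20}
client_view = {"10.0.0.1": 50, "10.0.0.2": 10}
advertised = orr_pick(next_hops, client_view, rr_view, allow_local_fallback=False)
```

Here the RR would prefer 10.0.0.1 from its own location, but the client's ORR location is closer to 10.0.0.2, so that path is advertised to the client.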
Note: ORR is supported with Add-Paths; Add-Paths advertised to an ORR client are based on ORR location.
The tunnels used by the system for resolution of BGP next-hops or prefixes and BGP labeled unicast routes can be constrained using LSP administrative tags. Refer to the “LSP Tagging and Auto-Bind Using Tag Information” section of the 7450 ESS, 7750 SR, 7950 XRS, and VSR MPLS Guide for more information.
BGP-LS is a new BGP address family that is intended to distribute IGP topology information to external servers such as Application-Layer Traffic Optimization (ALTO) or Path Computation Element (PCE) servers. These external traffic engineering databases can then use this information when calculating optimal paths.
BGP-LS provides external ALTO and PCE servers with topology information for a multi-area or multi-level network. Through the use of one or two BGP-LS speakers per area or level, the external ALTO or PCE servers can receive full topology information for the entire network. The BGP-LS information can also be distributed through route reflectors that support BGP-LS to minimize the peering requirements.
Figure 27 shows a sample BGP-LS network.
The following BGP-LS components are currently supported.
Protocol-ID:
NLRI Types:
Node Descriptor TLVs:
Node Attribute TLVs:
Link Descriptors TLVs:
Link Attributes TLVs:
Prefix Descriptors TLVs:
Prefix Attributes TLVs:
SR OS can collect BGP-LU traffic statistics.
Traffic statistics can be collected on egress data paths. This requires the use of egress-statistics keyword when creating an import policy and that the BGP tunnel exists for the corresponding prefix. If multiple paths exist (for example, ECMP), a single statistical index is allocated and reflects the traffic sent over all paths.
Traffic statistics can also be collected on ingress data paths if the label is assigned and effectively advertised per prefix. This typically requires the use of advertise-label per-prefix when creating the import policy and applies whether or not the sr-label-index keyword is in use. However, there are cases where this may not result in a per-prefix label advertisement. When a non-BGP route (for example, a static route) is requested to be advertised (advertise-inactive) with a Label Per Prefix (LPP) policy but it exists as an active RTM route and as an inactive BGP route, the system does not use the LPP policy but instead uses the Label Per Next Hop (LPNH) policy. Statistics are not counted for this prefix. An imported (local loopback) SR label route can also be configured to use the ingress-statistics keyword by using a route table import policy under rib-management (either label-ipv4 or label-ipv6).
Overall, BGP-LU statistics apply at the:
Note:
Control messages sent over the BGP-LU tunnel are accounted for in traffic statistics.
BGP-LU statistics are not supported for imported LDP routes (ldp-bgp stitching) or for VPN labels (for example, inter-AS B or C).
BGP Egress Peer Engineering (BGP EPE) extends the source-based segment routing (SR) capabilities beyond the AS boundary toward directly attached BGP peers. Operators can use a central controller to enforce more programmatic control of traffic distribution across these BGP peering links.
An SR SID can be allocated to a BGP peering segment and advertised in BGP Link State (BGP-LS) toward a controller, such as the Nokia NSP. The instantiation of the following BGP peering segments is supported:
The controller includes the specific SID in the path for an SR-TE LSP or SR policy, which it programs at the head-end LER. EPE enables a head-end router to steer traffic across a downstream peering link to a node, for traffic optimization, resiliency, or load-balancing purposes.
Figure 28 shows an example use case for BGP EPE.
In this example, there are two IGP domains with eBGP running between R5 and R6, and between R5 and R8. These adjacencies are not visible to R1, which is external to the IGP domain. At R5, separate peer node SIDs are allocated for R6 and R8. The peer node SIDs are advertised to the Nokia NSP in BGP-LS. This allows the NSP to compute a path across either the R5-R6 adjacency or the R5-R8 adjacency by including the appropriate peer node SID in the path. In the preceding use case, the R5-R6 adjacency is preferable. This peer node SID can be included in either the SR-ERO of an SR-TE LSP that is computed by PCEP or the segment list of a BGP SR policy that is programmed at R1. Traffic on this LSP or SR policy is, therefore, steered across the required peering.
EPE is supported for BGP neighbors with either eBGP or iBGP sessions. SR peer node SIDs and peer adjacency SIDs are supported. The SID labels are dynamically allocated from the local label space on the node and advertised in BGP-LS using the encoding specified in section 4 of draft-ietf-idr-bgpls-segment-routing-epe-19.
The peering node can behave as both an LSR and an LER for steering traffic towards the peering segment. Both ILM and LTN entries are programmed for peer node SIDs and peer adjacency SIDs, with a label swap to or push of an implicit null label.
Note:
ECMP is supported by default if there are multiple peer adjacency SIDs. BGP will only allocate peer adjacency SIDs to the ECMP set of next hops toward the peer node. For non-ECMP next hops, only a peer adjacency SID is allocated and it is advertised if all ECMP sets go down.
LSP ping and LSP trace echo requests are supported by including a label representing a peer node SID or peer adjacency SID in a NIL FEC of the target FEC stack. An EPE router can validate and respond to an LSP ping or trace echo request containing this FEC.
In addition to enabling the BGP-LS route family for a BGP neighbor, the following CLI is required to send the Egress Peering Segments described in BGP Egress Peer Engineering for Segment Routing using the NLRI Type 2 with protocol ID set to BGP-EPE.
When egress-peer-engineering is administratively enabled, BGP registers with SR and the router starts advertising any peer node and peer adjacency SIDs in BGP-LS.
To allocate peer node and peer adjacency SIDs, use the following syntax to configure the egress-engineering command and enable BGP-EPE for a BGP neighbor or group.
The BGP egress-engineering at the neighbor level overrides the group level configuration. When a neighbor does not have an egress-engineering configuration context, the group configuration is inherited in the following cases.
When a neighbor has egress-engineering configured and in the default disabled state, egress-engineering is disabled for the neighbor, irrespective of the disabled, enabled, or no-context configuration at the group level. When a neighbor has egress-engineering configured and enabled, egress-engineering is enabled for the neighbor, irrespective of the disabled, enabled, or no-context configuration at the group level.
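The override rules can be expressed as a short Python sketch in which None stands for "no egress-engineering context", False for "configured but disabled", and True for "configured and enabled". The function name is hypothetical. The handling of the no-context neighbor is an assumption: the sketch treats the neighbor as enabled only when the group is configured and enabled.

```python
from typing import Optional


def effective_egress_engineering(neighbor: Optional[bool],
                                 group: Optional[bool]) -> bool:
    """Resolve whether egress-engineering is active for a neighbor.

    None  = no egress-engineering context configured at that level
    False = context configured, default disabled state
    True  = context configured and enabled
    """
    if neighbor is not None:
        # Neighbor-level configuration always overrides the group level.
        return neighbor
    # Assumption: with no neighbor context, the group state is inherited,
    # and a group without a context counts as disabled.
    return bool(group)
```

For example, a neighbor with a configured but disabled context stays disabled even when its group has egress-engineering enabled.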
By default, enabling egress-engineering at the peer or group level causes SID values (MPLS labels) to be dynamically allocated for the peer node segment and the peer adjacency segments. Although the labels are assigned when the neighbor or group is configured, they are not programmed until the adjacency comes up. Peer node segments are derived from the BGP next hops used to reach a specific peer. If the node reboots, these dynamically allocated label values may change and are re-announced in BGP-LS.
If a BGP neighbor goes down, the router advertises a delete for all SIDs associated with the neighbor and deprograms them from the IOM. However, the label values for the SIDs are not released and the router re-advertises the same values when the BGP neighbor comes back up.
If a BGP neighbor is deleted from the configuration or is shut down, or egress-engineering is disabled, the router advertises a delete for all SIDs associated with the neighbor and deprograms them from the IOM. The router also releases the label values for the SIDs.
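The difference between a neighbor going down and egress-engineering being removed can be modeled with a small state sketch. The class, its method names, and the starting label value are illustrative assumptions; the point is only that a down event keeps the label reserved, while disabling or deleting releases it.

```python
import itertools


class EpeSidModel:
    """Toy model of the label lifecycle for one peer SID."""

    def __init__(self, first_label: int = 1000):
        self._labels = itertools.count(first_label)  # stand-in label pool
        self.label = None        # allocated label value, if any
        self.programmed = False  # programmed in the data path
        self.events = []         # BGP-LS advertisements and deletes

    def enable(self):
        """egress-engineering enabled: allocate a label, do not program it."""
        if self.label is None:
            self.label = next(self._labels)

    def neighbor_up(self):
        if self.label is not None and not self.programmed:
            self.programmed = True
            self.events.append(("advertise", self.label))

    def neighbor_down(self):
        """Advertise a delete and deprogram, but keep the label reserved."""
        if self.programmed:
            self.programmed = False
            self.events.append(("delete", self.label))

    def disable(self):
        """Neighbor deleted or shut down, or egress-engineering disabled."""
        self.neighbor_down()
        self.label = None  # label value is released


sid = EpeSidModel()
sid.enable()
sid.neighbor_up()
sid.neighbor_down()
sid.neighbor_up()  # the same label value is re-advertised
```

After a disable followed by a new enable, the model may hand out a different label, which mirrors the released-label case described above.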
This feature allows BGP peers to be associated with a BFD session. If the BFD session fails, BGP peering is also torn down.
Figure 29 displays the process to provision basic BGP parameters.
This section describes BGP configuration caveats.
The following list summarizes the BGP configuration defaults:
The router implementation of the RFC 1657 MIB variables listed in Table 14 differs from the IETF MIB specification.
MIB Variable | Description | RFC 1657 Allowed Values | SR OS Allowed Values |
bgpPeerMinRouteAdvertisementInterval | Time interval in seconds for the MinRouteAdvertisementInterval timer. The suggested value for this timer is 30. | 1 to 65535 | 1 to 255. A value of 0 is supported when the rapid-update command is applied to an address family that supports it. |
If SNMP is used to set a value of X to the MIB variable in Table 14, there are three possible results:
Condition | Result |
X is within IETF MIB values and X is within SR OS values | SNMP set operation does not return an error; MIB variable set to X |
X is within IETF MIB values and X is outside SR OS values | SNMP set operation does not return an error; MIB variable set to “nearest” SR OS supported value (for example, SR OS range is 2 - 255 and X = 65535, MIB variable is set to 255); log message generated |
X is outside IETF MIB values and X is outside SR OS values | SNMP set operation returns an error |
When the value set using SNMP is within the IETF allowed values and outside the SR OS values as specified in Table 14 and Table 15, a log message is generated.
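The three outcomes of an SNMP set can be sketched as a small Python function. This is an illustrative model rather than SR OS code: the function name and the return tuple are assumptions, the default ranges are taken from Table 14, and the special value 0 that applies with rapid-update is not modeled.

```python
from typing import Optional, Tuple


def snmp_set(x: int,
             ietf: Tuple[int, int] = (1, 65535),
             sros: Tuple[int, int] = (1, 255)
             ) -> Tuple[bool, Optional[int], bool]:
    """Return (error, stored_value, log_generated) for a set to value x."""
    ietf_lo, ietf_hi = ietf
    sros_lo, sros_hi = sros
    if not ietf_lo <= x <= ietf_hi:
        # Outside the IETF MIB range: the set operation returns an error.
        return (True, None, False)
    if sros_lo <= x <= sros_hi:
        # Inside both ranges: the value is stored as requested.
        return (False, x, False)
    # Inside the IETF range only: store the nearest supported value
    # and generate a log message.
    nearest = min(max(x, sros_lo), sros_hi)
    return (False, nearest, True)
```

For example, snmp_set(65535) stores 255 and generates a log message, while snmp_set(70000) returns an error.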
The log messages that display are similar to the following log messages:
Sample Log Message for setting bgpPeerMinRouteAdvertisementInterval to 256
Sample Log Message for setting bgpPeerMinRouteAdvertisementInterval to 1