5. BGP

5.1. BGP Overview

Border Gateway Protocol (BGP) is an inter-Autonomous System routing protocol. An Autonomous System (AS) is a set of routers managed and controlled by a common technical administration. BGP-speaking routers establish BGP sessions with other BGP-speaking routers and use these sessions to exchange BGP routes. A BGP route provides information about a network path that can reach an IP prefix or other type of destination. The path information in a BGP route includes the list of ASes that must be traversed to reach the route source; this allows inter-AS routing loops to be detected and avoided. Other path attributes that may be associated with a BGP route include the Local Preference, Origin, Next-Hop, Multi-Exit Discriminator (MED) and Communities. These path attributes can be used to implement complex routing policies.

The primary use of BGP was originally Internet IPv4 routing but multi-protocol extensions to BGP have greatly expanded its applicability. Now BGP is used for many purposes, including:

  1. Internet IPv6 routing
  2. Inter-domain multicast support
  3. L3 VPN signaling (unicast and multicast)
  4. L2 VPN signaling (BGP auto-discovery for LDP-VPLS, BGP-VPLS, BGP-VPWS, multi-segment pseudowire routing, EVPN)
  5. Setup of inter-AS MPLS LSPs
  6. Distribution of flow specification rules (filters/ACLs)

The next sections provide information about BGP sessions, BGP network design, BGP messages and BGP path attributes.

5.2. BGP Sessions

A BGP session is a TCP connection formed between two BGP routers over which BGP messages are exchanged. There are three types of BGP sessions: internal BGP (IBGP), external BGP (EBGP), and confederation external BGP (confed-EBGP).

An IBGP session is formed when the two BGP routers belong to the same Autonomous System. Routes received from an IBGP peer are not advertised to other IBGP peers unless the router is a route reflector. The two routers that form an IBGP session are usually not directly connected. Figure 17 shows an example of two Autonomous Systems that use BGP to exchange routes. In this example the router ALA-A forms IBGP sessions with ALA-B and ALA-C.

An EBGP session is formed when the two BGP routers belong to different Autonomous Systems. Routes received from an EBGP peer can be advertised to any other peer. The two routers that form an EBGP session are often directly connected but multi-hop EBGP sessions are also possible. When a route is advertised to an EBGP peer the Autonomous System number(s) of the advertising router are added to the AS Path attribute. In the example of Figure 17 the router ALA-A forms an EBGP session with ALA-D.

Figure 17:  BGP Sessions 

A confederation EBGP session is formed when the two BGP routers belong to different member ASes of the same confederation. More details about BGP confederations are provided in the section titled BGP Confederations.

SR OS supports both statically configured and dynamic (unconfigured) BGP sessions. Dynamic sessions are supported by configuring one or more prefix commands in the dynamic-neighbor>match CLI context of a BGP group. Statically configured BGP sessions are configured using the neighbor command. This command accepts either an IPv4 or IPv6 address, which allows the session transport to be IPv4 or IPv6. By default, the router is the active side of TCP connections to statically configured remote peers, meaning that as soon as a session leaves the Idle state, the router attempts to set up an outgoing TCP connection to the remote neighbor in addition to listening on TCP port 179 for an incoming connection from the peer. If required, a statically configured BGP session can be configured for passive mode so that the router only listens for an incoming connection and does not attempt to set up the outgoing connection. The router always operates in passive mode with respect to its dynamic (unconfigured) sessions.

The source IP address used to set up the TCP connection to the statically configured or dynamic peer can be configured explicitly using the local-address command. If a local-address is not configured then the source IP address is determined as follows:

  1. If the neighbor’s IP address belongs to a local subnet the source IP address is this router’s IP address on that subnet
  2. If the neighbor’s IP address does not belong to a local subnet the source IP address is this router’s system IP address
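
The default selection rules above can be sketched as follows. This is a minimal illustration, not SR OS code: the function name, its parameters, and the representation of interfaces as CIDR strings are all assumptions made for the example.

```python
import ipaddress

def select_source_address(neighbor, local_interfaces, system_address,
                          local_address=None):
    """Return the source IP for the TCP connection to `neighbor`.

    local_interfaces: interface addresses in CIDR form, e.g. "10.0.0.1/30"
    (a hypothetical representation chosen for this sketch).
    """
    if local_address is not None:
        # An explicitly configured local-address always wins.
        return ipaddress.ip_address(local_address)
    peer = ipaddress.ip_address(neighbor)
    for cidr in local_interfaces:
        iface = ipaddress.ip_interface(cidr)
        # Rule 1: the neighbor belongs to a local subnet.
        if iface.version == peer.version and peer in iface.network:
            return iface.ip
    # Rule 2: otherwise fall back to the system IP address.
    return ipaddress.ip_address(system_address)
```

For example, a neighbor at 10.0.0.2 with a local interface 10.0.0.1/30 yields 10.0.0.1, while a neighbor that is not on any local subnet yields the system address.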

In addition, it is possible to configure the local address with the name of the router interface. To configure the BGP local address to use the router interface’s IP address information, the local-address command is used in conjunction with the router interface name (ip-int-name) parameter.

Configuring the router interface as the local address is available in both the config>router>bgp>group context and the config>router>bgp>group-neighbor context.

When the router interface is configured as the local address, BGP inherits the address from the interface as follows:

  1. BGPv4 sessions — the primary IPv4 address configured on the interface is used as the local address
  2. BGPv6 sessions — the primary IPv6 address configured on the interface is used as the local address

If the corresponding IPv4 or IPv6 address is not configured on the router interface, the BGP sessions that have this interface set as the local address are kept down until an interface address is configured on the router interface.

If the primary IPv4 or IPv6 address is changed on the router interface and that interface is being used as the local address for BGP, then BGP bounces the link. This removes all routes advertised using the previous address and starts advertising those routes again using the newly configured IP address.

5.2.1. BGP Session States

A BGP session is in one of the following states at any given moment in time:

  1. Idle — This is the state of a BGP session when it is administratively disabled with the shutdown command, indicated with (Shutdown). In this state no incoming TCP connection is accepted from the peer. When the session is administratively enabled it transitions out of the Idle state immediately. When the session is restarted automatically it may not leave the Idle state immediately if damp-peer-oscillations is configured. The damp-peer-oscillations command holds a session in the Idle state for exponentially increasing amounts of time if the session is unstable and resets frequently.
  2. Connect — This is the state of a BGP session when the router, acting in Active mode, is attempting to establish an outbound TCP connection with the remote peer.
  3. Active — This is the state of a BGP session when the router is listening for an inbound TCP connection attempt from the remote peer.
  4. OpenSent — This is the state of a BGP session when the router has sent an OPEN message to its peer in reaction to successful setup of the TCP connection and is waiting for an OPEN message from the peer.
  5. OpenConfirm — This is the state of a BGP session after the router has received an acceptable OPEN message from the peer and sent a KEEPALIVE message in response and is waiting for a KEEPALIVE message from the peer. TCP connection collision procedures may be performed at this stage. Refer to RFC 4271, A Border Gateway Protocol 4 (BGP-4), for more details.
  6. Established — This is the state of a BGP session after the router has received a KEEPALIVE message from the peer. In this state BGP can advertise and withdraw routes by sending UPDATE messages to its peer.
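
The successful path through these states can be sketched as a small transition table. This is a deliberately simplified model: the event names are hypothetical, only the happy path is listed, and any other event is treated as a failure that returns the session to Idle. The full state machine in RFC 4271 has more events and transitions than shown here.

```python
from enum import Enum

class State(Enum):
    IDLE = "Idle"
    CONNECT = "Connect"
    ACTIVE = "Active"
    OPEN_SENT = "OpenSent"
    OPEN_CONFIRM = "OpenConfirm"
    ESTABLISHED = "Established"

# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.IDLE, "start_active"): State.CONNECT,    # router initiates TCP
    (State.IDLE, "start_passive"): State.ACTIVE,    # router only listens
    (State.CONNECT, "tcp_established"): State.OPEN_SENT,
    (State.ACTIVE, "tcp_established"): State.OPEN_SENT,
    (State.OPEN_SENT, "open_received"): State.OPEN_CONFIRM,
    (State.OPEN_CONFIRM, "keepalive_received"): State.ESTABLISHED,
}

def next_state(state, event):
    # Simplification: every unlisted (state, event) pair resets to Idle.
    return TRANSITIONS.get((state, event), State.IDLE)
```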

5.2.2. Detecting BGP Session Failures

If a router suspects that its peer at the other end of an established session has experienced a complete failure of both its control and data planes, the router should divert traffic away from the failed peer as quickly as possible to minimize traffic loss. There are various mechanisms that the router can use to detect such failures, including:

  1. BGP session hold timer expiry. See the section titled Keepalive Message for more details about this mechanism.
  2. Peer tracking
  3. BFD
  4. Fast external failover

When any one of these mechanisms is triggered the session immediately returns to the Idle state and a new session is attempted. Peer tracking, BFD and fast external failover are described in more detail in the following sections.

5.2.2.1. Peer Tracking

When peer tracking is enabled on a session, the neighbor IP address is tracked in the routing table. If a failure occurs and there is no longer any IP route matching the neighbor address, or else if the longest prefix match (LPM) route is rejected by the configurable peer-tracking-policy, then after a 1-second delay the session is taken down. By default, peer-tracking is disabled on all sessions. The default peer-tracking policy allows any type of route to match the neighbor IP address, except aggregate routes and LDP shortcut routes.
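
The peer tracking check can be sketched as a longest prefix match followed by a policy test. This is an illustration only: the route table is modeled as a list of (prefix, route type) tuples, and the route type labels "aggregate" and "ldp-shortcut" are hypothetical names chosen for the example.

```python
import ipaddress

def default_policy(route):
    # Default behavior described above: any route type may match the
    # neighbor address except aggregate and LDP shortcut routes.
    _prefix, route_type = route
    return route_type not in {"aggregate", "ldp-shortcut"}

def peer_is_tracked_up(neighbor, routes, policy=default_policy):
    """Return False when peer tracking would take the session down.

    routes: list of (prefix, route_type) tuples (hypothetical model).
    """
    peer = ipaddress.ip_address(neighbor)
    matches = []
    for prefix, route_type in routes:
        net = ipaddress.ip_network(prefix)
        if net.version == peer.version and peer in net:
            matches.append((net, route_type))
    if not matches:
        return False  # no IP route matches the neighbor address
    # Only the longest prefix match is evaluated against the policy.
    lpm = max(matches, key=lambda m: m[0].prefixlen)
    return policy(lpm)
```

Note that a less specific acceptable route does not rescue the session when the longest match is rejected by the policy.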

Peer tracking was introduced when BFD was not yet supported for peer failure detection. Now that BFD is available, peer-tracking has less value and is used less often.

Note:

Peer tracking should be used with caution. Peer tracking can tear a session down even if the loss of connectivity turns out to be short-lived — for example while the IGP protocol is re-converging. Next-hop tracking, which is always enabled, handles such temporary connectivity issues much more effectively.

5.2.2.2. Bidirectional Forwarding Detection (BFD)

SR OS also supports the option to set up an async-mode BFD session to a BGP neighbor so that failure of the BFD session can trigger immediate teardown of the BGP session. When BFD is enabled on a BGP session, a 1-hop or multi-hop BFD session is set up to the neighbor IP address and the BFD parameters come from the BFD configuration of the interface associated with the local-address; for multi-hop sessions this is typically the system interface. With a 10 ms transmit interval and a multiplier of 3, BFD can detect a peer failure in a period of time as short as 30 ms.

5.2.2.3. Fast External Failover

Fast external failover applies only to single-hop EBGP sessions. When fast external failover is enabled on a single-hop EBGP session and the interface associated with the session goes down the BGP session is immediately taken down as well, even if other mechanisms such as the hold-timer have not yet indicated a failure.

5.2.3. High Availability BGP Sessions

A BGP session reset can be very disruptive – each router participating in the failed session must delete the routes it received from its peer, recalculate new best paths, update forwarding tables (depending on the types of routes), and send route withdrawals and advertisements to other peers. It makes sense then that session resets should be avoided as much as possible and when a session reset cannot be avoided the disruption to the network should be minimized.

To support these objectives, the BGP implementation in SR OS supports two key features:

  1. BGP high availability (HA)
  2. BGP graceful restart (GR)

BGP HA refers to the capability of a router with redundant CPMs to keep established BGP sessions up whenever a planned or unplanned CPM switchover occurs. A planned CPM switchover can occur during In-Service Software Upgrade (ISSU). An unplanned CPM switchover can occur if there is an unexpected failure of the primary CPM.

BGP HA is always enabled on routers with redundant CPMs; it cannot be disabled. BGP HA keeps the standby CPM in-sync with the primary CPM, with respect to BGP and associated TCP state, so that the standby CPM is ready to take over for the primary CPM at any time. The primary CPM is responsible for building and sending the BGP messages to peers but the standby CPM reliably receives a copy of all outgoing UPDATE messages so that it has a synchronized view of the RIB-OUT.

5.2.3.1. BGP Graceful Restart

Some BGP routers do not have redundant control plane processor modules or do not support BGP HA with the same quality or coverage as 7450 ESS, 7750 SR, or 7950 XRS routers. When dealing with such routers or certain error conditions, BGP graceful restart (GR) is a good option for minimizing the network disruption caused by a control plane reset.

BGP GR assumes that the router restarting its BGP sessions has the ability and architecture to continue packet forwarding throughout the control plane reset. If this is the case, then the peers of the restarting router act as helpers and “hide” the control plane reset from the rest of the network so that forwarding can continue uninterrupted. Forwarding based on stale routes and hiding the “staleness” from other routers is considered acceptable because the duration of the control plane outage is expected to be relatively short (a few minutes). For BGP GR to be used on a session, both routers must advertise the BGP GR capability during the OPEN message exchange; see the BGP Advertisement section for more details.

BGP GR is enabled on one or more BGP sessions by configuring the graceful-restart command in the global, group, or neighbor context. The command causes GR mode to be supported for the following active families:

  1. IPv4 unicast
  2. IPv6 unicast
  3. VPN-IPv4
  4. VPN-IPv6
  5. Label-IPv4
  6. Label-IPv6
  7. L2-VPN
  8. route-target (RTC)
  9. Flow-IPv4 (IPv4 FlowSpec)
  10. Flow-IPv6 (IPv6 FlowSpec)

Helper mode is activated when one of the following events affects an Established session:

  1. TCP socket error
  2. New inbound TCP connection from the peer
  3. Hold timer expiry
  4. Peer unreachable
  5. BFD down
  6. Sent NOTIFICATION message (only if enable-notification is configured under graceful-restart, and the peer set the “N” bit in its GR capability, and the NOTIFICATION is not a Cease with subcode Hard Reset)
  7. Received NOTIFICATION message (only if enable-notification is configured under graceful-restart, and the peer set the “N” bit in its GR capability, and the NOTIFICATION is not a Cease with subcode Hard Reset)

As soon as the failure is detected, the helping 7450 ESS, 7750 SR, or 7950 XRS router marks all the routes received from the peer as stale and starts a restart timer. The stale state is not factored into the BGP decision process, and it is not made visible to other routers in the network. The restart timer derives its initial value from the Restart Time carried in the last GR capability of the peer. The default advertised Restart Time is 300 seconds, but it can be changed using the restart-time command.

When the restart timer expires, helping stops if the session is not yet re-established. If the session is re-established before the restart timer expires and the new GR capability from the restarting router indicates that the forwarding state has been preserved, then helping continues and the peers exchange routes per the normal procedure.

When each router has advertised all its routes for a specific address family, it sends an End-of-RIB marker (EOR) for the address family. The EOR is a minimal UPDATE message with no reachable or unreachable NLRI for the AFI or SAFI. When the helping router receives an EOR, it deletes all remaining stale routes of the AFI or SAFI that were not refreshed in the most recent set of UPDATE messages. The maximum amount of time that routes can remain stale (before being deleted if they are not refreshed) is configurable using the stale-routes-time command.
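
The helper-side bookkeeping described above (mark stale, refresh on UPDATE, purge on EOR) can be sketched as a toy model. The class and method names are hypothetical, timers are not modeled, and the assumption that all stale routes are dropped when helping stops is made for the sake of the example.

```python
class GrHelper:
    """Toy model of GR helper state for one peer (hypothetical names)."""

    def __init__(self, routes):
        # routes: {(family, prefix): attributes} learned from the peer
        self.routes = dict(routes)
        self.stale = set()

    def on_session_failure(self):
        # Keep every route for forwarding, but remember all are now stale.
        self.stale = set(self.routes)

    def on_update(self, family, prefix, attrs):
        # A re-advertised route is refreshed and is no longer stale.
        self.routes[(family, prefix)] = attrs
        self.stale.discard((family, prefix))

    def on_end_of_rib(self, family):
        # Delete stale routes of this family that were not refreshed.
        for key in [k for k in self.stale if k[0] == family]:
            del self.routes[key]
            self.stale.discard(key)

    def on_restart_timer_expiry(self):
        # Assumption: helping stops by dropping all remaining stale routes.
        for key in list(self.stale):
            del self.routes[key]
        self.stale.clear()
```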

Note:

If a second reset occurs before GR has successfully completed, the router always aborts the GR helper process, regardless of the failure trigger.

5.2.3.2. BGP Long-Lived Graceful Restart

SR OS supports Long-Lived Graceful Restart (LLGR). LLGR is supported for the same address families as normal GR, as described in BGP Graceful Restart.

The LLGR procedures adhere to draft-uttaro-idr-bgp-persistence-03. LLGR is intended to handle more serious and longer-term outages than ordinary GR.

SR OS routers support LLGR in the context of both the restarting router (which experienced a restart or failure) and the helper or receiving router (which is a peer of the failed router). Both functionalities are enabled and disabled at the same time by adding the long-lived command under a graceful-restart configuration context.

When long-lived is applied to a session (and capability negotiation is not disabled), the OPEN message sent to the peer includes both the GR capability and the LLGR capability. Both capabilities list the same set of AFI/SAFI.

5.2.3.2.1. LLGR Operations

If a BGP session protected by LLGR goes down due to a restart or failure of the peer, then the SR OS router activates GR+LLGR helper mode for all the protected AFI/SAFI. In GR+LLGR helper mode, the received routes of a particular AFI/SAFI are retained as stale routes for a maximum duration of:

restart-time + LLGR-stale-time

where:

  1. restart-time is signaled in the GR capability of the peer (but overridden, if necessary, by the locally-configured helper-override-restart-time command)
  2. LLGR-stale-time is signaled in the LLGR capability of the peer (but overridden, if necessary, by a locally-configured helper-override-stale-time command)
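
As a worked example of this sum, the sketch below computes the maximum retention time in seconds. The function and parameter names are hypothetical, and it assumes that a configured override simply replaces the value signaled by the peer.

```python
def max_stale_retention_seconds(peer_restart_time, peer_llgr_stale_time,
                                helper_override_restart_time=None,
                                helper_override_stale_time=None):
    # Assumption: a locally configured override replaces the signaled value.
    restart = (peer_restart_time if helper_override_restart_time is None
               else helper_override_restart_time)
    stale = (peer_llgr_stale_time if helper_override_stale_time is None
             else helper_override_stale_time)
    return restart + stale
```

With a signaled restart-time of 300 seconds and a signaled LLGR-stale-time of 3600 seconds, routes can be retained for at most 3900 seconds.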

While the restart-timer is running, the SR OS router acts in the normal GR helper role. When the restart-timer elapses, the LLGR phase begins. When LLGR starts, the following occur.

  1. The LLGR-stale-time starts to count down.
  2. Stale routes marked with the NO_LLGR community are immediately deleted.
  3. Remaining stale routes are not preferred. The BGP best path selection algorithm is rerun with a new first step that prefers valid, non-stale LLGR routes over any stale LLGR routes.
  4. If a de-preferenced stale route remains, the best and valid NH-reachable path for the NLRI is re-advertised, with an added LLGR_STALE community, to peers that signaled support for the LLGR capability. The route may be withdrawn or re-advertised towards peers that do not support LLGR, subject to the configuration of the advertise-stale-to-all-neighbors command and without-no-export parameter.

LLGR ends for a particular AFI/SAFI when the LLGR-stale-time reaches zero. At that time, all remaining stale routes of the AFI/SAFI are deleted. The LLGR-stale-time is not stopped by re-establishment of the session with the failed peer; it continues until the EOR marker is received for the AFI/SAFI.

Stale routes may be deleted prior to expiration of the LLGR-stale-time. If the session with the failed peer comes back up and one of the following is true, then the stale routes should be deleted immediately:

  1. the GR or LLGR capability is missing
  2. the AFI/SAFI is missing from the LLGR capability
  3. the F bit = 0 for the AFI/SAFI
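
These three conditions amount to a simple predicate evaluated per AFI/SAFI when the session comes back up. The sketch below is illustrative; the function name and the dictionary used to model the peer's LLGR capability are assumptions.

```python
def must_delete_stale_now(family, gr_capability_present, llgr_f_bits):
    """Decide whether stale routes of `family` are deleted immediately.

    llgr_f_bits: None if the peer sent no LLGR capability, otherwise a
    dict mapping each listed AFI/SAFI to its F bit (hypothetical model).
    """
    if not gr_capability_present or llgr_f_bits is None:
        return True   # the GR or LLGR capability is missing
    if family not in llgr_f_bits:
        return True   # the AFI/SAFI is missing from the LLGR capability
    return not llgr_f_bits[family]  # F bit = 0 for the AFI/SAFI
```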

5.2.3.2.2. Receiving Routes with LLGR_STALE Community

When a router running SR OS Release 15.0.R4 or later receives a BGP route of any AFI/SAFI, with the LLGR_STALE community, the decision process considers the route less preferred than any valid, non-stale LLGR route for that NLRI. This logic applies even if the router is not configured as long-lived. If a route with an LLGR_STALE community is selected as the best path, then it is advertised to peers according to the configuration of the advertise-stale-to-all-neighbors command; if this command is absent (or the long-lived context is absent), then the route is advertised only to peers that advertised the LLGR capability.

5.2.4. BGP Session Security

5.2.4.1. TCP MD5 Authentication

The operation of a network can be compromised if an unauthorized system is able to form or hijack a BGP session and inject control packets by falsely representing itself as a valid neighbor. This risk can be mitigated by enabling TCP MD5 authentication on one or more of the sessions. When TCP MD5 authentication is enabled on a session every TCP segment exchanged with the peer includes a TCP option (19) containing a 16-byte MD5 digest of the segment (more specifically the TCP/IP pseudo-header, TCP header and TCP data). The MD5 digest is generated and validated using an authentication key that must be known to both sides. If the received digest value is different from the locally computed one then the TCP segment is dropped, thereby protecting the router from spoofed TCP segments.
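
The digest computation can be sketched for the IPv4 case as follows. The input order follows RFC 2385 (pseudo-header, fixed TCP header with the checksum zeroed and options excluded, segment data, then the key); the function name and parameters are assumptions for illustration, and this is not how a router implements the check in its TCP stack.

```python
import hashlib
import socket
import struct

def tcp_md5_digest(src_ip, dst_ip, tcp_header, payload, key):
    """Return the 16-byte MD5 digest for one IPv4 TCP segment.

    tcp_header: the full TCP header bytes, including any options.
    """
    # Pseudo-header: source, destination, zero-padded protocol, segment length.
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, socket.IPPROTO_TCP,
                            len(tcp_header) + len(payload)))
    # Fixed 20-byte header only (no options), with the checksum set to zero.
    base = bytearray(tcp_header[:20])
    base[16:18] = b"\x00\x00"
    return hashlib.md5(pseudo + bytes(base) + payload + key).digest()
```

A receiver that holds the same key recomputes the digest over the received segment and drops the segment if the two values differ.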

5.2.4.2. TTL Security Mechanism

The TTL security mechanism (GTSM) relies on a simple concept to protect BGP infrastructure from spoofed IP packets. It recognizes the fact that the vast majority of EBGP sessions are established between directly-connected routers and therefore the IP TTL values in packets belonging to these sessions should have predictable values. If an incoming packet does not have the expected IP TTL value it is possible that it is coming from an unauthorized and potentially harmful source.

TTL security is enabled using the ttl-security command. This command requires a minimum TTL value to be specified. When TTL security is enabled on a BGP session the IP TTL values in packets that are supposedly coming from the peer are compared (in hardware) to the configured minimum value and if there is a discrepancy the packet is discarded and a log is generated. TTL security is used most often on single-hop EBGP sessions but it can be used on multihop EBGP and IBGP sessions as well.

To enable TTL security on a single-hop EBGP session, configure ttl-security and multihop to a value of 255. To enable TTL security on a multihop EBGP session, configure ttl-security and multihop to match the expected TTL of (255 - hop count). The TTL value for both EBGP peers must be manually configured to the same value, as there is no TTL negotiation.
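
The receive-side check can be sketched as below. The function names are hypothetical, and two assumptions are made: the hop count means the number of routers that decrement the TTL between the peers, and a packet is discarded when its TTL is lower than the configured minimum.

```python
def expected_ttl(hop_count):
    # Packets are assumed to be sent with TTL 255; each router on the
    # path decrements the TTL by one.
    return 255 - hop_count

def passes_ttl_security(received_ttl, min_ttl):
    # Assumption: a TTL below the configured minimum is the discrepancy
    # that causes the packet to be discarded.
    return received_ttl >= min_ttl
```

A spoofed packet that crosses more routers than the legitimate peer arrives with a lower TTL and fails the check.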

Note:

IP packets sent to an IBGP peer are originated with an IP TTL value of 64. IP packets to an EBGP peer are originated with an IP TTL value of 1, except if multihop is configured; in that case, the TTL value is taken from the multihop command.

5.2.5. BGP Address Family Support for Different Session Types

When the base router has a neighbor identified by an IPv4 address, and therefore the transport of the BGP session uses IPv4 TCP, all MP-BGP address families available in SR OS are supported by that session.

When the base router has a neighbor identified by an IPv6 address, and therefore the transport of the BGP session uses IPv6 TCP, the following MP-BGP address families are supported:

  1. ipv4
  2. ipv6
  3. mcast-ipv4
  4. mcast-ipv6
  5. vpn-ipv4
  6. vpn-ipv6
  7. evpn
  8. flow-ipv6
  9. label-ipv4
  10. label-ipv6
  11. bgp-ls

When a VPRN has a neighbor identified by an IPv4 address, and therefore the transport is IPv4 TCP, the following MP-BGP address families are supported:

  1. ipv4
  2. ipv6
  3. mcast-ipv4
  4. mcast-ipv6
  5. flow-ipv4
  6. flow-ipv6
  7. label-ipv4

When a VPRN has a neighbor identified by an IPv6 address, and therefore the transport is IPv6 TCP, the following MP-BGP address families are supported:

  1. ipv4
  2. ipv6
  3. mcast-ipv4
  4. mcast-ipv6
  5. flow-ipv6

5.2.6. BGP Groups

In SR OS, every neighbor (and hence BGP session) is configured under a group. A group is a CLI construct that saves configuration effort when multiple peers have a similar configuration; in this situation the common configuration commands can be configured once at the group level and need not be repeated for every neighbor. A single BGP instance can support many groups and each group can support many peers. Most SR OS commands that are available at the neighbor level are also available at the group level.

5.3. BGP Design Concepts

BGP assumes that all routers within an Autonomous System can reach destinations external to the Autonomous System using efficient, loop-free intra-AS forwarding paths. This generally requires that all the routers within the AS have a consistent view of the best path to every external destination. This is especially true when each BGP router in the AS makes its own forwarding decisions based on its own BGP routing table. The basic BGP specification does not store any intra-AS path information in the AS Path attribute so basic BGP has no way to detect routing loops within an AS that arise from inconsistent best path selections.

There are three solutions for dealing with the issues outlined above.

  1. Create a full mesh of IBGP sessions within the AS as shown in Figure 18. This ensures routing consistency but does not scale well because the number of sessions grows quadratically, as n(n-1)/2 for n BGP routers in the AS.
  2. Use BGP route reflectors in the AS. Route reflection is described in the section titled Route Reflection. BGP route reflectors allow for routing consistency with only a partial mesh of IBGP sessions within the AS.
  3. Create a confederation of autonomous systems. BGP confederations are described in the section titled BGP Confederations.
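
To make the scaling concern concrete, the sketch below counts IBGP sessions for a full mesh and, for comparison, for a single route reflector serving every other router as a client. The function names are hypothetical and the single-reflector layout is an assumption chosen to keep the comparison simple.

```python
def full_mesh_sessions(routers):
    # Every pair of routers needs one IBGP session: n * (n - 1) / 2.
    return routers * (routers - 1) // 2

def single_reflector_sessions(routers):
    # Assumption: one route reflector, all other routers are its clients,
    # so each client needs only one session to the reflector.
    return routers - 1
```

For 100 routers this is 4950 sessions in a full mesh versus 99 sessions with a single reflector.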

Figure 18:  Fully Meshed BGP Configuration 

5.3.1. Route Reflection

In a standard BGP configuration a BGP route learned from one IBGP peer is not re-advertised to another IBGP peer. This rule exists because of the assumption of a full IBGP mesh within the AS. As discussed in the previous section a full IBGP mesh imposes certain scaling challenges. BGP route reflection eliminates the need for a full IBGP mesh by allowing routers configured as route reflectors to re-advertise routes from one IBGP peer to another IBGP peer.

A route reflector provides route reflection service to IBGP peers called clients. Other IBGP peers of the RR are called non-clients. An RR and its client peers form a cluster. A large AS can be sub-divided into multiple clusters, each identified by a unique 32-bit cluster ID. Each cluster contains at least one route reflector which is responsible for redistributing routes to its clients. The clients within a cluster do not need to maintain a full IBGP mesh between each other; they only require IBGP sessions to the route reflector(s) in their cluster. (If the clients within a cluster are fully meshed consider using the disable-client-reflect functionality.) The non-clients in an AS must be fully meshed with each other.

Figure 19 depicts the same network as Figure 18 but with route reflectors deployed to eliminate the IBGP mesh between SR-B, SR-C, and SR-D. SR-A, configured as the route reflector, is responsible for reflecting routes to its clients SR-B, SR-C, and SR-D. SR-E and SR-F are non-clients of the route reflector. As a result, a full mesh of IBGP sessions must be maintained between SR-A, SR-E and SR-F.

Figure 19:  BGP Configuration with Route Reflectors 

A router becomes a route reflector whenever it has one or more client IBGP sessions. A client IBGP session is created with the cluster command, which also indicates the cluster ID of the client. Typical practice is to use the router ID as the cluster ID, but this is not necessary.

Basic route reflection operation (without Add-Path configured) can be summarized as follows:

  1. If the best and valid path for an NLRI is learned from a client and disable-client-reflect is not configured then advertise that route to all clients, non-clients and EBGP peers (as allowed by policy). If the client that advertised the best and valid path is a neighbor to which the split-horizon command (at the bgp, group or neighbor level) applies then the route is not advertised back to the sending client. In the route that is reflected to clients and non-clients:
    1. The route reflector adds an ORIGINATOR_ID attribute if it did not already exist; the ORIGINATOR_ID indicates the BGP identifier (router ID) of the client that originated the route.
    2. The route reflector prepends the cluster ID of the client that advertised the route and then the cluster ID of the client receiving the route (if applicable) to the CLUSTER_LIST attribute, creating the attribute if it did not previously exist.
  2. If the best and valid path for an NLRI is learned from a client and disable-client-reflect is configured then advertise that route to all clients in other clusters, non-clients and EBGP peers (as allowed by policy). In the route that is reflected to clients in other clusters and non-clients:
    1. The route reflector adds an ORIGINATOR_ID attribute if it did not already exist; the ORIGINATOR_ID indicates the BGP identifier (router ID) of the client that originated the route.
    2. The route reflector prepends the cluster ID of the client that advertised the route and then the cluster ID of the client receiving the route (if applicable) to the CLUSTER_LIST attribute, creating the attribute if it did not previously exist.
  3. If the best and valid path for an NLRI is learned from a non-client then advertise that route to all clients and EBGP peers (as allowed by policy). In the route that is reflected to clients:
    1. The route reflector adds an ORIGINATOR_ID attribute if it did not already exist; the ORIGINATOR_ID indicates the BGP identifier (router ID) of the non-client that originated the route.
    2. The route reflector prepends the cluster ID of the client receiving the route to the CLUSTER_LIST attribute, creating the attribute if it did not previously exist.
  4. If the best and valid path for an NLRI is learned from an EBGP peer then advertise that route to all clients, non-clients and other EBGP peers (as allowed by policy). The ORIGINATOR_ID and CLUSTER_LIST attributes are not added to the route.
  5. If the best and valid path for an NLRI is locally originated (by the RR) — i.e. it was learned through means other than BGP — then advertise that route to all clients, non-clients and EBGP peers (as allowed by policy). The ORIGINATOR_ID and CLUSTER_LIST attributes are not added to the route.
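
The advertisement rules above can be condensed into a lookup from the source of the best path to the classes of peers that receive it. This sketch ignores policy, split-horizon, and attribute handling; the function name and the string labels for peer classes are hypothetical.

```python
def reflect_targets(learned_from, disable_client_reflect=False):
    """Return the peer classes that receive the best path.

    learned_from: "client", "non-client", "ebgp", or "local".
    """
    if learned_from == "client":
        # With disable-client-reflect, only clients in other clusters
        # receive routes learned from a client.
        clients = ("clients-in-other-clusters" if disable_client_reflect
                   else "clients")
        return {clients, "non-clients", "ebgp"}
    if learned_from == "non-client":
        # Non-client routes are not reflected to other non-clients.
        return {"clients", "ebgp"}
    if learned_from in ("ebgp", "local"):
        return {"clients", "non-clients", "ebgp"}
    raise ValueError(f"unknown route source: {learned_from}")
```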

The ORIGINATOR_ID and CLUSTER_LIST attributes allow BGP to detect the looping of a route within the AS. If any router receives a BGP route with an ORIGINATOR_ID attribute containing its own BGP identifier, the route is considered invalid. In addition, if a route reflector receives a BGP route with a CLUSTER_LIST attribute containing a locally configured cluster ID, the route is considered invalid. Invalid routes are not installed in the route table and not advertised to other BGP peers.
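
The two loop checks can be sketched as a single validity test. The function and parameter names are hypothetical; a router that is not a route reflector is modeled by passing an empty set of local cluster IDs.

```python
def passes_reflection_loop_checks(router_id, local_cluster_ids,
                                  originator_id=None, cluster_list=()):
    # Check 1 (any router): the route carries our own BGP identifier.
    if originator_id is not None and originator_id == router_id:
        return False
    # Check 2 (route reflectors): a locally configured cluster ID already
    # appears in the CLUSTER_LIST.
    if any(cid in local_cluster_ids for cid in cluster_list):
        return False
    return True
```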

5.3.2. BGP Confederations

BGP confederations are another alternative for avoiding a full mesh of BGP sessions inside an Autonomous System. A BGP confederation is a group of Autonomous Systems managed by a single technical administration that appear as a single AS to BGP routers outside the confederation; the single externally visible AS is called the confederation ID. Each AS in the group is called a member AS and the ASN of each member AS is visible only within the confederation. For this reason, member ASNs are often private ASNs.

Within a confederation, EBGP-type sessions can be set up between BGP routers in different member ASes. These confederation-EBGP sessions avoid the need for a full mesh between routers in different member ASes. Within each member AS the BGP routers must be fully meshed with IBGP sessions or route reflectors must be used to ensure routing consistency.

In SR OS, a confederation EBGP session is formed when the ASN of the peer is different from the local ASN and the peer ASN appears as a member AS in the confederation command. The confederation command specifies the confederation ID and up to 15 member ASes that are part of the confederation.

When a route is advertised to a confederation-EBGP peer the advertising router prepends its local ASN, which is its member ASN, to a confederation-specific sub-element in the AS_PATH that is created if it does not already exist. The extensions to the AS_PATH are used for loop detection but they do not influence best path selection (that is, they do not increase the AS Path length used in the BGP decision process). The MED, NEXT_HOP and LOCAL_PREF attributes in the received route are propagated unchanged by default. The ORIGINATOR_ID and CLUSTER_LIST attributes are not included in routes to confed-EBGP peers.

When a route is advertised to an EBGP peer outside the confederation the advertising router removes all member AS elements from the AS_PATH and prepends its confederation ID rather than its local/member ASN.
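
The AS_PATH handling for the two cases can be sketched as follows. The AS_PATH is modeled as a list of (segment type, ASN list) tuples using the AS_CONFED_SEQUENCE segment type from RFC 5065; the function names are hypothetical, and AS_SET and AS_CONFED_SET handling is simplified for the example.

```python
def to_confed_ebgp(as_path, member_asn):
    # Prepend the member ASN to the leading confederation segment,
    # creating that segment if it does not already exist.
    path = [(seg, list(asns)) for seg, asns in as_path]
    if path and path[0][0] == "AS_CONFED_SEQUENCE":
        path[0][1].insert(0, member_asn)
    else:
        path.insert(0, ("AS_CONFED_SEQUENCE", [member_asn]))
    return path

def to_external_ebgp(as_path, confed_id):
    # Remove all member AS elements, then prepend the confederation ID.
    path = [(seg, list(asns)) for seg, asns in as_path
            if not seg.startswith("AS_CONFED")]
    if path and path[0][0] == "AS_SEQUENCE":
        path[0][1].insert(0, confed_id)
    else:
        path.insert(0, ("AS_SEQUENCE", [confed_id]))
    return path

def decision_length(as_path):
    # Confederation segments do not count toward the AS Path length.
    # Simplification: only AS_SEQUENCE segments are counted here.
    return sum(len(asns) for seg, asns in as_path if seg == "AS_SEQUENCE")
```

A route that crosses two member ASes and then leaves the confederation therefore shows only the confederation ID to the external peer.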

5.4. BGP Messages

BGP protocol operation relies on the exchange of BGP messages between peers. 7450 ESS, 7750 SR, 7950 XRS, and most other routers, support the following message types: Open Message, Update Message, Keepalive Message, Notification Message, and Route Refresh Message.

The minimum BGP message length is 19 bytes and the maximum is 4096 bytes. BGP messages appear as a stream of bytes to the underlying TCP transport layer, and so there is no direct association between a BGP message and a TCP segment. One TCP segment can include parts of one or more BGP messages. Immediately after session setup, the initial value for the maximum TCP segment size that can be sent toward a specific peer is the minimum of the following:

  1. the MSS option value in the TCP SYN received from the peer, when the connection was established
  2. the IP MTU of the initial outgoing interface used to route packets to the peer, minus 40 bytes (IPv4) or minus 60 bytes (IPv6)
  3. the TCP MSS configuration value in the BGP configuration of the peer (if there is such a configuration and it specifies a value other than ip-stack)

As time elapses, the maximum sending segment size can fall below the initial value if path MTU discovery (PMTUD) is active on the session. PMTUD lowers the segment size when ICMP unreachable or packet-too-big messages are received. These messages indicate that a link along the path could not forward the unfragmentable packet because of its IP MTU; this IP MTU minus 40 bytes (IPv4) or minus 60 bytes (IPv6) becomes the new maximum segment size value.
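
The segment size rules above amount to simple arithmetic, sketched below. The function and parameter names are hypothetical, and the use of min() in the PMTUD step is an assumption that reflects the statement that PMTUD only lowers the value.

```python
from typing import Optional


def overhead(ipv6: bool) -> int:
    """Bytes subtracted from the IP MTU: 40 for IPv4, 60 for IPv6."""
    return 60 if ipv6 else 40


def initial_send_mss(peer_syn_mss: int, ip_mtu: int, ipv6: bool,
                     configured_mss: Optional[int] = None) -> int:
    """Minimum of the peer's MSS option, the interface limit and any configured value."""
    candidates = [peer_syn_mss, ip_mtu - overhead(ipv6)]
    if configured_mss is not None:  # None models 'ip-stack' or no configuration
        candidates.append(configured_mss)
    return min(candidates)


def mss_after_pmtud(current_mss: int, reported_mtu: int, ipv6: bool) -> int:
    """Apply an ICMP-reported MTU; the segment size is never raised (assumption)."""
    return min(current_mss, reported_mtu - overhead(ipv6))
```

With a 1500-byte IP MTU and a peer MSS option of 1460, an IPv4 session starts at 1460 bytes; if a later ICMP message reports a 1400-byte MTU, the limit drops to 1360 bytes.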

5.4.1. Open Message

After a TCP connection is established between two BGP routers the first message sent by each one is an Open message. If the received Open message is acceptable a Keepalive message confirming the Open is sent back. (See BGP Session States for more details.) An Open message contains the following information:

  1. Version — The current BGP version number is 4.
  2. Autonomous System number — The 2-byte AS of the sending router. If the sending router has an ASN greater than 65535, this field has the special value 23456 (AS_TRANS). On a 7450, 7750, or 7950 router, the ASN in the Open message is based on the confederation ID (if the peer is external to the confederation), the global AS (configured using the autonomous-system command) or a session-level override of the global AS called the local AS (configured using the local-as command). More details about the use of local-AS are described in the section titled Using Local AS for ASN Migration. More details about 4-byte AS numbers are described in the section titled 4-Octet Autonomous System Numbers.
  3. Hold Time — The proposed maximum time BGP waits between successive messages (Keepalive and/or Update) from its peer before closing the connection. The actual hold time is the minimum of the configured hold-time for the session and the hold-time in the peer's Open message. If this minimum is below a configured threshold (min hold-time), the connection attempt is rejected.
    Note:

    Changes to the configured hold-time trigger a session reset.

  4. BGP Identifier — The router ID of the BGP speaker. In Open messages, the BGP Identifier comes from the router-id configured under bgp; if that is not configured, then the router-id configured under config>router (or config>service>vprn) is used and if that too is not configured then the system interface IPv4 address is used.
    Note:

    A change of the router ID in the config>router>bgp context causes all BGP sessions to be reset immediately while other changes resulting in a new BGP identifier only take effect after BGP is shutdown and re-enabled.

  5. Optional Parameters — A list of optional parameters, each encoded as a TLV. The only optional parameter that has been defined is the capabilities optional parameter. The capabilities optional parameter supports the process of BGP capability advertisement (see BGP Advertisement for more information). When a BGP router receives an Open message with an unsupported optional parameter type it terminates the session. Unless disable-capability-negotiation is configured, the router always sends a capabilities optional parameter in its Open message.
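
As an illustration of the fields listed above, the following sketch packs an Open message using the standard BGP wire layout (a 19-byte header made of a 16-byte marker, a 2-byte length and a 1-byte type, followed by the Open fields). The function name and argument names are hypothetical, and the optional parameters are passed in as pre-encoded bytes.

```python
import ipaddress
import struct

AS_TRANS = 23456          # placed in the 2-byte AS field when the ASN exceeds 65535
BGP_MARKER = b"\xff" * 16
OPEN_TYPE = 1
HEADER_LENGTH = 19        # also the minimum BGP message length


def build_open(asn: int, hold_time: int, router_id: str,
               optional_parameters: bytes = b"") -> bytes:
    """Encode an Open message: version, AS, hold time, BGP identifier, parameters."""
    my_as = asn if asn <= 0xFFFF else AS_TRANS
    body = struct.pack("!BHH", 4, my_as, hold_time)           # version is always 4
    body += ipaddress.IPv4Address(router_id).packed           # BGP identifier
    body += struct.pack("!B", len(optional_parameters)) + optional_parameters
    header = BGP_MARKER + struct.pack("!HB", HEADER_LENGTH + len(body), OPEN_TYPE)
    return header + body
```

An Open message with no optional parameters is 29 bytes long. A router whose ASN is 4200000001 sends 23456 in the 2-byte AS field and relies on the 4-octet AS number capability to convey the real value.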

5.4.1.1. Changing the Autonomous System Number

If the AS number is changed at the router level (config>router) the new AS number is not used until the BGP instance is restarted either by administratively disabling and enabling the BGP instance or by rebooting the system with the new configuration.

On the other hand, if the AS number is changed in the BGP configuration (config>router>bgp), the effects are as follows:

  1. A change of the local-AS at the global level causes the BGP instance to restart with the new local AS number.
  2. A change of the local-AS at the group level causes BGP to re-establish sessions with all peers in the group using the new local AS number.
  3. A change of the local-AS at the neighbor level causes BGP to re-establish the session with the new local AS number.

5.4.1.2. Changing a Confederation Number

Changing the confederation value on an active BGP instance does not restart the protocol. The change takes effect when the BGP protocol is (re)initialized.

5.4.1.3. BGP Advertisement

BGP advertisement allows a BGP router to indicate to a peer, using the capabilities optional parameter, the features that it supports so that they can coordinate and use only the features that both support. Each capability in the optional parameter is TLV-encoded with a unique type code. SR OS supports the following capability codes:

  1. Multi-protocol BGP (code 1)
  2. Route refresh (code 2)
  3. Outbound route filtering (code 3)
  4. Graceful restart (code 64)
  5. 4-octet AS number (code 65)
  6. Add-path (code 69)
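
The TLV encoding of capabilities can be sketched as follows, assuming the standard layout in which each capability is a 1-byte code, a 1-byte length and a value, and all capabilities are wrapped in an optional parameter of type 2. The helper names are hypothetical, and packing every capability into a single optional parameter is one valid choice rather than a statement about what SR OS sends.

```python
import struct
from typing import Iterable, List

OPT_PARAM_CAPABILITIES = 2
CAP_MULTIPROTOCOL = 1
CAP_ROUTE_REFRESH = 2
CAP_FOUR_OCTET_AS = 65


def capability(code: int, value: bytes = b"") -> bytes:
    """Encode one capability as code, length, value."""
    return struct.pack("!BB", code, len(value)) + value


def capabilities_parameter(capabilities: Iterable[bytes]) -> bytes:
    """Wrap encoded capabilities in a single capabilities optional parameter."""
    payload = b"".join(capabilities)
    return struct.pack("!BB", OPT_PARAM_CAPABILITIES, len(payload)) + payload


def common_capabilities(local: Iterable[int], remote: Iterable[int]) -> List[int]:
    """Codes advertised by both peers, which are the only features that may be used."""
    return sorted(set(local) & set(remote))


example = capabilities_parameter([
    capability(CAP_MULTIPROTOCOL, struct.pack("!HBB", 1, 0, 1)),  # AFI 1, SAFI 1
    capability(CAP_ROUTE_REFRESH),
    capability(CAP_FOUR_OCTET_AS, struct.pack("!I", 4200000001)),
])
```

If one router advertises codes 1, 2, 65 and 69 and its peer advertises 1, 2 and 64, only multi-protocol BGP and route refresh are usable on that session.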

5.4.2. Update Message

Update messages are used to advertise and withdraw routes. An Update message provides the following information:

  1. Withdrawn routes length — The length of the withdrawn routes field that is described next (may be 0).
  2. Withdrawn routes — IPv4 prefixes that are no longer considered reachable by the advertising router.
  3. Total path attribute length — The length of the path attributes field that is discussed next (may be 0).
  4. Path attributes — The path attributes presented in variable length TLV format. The path attributes apply to all the NLRI in the UPDATE message.
  5. Network layer reachability information (NLRI) — IPv4 prefixes that are considered reachable by the advertising router.

For fast routing convergence, as many NLRI as possible are packed into a single Update message. This requires identifying all the routes that share the same path attribute values.
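
The grouping step can be pictured with a short sketch. It assumes each route is a (prefix, attributes) pair whose attributes are hashable, uses a hypothetical function name, and ignores the 4096-byte message limit that would force a large group to be split across several Update messages.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple


def group_routes_for_update(
    routes: Iterable[Tuple[str, Hashable]]
) -> List[Tuple[Hashable, List[str]]]:
    """Group prefixes that share identical path attributes into one Update each."""
    groups: Dict[Hashable, List[str]] = defaultdict(list)
    for prefix, attributes in routes:
        groups[attributes].append(prefix)
    return list(groups.items())


routes = [
    ("192.0.2.0/24", (("origin", "igp"), ("as_path", (64500,)))),
    ("198.51.100.0/24", (("origin", "igp"), ("as_path", (64500,)))),
    ("203.0.113.0/24", (("origin", "incomplete"), ("as_path", (64500, 64496)))),
]
updates = group_routes_for_update(routes)
```

Here the three routes need only two Update messages, because the first two prefixes carry the same attribute values.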

5.4.3. Keepalive Message

After a session is established, each router sends periodic Keepalive messages to its peer to test that the peer is still alive and reachable. If no Keepalive or Update message is received from the peer for the negotiated hold-time duration, the session is terminated. The period between one Keepalive message and the next is 1/3 of the negotiated hold-time duration or the value configured with the keepalive command, whichever is less. If the active hold-time or keepalive interval is zero, Keepalive messages are not sent. The default hold-time is 90 seconds and the default keepalive interval is 30 seconds.
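
The timer rules can be summarized in a small sketch. The function names are hypothetical; the rejection check follows the min hold-time behavior described for the Open message, and None is used here to represent a rejected session or a disabled Keepalive timer.

```python
from typing import Optional


def negotiate_hold_time(configured: int, peer: int, min_hold_time: int = 0) -> Optional[int]:
    """Return the active hold time in seconds, or None if the attempt is rejected."""
    hold_time = min(configured, peer)
    if hold_time < min_hold_time:
        return None
    return hold_time


def keepalive_interval(hold_time: int, configured_keepalive: int = 30) -> Optional[float]:
    """Return the Keepalive period in seconds, or None if Keepalives are not sent."""
    if hold_time == 0 or configured_keepalive == 0:
        return None
    return min(hold_time / 3, configured_keepalive)
```

With the defaults (hold-time 90, keepalive 30) a Keepalive is sent every 30 seconds. If the peer proposes a hold time of 30 seconds, the negotiated hold time becomes 30 and the Keepalive period drops to 10 seconds.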

A peer (reachability) failure is often detected through faster mechanisms than hold-timer expiry, as explained in Detecting BGP Session Failures.

5.4.4. Notification Message

When a non-recoverable error related to a particular session occurs a Notification message is sent to the peer and the session is terminated (or restarted if GR is enabled for this scenario; see BGP Graceful Restart for more details). The Notification message provides the following information:

  1. Error code — Indicates the type of error: message header error, Open message error, Update message error, Hold timer expired, Finite State Machine error, or Cease.
  2. Error subcode — Provides more specific information about the error. The meaning of the subcode is specific to the error code.

5.4.4.1. UPDATE Message Error Handling

The approach to handling Update message errors has evolved in the past couple of years. The original BGP protocol specification called for all UPDATE message errors to be handled the same way — send a NOTIFICATION to the peer and immediately close the BGP session. This error handling approach was motivated by the goal to ensure protocol “correctness” above all else. But it ignored several important points:

  1. Not all UPDATE message errors truly have the same severity. If the NLRI cannot be extracted and parsed from an UPDATE message then this is indeed a “critical” error. But other errors, such as incorrect attribute flag settings, missing mandatory path attributes, incorrect next-hop length/format, and so on, can be considered “non-critical” and handled differently.
  2. Session resets are extremely costly in terms of their impact on the stability and performance of the network. For many types of UPDATE message errors, a session reset does not solve the problem because the root cause remains (for example, software error, hardware error or misconfiguration). If a session reset is absolutely necessary, then the operator should have some control over the timing.
  3. Some degree of protocol “incorrectness” is tolerable for a short period of time as long as the network operator is fully aware of the issue. In this context, “incorrectness” typically means a BGP RIB inconsistency between routers in the same AS. Such inconsistency has become less and less of an issue over time as edge-to-edge tunneling of IP traffic (for example, BGP shortcuts, IP VPN) has reduced the number of deployments where IP traffic is forwarded hop-by-hop.

In recognition of these points and the general trend towards more flexibility in BGP error handling, SR OS supports a BGP configuration option called update-fault-tolerance that allows the operator to decide whether the router should apply new or legacy error handling procedures to UPDATE message errors. If update-fault-tolerance is configured, then non-critical errors as described above are handled using the “treat-as-withdraw” or “attribute-discard” approaches to error handling; these approaches do not cause a session reset. If update-fault-tolerance is not configured then legacy procedures continue to apply and all errors (critical and non-critical) trigger a session reset.

If the update-fault-tolerance command was previously configured and a non-critical error was already triggered, the BGP session is still reset when the operator configures no update-fault-tolerance.

5.4.5. Route Refresh Message

A BGP router can send a Route Refresh message to its peer only if both have advertised the route refresh capability (code 2). The Route Refresh message is a request for the peer to re-send all or some of its routes associated with a particular pair of AFI/SAFI values. AFI/SAFI values are the same ones used in the MP-BGP capability (see the section titled Multi-Protocol BGP Attributes).

7450, 7750, and 7950 routers only send Route Refresh messages for AFI/SAFI associated with VPN routes that carry Route Target extended communities, such as VPN-IPv4, VPN-IPv6, L2-VPN, MVPN-IPv4 and MVPN-IPv6 routes. By default, routes of these types are discarded if, at the time they are received, there is no VPN that imports any of the route targets they carry. If at a later time a VPN is added or reconfigured (in terms of the route targets that it imports), a Route Refresh message is sent to all relevant peers, so that previously discarded routes can be relearned.

Note:

Route Refresh messages are not sent for VPN-IPv4 and VPN-IPv6 routes if mp-bgp-keep is configured; in this situation received VPN-IP routes are kept in the RIB-IN regardless of whether or not they match a VRF import policy.

5.5. BGP Path Attributes

Path attributes are fundamental to BGP. A BGP route for a particular NLRI is distinguished from other BGP routes for the same NLRI by its set of path attributes. Each path attribute describes some property of the path and is encoded as a TLV in the Path Attributes field of the Update message. The type field of the TLV identifies the path attribute and the value field carries data specific to the attribute type. There are 4 different categories of path attributes:

  1. Well-known mandatory. These attributes must be recognized by all BGP routers and must be present in every Update message that advertises reachable NLRI towards a certain type of neighbor (EBGP or IBGP).
  2. Well-known discretionary. These attributes must be recognized by all BGP routers but are not required in every Update message.
  3. Optional transitive. These attributes are allowed to be unrecognized by some BGP routers. If a BGP router does not recognize one of these attributes it accepts it, passes it on to other BGP peers, and sets the Partial bit to 1 in the attribute flags byte.
  4. Optional non-transitive. These attributes are allowed to be unrecognized by some BGP routers. If a BGP router does not recognize one of these attributes it is quietly ignored and not passed on to other BGP peers.
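
The handling of unrecognized attributes follows from the attribute flags byte. The sketch below assumes the standard bit assignments (Optional 0x80, Transitive 0x40, Partial 0x20, Extended Length 0x10); the function names are hypothetical.

```python
from typing import Optional

FLAG_OPTIONAL = 0x80
FLAG_TRANSITIVE = 0x40
FLAG_PARTIAL = 0x20
FLAG_EXTENDED_LENGTH = 0x10


def category(flags: int) -> str:
    """Classify an attribute from its flags byte."""
    if not flags & FLAG_OPTIONAL:
        # Mandatory versus discretionary is defined per attribute, not by a flag bit.
        return "well-known"
    if flags & FLAG_TRANSITIVE:
        return "optional transitive"
    return "optional non-transitive"


def forward_unrecognized(flags: int) -> Optional[int]:
    """Return the flags to use when passing on an unrecognized attribute, or None to drop it."""
    if not flags & FLAG_OPTIONAL:
        raise ValueError("a well-known attribute must be recognized")
    if flags & FLAG_TRANSITIVE:
        return flags | FLAG_PARTIAL  # pass it on with the Partial bit set
    return None  # optional non-transitive: quietly ignored
```

An unrecognized attribute received with flags 0xC0 is therefore passed on with flags 0xE0, while one received with flags 0x80 is not propagated.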

SR OS supports the following path attributes, which are described in detail in upcoming sections:

  1. ORIGIN (well-known mandatory)
  2. AS_PATH (well-known mandatory)
  3. NEXT_HOP (well-known, required only in Update messages with IPv4 prefixes in the NLRI field)
  4. MED (optional non-transitive)
  5. LOCAL_PREF (well-known, required only in Update messages sent to IBGP peers)
  6. ATOMIC_AGGR (well-known discretionary)
  7. AGGREGATOR (optional transitive)
  8. COMMUNITY (optional transitive)
  9. ORIGINATOR_ID (optional non-transitive)
  10. CLUSTER_LIST (optional non-transitive)
  11. MP_REACH_NLRI (optional non-transitive)
  12. MP_UNREACH_NLRI (optional non-transitive)
  13. EXT_COMMUNITY (optional transitive)
  14. AS4_PATH (optional transitive)
  15. AS4_AGGREGATOR (optional transitive)
  16. CONNECTOR (optional transitive)
  17. PMSI_TUNNEL (optional transitive)
  18. TUNNEL_ENCAPSULATION (optional transitive)
  19. AIGP (optional non-transitive)
  20. BGP-LS (optional non-transitive)
  21. LARGE_COMMUNITY (optional transitive)

5.5.1. Origin

The ORIGIN path attribute indicates the origin of the path information. There are three supported values:

  1. IGP (0)
  2. EGP (1)
  3. Incomplete (2)

When a router originates a VPN-IP prefix (from a non-BGP route), it sets the value of the Origin attribute to IGP. When a router originates a BGP route for an IP prefix by exporting a non-BGP route from the routing table, it sets the value of the Origin attribute to Incomplete. Route policies (BGP import and export) can be used to change the Origin value.

5.5.2. AS Path

The AS_PATH attribute provides the list of Autonomous Systems through which the routing information has passed. The AS_PATH attribute is composed of segments. There can be up to 4 different types of segments in an AS_PATH attribute: AS_SET, AS_SEQUENCE, AS_CONFED_SET and AS_CONFED_SEQUENCE. The AS_SET and AS_CONFED_SET segment types result from route aggregation. AS_CONFED_SEQUENCE contains an ordered list of member AS through which the route has passed inside a confederation. AS_SEQUENCE contains an ordered list of AS (including confederation IDs) through which the route has passed on its way to the local AS/confederation.

The AS numbers in the AS_PATH attribute are all 2-byte values or all 4-byte values (if the 4-octet ASN capability was announced by both peers).

A BGP router always prepends its AS number to the AS_PATH attribute when advertising a route to an EBGP peer. The specific details for a 7450, 7750, or 7950 router are described below.

  1. When a route is advertised to an EBGP peer and the advertising router is not part of a confederation:
    1. The global AS (configured using the autonomous-system command) is prepended to the AS_PATH if local-as is not configured.
    2. The local AS followed by the global AS are prepended to the AS_PATH if local-as is configured.
    3. Only the local AS is prepended to the AS_PATH if local-as no-prepend-global-as is configured.
    4. Some or all private and reserved AS numbers (64512 to 65535 and 4200000000 to 4294967295 inclusive) can be removed or replaced from the AS_PATH if the remove-private command is configured.
  2. When a route is advertised to an EBGP peer outside a confederation:
    1. The confederation ID is prepended to the AS_PATH if local-as is not configured.
    2. The local AS followed by the confederation ID are prepended to the AS_PATH if local-as is configured (the no-prepend-global-as option has no effect in this scenario).
    3. Member AS numbers are removed from the AS_PATH as described in the section titled BGP Confederations.
    4. Some or all private and reserved AS numbers (64512 to 65535 and 4200000000 to 4294967295 inclusive) can be removed or replaced from the AS_PATH if the remove-private command is configured.
  3. When a route is advertised to a confederation-EBGP peer:
    1. If the route came from an EBGP peer and local-as was configured on this session (without the private option) this local AS number is prepended to the AS_PATH in a regular AS_SEQUENCE segment.
    2. The global AS (configured using the autonomous-system command) is prepended, as a member AS, to the AS_PATH if local-as is not configured.
    3. The local AS followed by the global AS are prepended, as member AS, to the AS_PATH if local-as is configured.
    4. Only the local AS is prepended, as a member AS, to the AS_PATH if local-as no-prepend-global-as is configured.
    5. Some or all private and reserved AS numbers (64512 to 65535 and 4200000000 to 4294967295 inclusive) can be removed or replaced from the AS_PATH if the remove-private command is configured.
  4. When a route is advertised to an IBGP peer:
    1. No information is added to the AS_PATH if the route is locally originated or if it came from an IBGP peer.
    2. The local AS number is prepended to the AS_PATH if the route came from an EBGP peer and local-as is configured without the private option.
    3. The local AS number is prepended, as a member AS, to the AS_PATH if the route came from a confederation-EBGP peer and local-as is configured without the private option.
    4. Some or all private and reserved AS numbers (64512 to 65535 and 4200000000 to 4294967295 inclusive) can be removed or replaced from the AS_PATH if the remove-private command is configured.
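
The first case in the list above (an EBGP peer, no confederation) can be sketched as follows. The AS_PATH is simplified to a flat list of ASNs with the most recently added AS first; the function name is hypothetical, and the private option of local-as and the remove-private command are not modeled.

```python
from typing import List, Optional


def prepend_for_ebgp(as_path: List[int], global_as: int,
                     local_as: Optional[int] = None,
                     no_prepend_global_as: bool = False) -> List[int]:
    """Return the AS_PATH advertised to an EBGP peer outside any confederation."""
    if local_as is None:
        prefix = [global_as]                 # local-as not configured
    elif no_prepend_global_as:
        prefix = [local_as]                  # local-as ... no-prepend-global-as
    else:
        prefix = [local_as, global_as]       # local AS followed by global AS
    return prefix + as_path
```

For a route received with AS_PATH 64496 on a router with global AS 64500, the peer sees 64500 64496 by default, 64501 64500 64496 with local-as 64501, and 64501 64496 with local-as 64501 no-prepend-global-as.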

BGP import policies can be used to prepend an AS number multiple times to the AS_PATH, whether the route is received from an IBGP, EBGP or confederation EBGP peer. The AS path prepend action is also supported in BGP export policies applied to these types of peers, regardless of whether the route is locally originated or not. AS path prepending in export policies occurs before the global and/or local ASes (if applicable) are added to the AS_PATH.

When a BGP router receives a route containing one of its own Autonomous System numbers (local or global or confederation ID) in the AS_PATH the route is normally considered invalid for reason of an AS path loop. However, SR OS provides a loop-detect command that allows this check to be bypassed. If it is known that advertising certain routes to an EBGP peer results in an AS path loop condition and yet there is no loop (assured by other mechanisms, such as the Site of Origin (SOO) extended community), then as-override can be configured on the advertising router instead of disabling loop detection on the receiving router. The as-override command replaces all occurrences of the peer AS in the AS_PATH with the advertising router’s local AS.

5.5.2.1. AS Override

The AS Override feature can be used in VPRN scenarios where a customer is running BGP as the PE-CE protocol and some or all of the CE locations are in the same Autonomous System (AS). With normal BGP, two sites in the same AS would not be able to reach each other directly since there is an apparent loop in the AS Path.

When as-override is configured on a PE-CE EBGP session the PE rewrites the customer ASN in the AS Path with the VPRN AS number as the route is advertised to the CE.
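
The effect of as-override on loop detection can be illustrated with a short sketch. It uses a flat list of ASNs and hypothetical function names, and it shows only the substitution and the receiving CE's loop check, not the PE's normal prepending.

```python
from typing import Iterable, List


def as_override(as_path: List[int], peer_as: int, local_as: int) -> List[int]:
    """Replace every occurrence of the peer AS with the advertising router's AS."""
    return [local_as if asn == peer_as else asn for asn in as_path]


def has_as_loop(as_path: Iterable[int], own_asns: Iterable[int]) -> bool:
    """True if the receiving router finds one of its own ASNs in the AS_PATH."""
    own = set(own_asns)
    return any(asn in own for asn in as_path)
```

If two CE sites both use AS 65010 and the VPRN uses AS 64500, a route originated at one site has AS_PATH 65010 and would be rejected by the other site. After the override the path contains only 64500, so the second site accepts it.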

5.5.2.2. Using Local AS for ASN Migration

The description in the previous section does not fully explain the reasons for using local-as. This BGP feature facilitates the process of changing the ASN of all the routers in a network from one number to another. This may be necessary if one network operator merges with or acquires another network operator and the two BGP networks must be consolidated into one Autonomous System.

For example, suppose the operator of the ASN 64500 network merges with the operator of the ASN 64501 network and the new merged entity decides to renumber ASN 64501 routers as ASN 64500 routers, so that the entire network can be managed as one Autonomous System. The migration can be carried out using the following sequence of steps:

  1. Change the global AS of the route reflectors that used to be part of ASN 64501 to the new value 64500.
  2. Change the global AS of the RR clients that used to be part of ASN 64501 to the new value 64500.
  3. Configure local-as 64501 private no-prepend-global-as on every EBGP session of each RR client migrated in step 2.

This migration procedure has several advantages. First, customers, settlement-free peers and transit providers of the previous ASN 64501 network still perceive that they are peering with ASN 64501 and can delay switching to ASN 64500 until the time is convenient for them. Second, the AS path lengths of the routes exchanged with the EBGP peers are unchanged from before so that best path selections are preserved.

5.5.2.3. 4-Octet Autonomous System Numbers

When BGP was developed, it was assumed that 16-bit (2-octet) ASNs would be sufficient for global Internet routing. In theory a 16-bit ASN allows for 65536 unique autonomous systems but some of the values are reserved (0 and 64000-65535). Of the assignable space less than 10% remains available. When a new AS number is needed it is now simpler to obtain a 4-octet AS number. 4-octet AS numbers have been available since 2006. A 32-bit (4-octet) ASN allows for 4,294,967,296 unique values (some of which are, again, reserved).

When 4-octet AS numbers became available it was recognized that not all routers would immediately support the ability to parse 4-octet AS numbers in BGP messages so two optional transitive attributes called AS4_PATH and AS4_AGGREGATOR were introduced to allow a gradual migration.

A BGP router that supports 4-octet AS numbers advertises this capability in its OPEN message; the capability information includes the AS number of the sending BGP router, encoded using 4 bytes (recall the ASN field in the OPEN message is limited to 2 bytes). By default, OPEN messages sent by 7450, 7750, or 7950 routers always include the 4-octet ASN capability, but this can be changed using the disable-4byte-asn command.

If a BGP router and its peer have both announced the 4-octet ASN capability, then the AS numbers in the AS_PATH and AGGREGATOR attributes are always encoded as 4-byte values in the UPDATE messages they send to each other. These UPDATE messages should not contain the AS4_PATH and AS4_AGGREGATOR path attributes.

If one of the routers involved in a session announces the 4-octet ASN capability and the other one does not, then the AS numbers in the AS_PATH and AGGREGATOR attributes are encoded as 2-byte values in the UPDATE messages they send to each other.

When a 7450, 7750, or 7950 router advertises a route to a peer that did not announce the 4-octet ASN capability:

  1. If there are any AS numbers in the AS_PATH attribute that cannot be represented using 2 bytes (because they have a value greater than 65535) they are substituted with the special value 23456 (AS_TRANS) and an AS4_PATH attribute is added to the route if it is not already present. The AS4_PATH attribute has the same encoding as the AS_PATH attribute that would be sent to a 4-octet ASN capable router (that is, each AS number is encoded using 4 octets) but it does not carry segments of type AS_CONFED_SEQUENCE or AS_CONFED_SET.
  2. If the AS number in the AGGREGATOR attribute cannot be represented using 2 bytes (because its value is greater than 65535) it is substituted with the special value 23456 and an AS4_AGGREGATOR attribute is added to the route if it is not already present. The AS4_AGGREGATOR is the same as the AGGREGATOR attribute that would be sent to a 4-octet ASN capable router (that is, the AS number is encoded using 4 octets).

When a 7450, 7750, or 7950 router receives a route with an AS4_PATH attribute it attempts to reconstruct the full AS path from the AS4_PATH and AS_PATH attributes, regardless of whether disable-4byte-asn is configured or not. The reconstructed path is the AS path displayed in BGP show commands. If the length of the received AS4_PATH is N and the length of the received AS_PATH is N+t, then the reconstructed AS path contains the t leading elements of the AS_PATH followed by all the elements in the AS4_PATH.
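
The AS_TRANS substitution and the reconstruction rule can be checked with a small round-trip sketch. It simplifies the AS path to a flat list of ASNs (no segment types, so the removal of confederation segments from AS4_PATH is not modeled) and uses hypothetical function names. Falling back to the AS_PATH when the AS4_PATH is longer is an assumption, since that case is not described above.

```python
from typing import List, Optional, Tuple

AS_TRANS = 23456


def downgrade(as_path: List[int]) -> Tuple[List[int], Optional[List[int]]]:
    """Return (2-byte AS_PATH, AS4_PATH or None) for a peer without the 4-octet ASN capability."""
    if all(asn <= 0xFFFF for asn in as_path):
        return list(as_path), None
    two_byte = [asn if asn <= 0xFFFF else AS_TRANS for asn in as_path]
    return two_byte, list(as_path)


def reconstruct(as_path: List[int], as4_path: List[int]) -> List[int]:
    """Take the t leading elements of AS_PATH, then all of AS4_PATH."""
    t = len(as_path) - len(as4_path)
    if t < 0:
        return list(as_path)  # assumption: an AS4_PATH longer than AS_PATH is not used
    return list(as_path[:t]) + list(as4_path)


# A 4-octet-capable router sends to a 2-byte-only peer in AS 64510 ...
sent_path, sent_as4 = downgrade([64500, 4200000001, 64496])
# ... which prepends its own ASN and passes AS4_PATH along unchanged.
received_path = [64510] + sent_path
full_path = reconstruct(received_path, sent_as4)
```

In this example the 2-byte-only router sees 64500 23456 64496, and the next 4-octet-capable router recovers 64510 64500 4200000001 64496.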

5.5.3. Next-Hop

The NEXT_HOP attribute indicates the IPv4 address of the BGP router that is the next-hop to reach the IPv4 prefixes in the NLRI field. If the Update message is advertising routes other than IPv4 unicast routes the next-hop of these routes is encoded in the MP_REACH_NLRI attribute; see Multi-Protocol BGP Attributes for more details.

The rules for deciding what next-hop address types to accept in a received BGP route and what next-hop address types to advertise as a BGP next-hop are address family dependent. The following sections summarize the key details.

5.5.3.1. Unlabeled IPv4 Unicast Routes

By default, IPv4 routes are advertised with IPv4 next-hops but on IPv6-TCP transport sessions they can be advertised with IPv6 next-hops if the advertise-ipv6-next-hops command (with the IPv4 option) applies to the session. In order to receive IPv4 routes with IPv6 next-hop addresses from a peer, the extended-nh-encoding command (with the IPv4 option) must be applied to the session. This advertises the corresponding RFC 5549, Advertising IPv4 Network Layer Reachability Information with an IPv6 Next Hop, capability to the peer.

Whenever next-hop-self applies to an IPv4 route, the next hop is set as follows:

  1. If the peer towards which the routes are being advertised is an IPv4 transport peer (in other words, the neighbor address is IPv4), the BGP next-hop is the IPv4 local address used to set up the session.
  2. If the peer towards which the routes are being advertised is an IPv6 transport peer (in other words, the neighbor address is IPv6), and the advertise-ipv6-next-hops command (with the ipv4 option) applies to the session, and the peer towards which the routes are being advertised opened the session by announcing an Extended NH Encoding capability for AFI = 1, SAFI = 1 and next-hop AFI = 2, then the BGP next-hop is the IPv6 local address used to set up the session. Otherwise, for all other cases, the BGP next-hop is the IPv4 address of the system interface. If the system interface does not have an IPv4 address, the route is not advertised unless an export policy sets a valid IPv4 next-hop.
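
The next-hop-self selection for IPv4 routes can be expressed as a small decision function. This is a sketch with hypothetical names; a return value of None stands for the case where the route is not advertised unless an export policy sets a valid IPv4 next-hop.

```python
from typing import Optional


def next_hop_self_ipv4(peer_is_ipv6: bool, local_address: str,
                       system_ipv4: Optional[str],
                       advertise_ipv6_next_hops: bool = False,
                       peer_extended_nh: bool = False) -> Optional[str]:
    """Choose the BGP next-hop when next-hop-self applies to an IPv4 route."""
    if not peer_is_ipv6:
        return local_address        # IPv4 local address of the session
    if advertise_ipv6_next_hops and peer_extended_nh:
        return local_address        # IPv6 local address of the session
    return system_ipv4              # IPv4 address of the system interface, if any
```

On an IPv6 transport session where the peer did not announce the Extended NH Encoding capability, the function returns the system interface address rather than the IPv6 local address.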

When an IPv4 BGP route is advertised to an EBGP peer, next-hop-self always applies except if the third-party-nexthop command is applied. Configuring third-party-nexthop allows an IPv4 route received from one EBGP peer to be advertised to another EBGP peer that is in the same IP subnet with an unchanged BGP next-hop.

When an IPv4 BGP route is re-advertised to an IBGP or confederation EBGP peer, the advertising router does not modify the BGP next-hop unless one of the following applies:

  1. The BGP next-hop-self command is applied to the IBGP or confederation EBGP peer. This causes next-hop-self to be applied to all IPv4 routes advertised to the peer, regardless of the peer type from which they were received (IBGP, confed-EBGP, or EBGP).
  2. IPv4 routes are matched and accepted by a route policy entry, and this entry has a next-hop-self action. This applies next-hop-self as described above to only those routes matched by the policy entry.
  3. IPv4 routes are matched and accepted by a route policy entry, and this entry has a next-hop ip-address action. This changes the BGP next-hop of only the matched routes to the ip-address, if the ip-address is an IPv4 address or if the ip-address is an IPv6 address and the necessary conditions exist (that is, the advertise-ipv6-next-hops command is configured appropriately and the peer opened the session with the correct RFC 5549 capability).

When an IPv4 BGP route is locally originated and advertised to an IBGP or confederation EBGP peer, the BGP next-hop is, by default, copied from the next hop of the route that was imported into BGP, with certain exceptions (for example, black-hole next-hop). When a static route with indirect next hop is re-advertised as a BGP route, the BGP next-hop is a copy of the indirect address. However, with route table import policies, BGP can be instructed to take the resolved next hop of the static route as the BGP next-hop address.

5.5.3.2. Unlabeled IPv6 Unicast Routes

SR OS routers never send or receive IPv6 routes with 32-bit IPv4 next-hop addresses.

When an IPv6 BGP route is advertised to an EBGP peer, next-hop-self always applies (except if the third-party-nexthop command is applied, as described in the following note). Next-hop-self results in one of the following outcomes:

  1. If the EBGP session uses IPv4 transport, the BGP next-hop encodes the local-address used for setup of the session as an IPv4-compatible IPv6 address (all zeros in the first 96 bits followed by the 32 bit IPv4 local-address).
  2. If the EBGP session uses IPv6 transport, the BGP next-hop is the local-address used to setup the session and this cannot be overridden, even by BGP export policy.
Note:

Configuring third-party-nexthop allows an IPv6 route received from one EBGP peer to be advertised to another EBGP peer that is in the same IP subnet with an unchanged BGP next-hop.
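
The IPv4-compatible IPv6 encoding used on IPv4 transport sessions (96 zero bits followed by the 32-bit IPv4 local-address) can be reproduced with the Python standard library. The function name is hypothetical.

```python
import ipaddress


def ipv4_compatible_ipv6(local_address: str) -> ipaddress.IPv6Address:
    """Place the 32-bit IPv4 address in the low bits of an otherwise all-zero IPv6 address."""
    ipv4 = ipaddress.IPv4Address(local_address)
    return ipaddress.IPv6Address(int(ipv4))


next_hop = ipv4_compatible_ipv6("192.0.2.1")
```

The first 12 bytes of next_hop.packed are zero and the last 4 bytes are 192, 0, 2 and 1.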

When an IPv6 BGP route is re-advertised to an IBGP or confederation-EBGP peer, the advertising router does not modify the BGP next-hop by default; however, this can be changed as follows:

  1. If the BGP next-hop-self command is applied to the IBGP peer or confederation-EBGP peer, then this changes the BGP next-hop to the local-address used to setup the session (if the transport to the peer is IPv6) or to an IPv4-compatible IPv6 address derived from the IPv4 local-address used to setup the session (if the transport to the peer is IPv4). This command applies to all routes advertised to the peer, regardless of the peer type from which they were received (IBGP, confed-EBGP, or EBGP).
  2. If IPv6 routes are matched and accepted by an export policy applied to an IBGP or confederation-EBGP session, and the matching policy entry has a next-hop-self action, this changes the BGP next-hop of only the matched routes to the local-address used to setup the session (if the transport to the peer is IPv6) or to an IPv4-compatible IPv6 address derived from the IPv4 local-address used to setup the session (if the transport to the peer is IPv4).
  3. If IPv6 routes are matched and accepted by an export policy applied to an IBGP or confederation-EBGP session, and the matching policy entry has a next-hop <ip-address> action, this changes the BGP next-hop of only the matched routes to <ip-address>, but only if <ip-address> is an IPv6 address. If <ip-address> is an IPv4 address the matched routes are treated as though they were rejected by the policy entry.

When an IPv6 BGP route is locally originated and advertised to an IBGP or confederation-EBGP peer, the BGP next-hop is, by default, copied from the next-hop of the route that was imported into BGP, with certain exceptions (for example, black-hole next-hop). When a static route with indirect next-hop is re-advertised as a BGP route, the BGP next-hop is a copy of the indirect address. However, with route-table-import policies, BGP can be instructed to take the resolved next-hop of the static route as the BGP next-hop address.

5.5.3.3. VPN-IPv4 Routes

SR OS routers can always send and receive VPN-IPv4 routes with IPv4 next-hops. They can also be configured (using the extended-nh-encoding command) to receive VPN-IPv4 routes with IPv6 next-hop addresses from selected BGP peers by signaling the corresponding Extended NH Encoding BGP capability to those peers during session setup. If the capability is not advertised to a peer, then such routes are not accepted from that peer. Also, if the SR OS router does not receive an Extended NH Encoding capability advertisement for [NLRI AFI=1, NLRI SAFI=128, next-hop AFI=2] from a peer, then it does not advertise VPN-IPv4 routes with IPv6 next-hops to that peer.
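
The capability triple quoted above can be made concrete. The following is a minimal sketch, assuming the wire layout defined in RFC 8950 (capability code 5, each entry packed as three 2-octet fields: NLRI AFI, NLRI SAFI, next-hop AFI). The function and constant names are illustrative only and are not SR OS commands or APIs.

```python
import struct

# Capability code for Extended Next Hop Encoding (RFC 8950, formerly RFC 5549).
EXTENDED_NH_ENCODING = 5

# (NLRI AFI, NLRI SAFI, next-hop AFI) for VPN-IPv4 routes with IPv6 next-hops.
VPN_IPV4_WITH_IPV6_NH = (1, 128, 2)


def encode_extended_nh(triples):
    """Pack the triples into one capability TLV (code, length, value)."""
    value = b"".join(struct.pack("!HHH", *t) for t in triples)
    return struct.pack("!BB", EXTENDED_NH_ENCODING, len(value)) + value


def decode_extended_nh(tlv):
    """Return the list of triples carried in a capability TLV."""
    code, length = struct.unpack("!BB", tlv[:2])
    if code != EXTENDED_NH_ENCODING or length % 6 or len(tlv) != 2 + length:
        raise ValueError("not a well-formed Extended NH Encoding capability")
    return [struct.unpack("!HHH", tlv[i:i + 6]) for i in range(2, 2 + length, 6)]


def may_send_ipv6_next_hop(peer_triples, triple):
    """Hypothetical helper: IPv6 next-hops are sent only if the peer signaled the triple."""
    return triple in peer_triples


tlv = encode_extended_nh([VPN_IPV4_WITH_IPV6_NH])
peer = decode_extended_nh(tlv)
print(tlv.hex(), may_send_ipv6_next_hop(peer, VPN_IPV4_WITH_IPV6_NH))
```

The same check applies to label-IPv4 routes by substituting SAFI 4 for SAFI 128 in the triple.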

When a VPN-IPv4 BGP route is advertised to an EBGP peer, next-hop-self applies if enable-inter-as-vpn is configured; otherwise there is no change to the next-hop. Next-hop-self results in one of the following outcomes:

  1. If the EBGP session uses IPv4 transport, then the BGP next-hop is taken from the value of the local-address used to setup the session.
  2. If the EBGP peer opened an IPv6-transport session by advertising an extended NH encoding capability with (NLRI AFI=1, NLRI SAFI=128, next-hop AFI=2) AND, in the configuration of the local router, the session is associated with an advertise-ipv6-next-hops vpn-ipv4 command, then the BGP next-hop is set to the value of the IPv6 local-address used for setup of the session. Otherwise, the BGP next-hop is set to the IPv4 address of the system interface.
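
The two outcomes above reduce to a small decision function. This is only a sketch of the rule as worded in this section; the Session fields and the function name are hypothetical and do not correspond to SR OS data structures.

```python
from dataclasses import dataclass


@dataclass
class Session:
    transport: str                  # "ipv4" or "ipv6"
    local_address: str              # local-address used to set up the session
    peer_extended_nh: bool          # peer advertised (AFI=1, SAFI=128, next-hop AFI=2)
    advertise_ipv6_next_hops: bool  # advertise-ipv6-next-hops vpn-ipv4 configured


def next_hop_self_vpn_ipv4(session: Session, system_ipv4: str) -> str:
    """Return the BGP next-hop chosen by next-hop-self for a VPN-IPv4 route."""
    if session.transport == "ipv4":
        # Outcome 1: IPv4 transport uses the session local-address.
        return session.local_address
    if session.peer_extended_nh and session.advertise_ipv6_next_hops:
        # Outcome 2: both conditions hold, so the IPv6 local-address is used.
        return session.local_address
    # Otherwise fall back to the IPv4 address of the system interface.
    return system_ipv4


v6_ok = Session("ipv6", "2001:db8::1", True, True)
v6_no_cap = Session("ipv6", "2001:db8::1", False, True)
print(next_hop_self_vpn_ipv4(v6_ok, "192.0.2.1"))
print(next_hop_self_vpn_ipv4(v6_no_cap, "192.0.2.1"))
```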

If enable-inter-as-vpn is configured and a VPN-IPv4 BGP route is received from an EBGP peer and re-advertised to an IBGP or confederation-EBGP peer, next-hop-self applies. (If enable-inter-as-vpn is not configured, the next-hop may be changed with the next-hop-self command; however, this is strongly discouraged because it can result in a change of the next-hop without a change in the VPN label.) Next-hop-self results in one of the following outcomes:

  1. If the IBGP or confederation-EBGP session uses IPv4 transport, then the BGP next-hop is taken from the value of the local-address used to setup the session.
  2. If the IBGP or confederation-EBGP peer opened an IPv6-transport session by advertising an extended NH encoding capability with (NLRI AFI=1, NLRI SAFI=128, next-hop AFI=2) AND, in the configuration of the local router, the session is associated with an advertise-ipv6-next-hops vpn-ipv4 command then the BGP next-hop is set to the value of the IPv6 local-address used for setup of the session. Otherwise, the BGP next-hop is set to the IPv4 address of the system interface.

When a VPN-IPv4 BGP route is reflected from one IBGP peer to another IBGP peer, the RR does not modify the next-hop by default. However, if the next-hop-self command is applied to the IBGP peer receiving the route and enable-rr-vpn-forwarding is configured then this combination of commands has one of the following outcomes:

  1. If the IBGP session receiving the reflected route uses IPv4 transport, then the BGP next-hop is taken from the value of the local-address used to setup the session.
  2. If the IBGP session receiving the reflected route opened an IPv6-transport session by advertising an extended NH encoding capability with (NLRI AFI=1, NLRI SAFI=128, next-hop AFI=2) AND, in the configuration of the local router, the session is associated with an advertise-ipv6-next-hops vpn-ipv4 command then the BGP next-hop is set to the value of the IPv6 local-address used for setup of the session. Otherwise, the BGP next-hop is set to the IPv4 address of the system interface.

When a VPN-IPv4 BGP route is reflected from one IBGP peer to another IBGP peer and enable-rr-vpn-forwarding is configured and the VPN-IPv4 route is matched and accepted by an export policy entry with a next-hop <ip-address> action, this changes the BGP next-hop of the matched routes to <ip-address>, except if <ip-address> is an IPv6 address and the receiving IBGP peer did not advertise an extended NH encoding capability with (NLRI AFI=1, NLRI SAFI=128, next-hop AFI=2) or, in the configuration of the local router, the session is not associated with an advertise-ipv6-next-hops vpn-ipv4 command. In this case, the route is treated as though it was rejected by the policy entry.

5.5.3.4. VPN-IPv6 Routes

SR OS routers never send or receive VPN-IPv6 routes with 32-bit IPv4 next-hop addresses.

When a VPN-IPv6 BGP route is advertised to an EBGP peer, next-hop-self applies if enable-inter-as-vpn is configured. Otherwise, there is no change to the next-hop. Next-hop-self results in one of the following outcomes:

  1. If the EBGP session uses IPv4 transport, then the BGP next-hop is set to the IPv4 local-address used to setup the session but encoded as an IPv4-mapped IPv6 address (for example, with the IPv4 address in the least significant 32 bits of a ::FFFF/96 prefix).
  2. If the EBGP session uses IPv6 transport and it is associated with an advertise-ipv6-next-hops vpn-ipv6 command, then the BGP next-hop is set to the value of the IPv6 local-address used for setup of the session. Otherwise, the BGP next-hop is set to the IPv4 address of the system interface encoded as an IPv4-mapped IPv6 address (for example, with the IPv4 address in the least significant 32 bits of a ::FFFF/96 prefix).

If enable-inter-as-vpn is configured and a VPN-IPv6 BGP route is received from an EBGP peer and re-advertised to an IBGP or confederation-EBGP peer, next-hop-self applies. (If enable-inter-as-vpn is not configured, the next-hop may be changed with the next-hop-self command; however, this is strongly discouraged because it can result in a change of the next-hop without a change in the VPN label.) Next-hop-self results in one of the following outcomes:

  1. If the IBGP or confederation-EBGP session uses IPv4 transport, then the BGP next-hop is set to the IPv4 local-address used to setup the session but encoded as an IPv4-mapped IPv6 address (for example, with the IPv4 address in the least significant 32 bits of a ::FFFF/96 prefix).
  2. If the IBGP or confederation-EBGP session uses IPv6 transport and it is associated with an advertise-ipv6-next-hops vpn-ipv6 command, then the BGP next-hop is set to the value of the IPv6 local-address used for setup of the session. Otherwise, the BGP next-hop is set to the IPv4 address of the system interface encoded as an IPv4-mapped IPv6 address (for example, with the IPv4 address in the least significant 32 bits of a ::FFFF/96 prefix).

When a VPN-IPv6 BGP route is reflected from one IBGP peer to another IBGP peer, the RR does not modify the next-hop by default. However, if the next-hop-self command is applied to the IBGP peer receiving the route and enable-rr-vpn-forwarding is configured, then this combination of commands has one of the following outcomes:

  1. If the IBGP session receiving the reflected route uses IPv4 transport then the BGP next-hop is set to the IPv4 local-address used to setup the session but encoded as an IPv4-mapped IPv6 address (for example, with the IPv4 address in the least significant 32 bits of a ::FFFF/96 prefix).
  2. If the IBGP session receiving the reflected route uses IPv6 transport and it is associated with an advertise-ipv6-next-hops vpn-ipv6 command, then the BGP next-hop is set to the value of the IPv6 local-address used for setup of the session. Otherwise, the BGP next-hop is set to the IPv4 address of the system interface encoded as an IPv4-mapped IPv6 address (for example, with the IPv4 address in the least significant 32 bits of a ::FFFF/96 prefix).

When a VPN-IPv6 BGP route is reflected from one IBGP peer to another IBGP peer and enable-rr-vpn-forwarding is configured and the VPN-IPv6 route is matched and accepted by an export policy entry with a next-hop <ip-address> action, this changes the BGP next-hop of the matched routes to <ip-address> if it is specified as a 128-bit IPv6 address or to an IPv4-mapped IPv6 address encoding <ip-address> if it is specified as a 32-bit IPv4 address.
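
The IPv4-mapped IPv6 encoding used throughout this section places the 32-bit IPv4 address in the low-order bits of the ::FFFF:0:0/96 prefix. The following sketch uses the Python standard-library ipaddress module to show the encoding and how a policy next-hop <ip-address> value would be normalized for a VPN-IPv6 route; the function names are illustrative assumptions.

```python
import ipaddress


def to_ipv4_mapped(ipv4: str) -> ipaddress.IPv6Address:
    """Encode an IPv4 address as an IPv4-mapped IPv6 address (::FFFF:a.b.c.d)."""
    v4 = ipaddress.IPv4Address(ipv4)
    return ipaddress.IPv6Address((0xFFFF << 32) | int(v4))


def vpn_ipv6_policy_next_hop(ip_address: str) -> ipaddress.IPv6Address:
    """Normalize a policy next-hop for a VPN-IPv6 route.

    A 128-bit IPv6 address is used as-is; a 32-bit IPv4 address is
    carried as an IPv4-mapped IPv6 address.
    """
    addr = ipaddress.ip_address(ip_address)
    if addr.version == 6:
        return addr
    return to_ipv4_mapped(str(addr))


mapped = to_ipv4_mapped("192.0.2.1")
print(mapped.exploded)      # full 128-bit form
print(mapped.ipv4_mapped)   # embedded IPv4 address recovered
```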

5.5.3.5. Label-IPv4 Routes

SR OS routers can always send and receive label-IPv4 routes with IPv4 next-hops. They can also be configured (using the extended-nh-encoding command) to receive label-IPv4 routes with IPv6 next-hop addresses from selected BGP peers by signaling the corresponding Extended NH Encoding BGP capability to those peers during session setup. If the capability is not advertised to a peer, then such routes are not accepted from that peer. Also, if the SR OS router does not receive an Extended NH Encoding capability advertisement for [NLRI AFI=1, NLRI SAFI=4, next-hop AFI=2] from a peer, then it does not advertise label-IPv4 routes with IPv6 next-hops to that peer.

When a label-IPv4 BGP route is advertised to an EBGP peer, next-hop-self applies unless the EBGP session has next-hop-unchanged enabled for the label-ipv4 address family. Next-hop-self results in one of the following outcomes:

  1. If the EBGP session uses IPv4 transport, then the BGP next-hop is taken from the value of the local-address used to setup the session.
  2. If the EBGP peer opened an IPv6-transport session by advertising an extended NH encoding capability with (NLRI AFI=1, NLRI SAFI=4, next-hop AFI=2) AND, in the configuration of the local router, the session is associated with an advertise-ipv6-next-hops label-ipv4 command, then the BGP next-hop is set to the value of the IPv6 local-address used for setup of the session. Otherwise, the BGP next-hop is set to the IPv4 address of the system interface.

When a label-IPv4 BGP route is received from an EBGP peer and re-advertised to an IBGP or confederation-EBGP peer, next-hop-self applies unless the IBGP or confederation-EBGP session has next-hop-unchanged enabled for the label-ipv4 address family. Next-hop-self results in one of the following outcomes:

  1. If the IBGP or confederation-EBGP session uses IPv4 transport, then the BGP next-hop is taken from the value of the local-address used to setup the session.
  2. If the IBGP or confederation-EBGP peer opened an IPv6-transport session by advertising an extended NH encoding capability with (NLRI AFI=1, NLRI SAFI=4, next-hop AFI=2) AND, in the configuration of the local router, the session is associated with an advertise-ipv6-next-hops label-ipv4 command, then the BGP next-hop is set to the value of the IPv6 local-address used for setup of the session. Otherwise, the BGP next-hop is set to the IPv4 address of the system interface.

When a label-IPv4 BGP route is reflected from one IBGP peer to another IBGP peer, the RR does not modify the next-hop by default. However, if the next-hop-self command is applied to the IBGP peer receiving the route, then this results in one of the following outcomes:

  1. If the IBGP session receiving the reflected route uses IPv4 transport, then the BGP next-hop is taken from the value of the local-address used to setup the session.
  2. If the IBGP session receiving the reflected route opened an IPv6-transport session by advertising an extended NH encoding capability with (NLRI AFI=1, NLRI SAFI=4, next-hop AFI=2) AND, in the configuration of the local router, the session is associated with an advertise-ipv6-next-hops label-ipv4 command then the BGP next-hop is set to the value of the IPv6 local-address used for setup of the session. Otherwise, the BGP next-hop is set to the IPv4 address of the system interface.

When a label-IPv4 BGP route is reflected from one IBGP peer to another IBGP peer and the label-IPv4 route is matched and accepted by an export policy entry with a next-hop <ip-address> action, this changes the BGP next-hop of the matched routes to <ip-address>, except if <ip-address> is an IPv6 address and the receiving IBGP peer did not advertise an extended NH encoding capability with (NLRI AFI=1, NLRI SAFI=4, next-hop AFI=2) or, in the configuration of the local router, the session is not associated with an advertise-ipv6-next-hops label-ipv4 command. In this case, the route is treated as though it was rejected by the policy entry.

5.5.3.6. Label-IPv6 Routes

SR OS routers never send or receive label-IPv6 routes with 32-bit IPv4 next-hop addresses.

When a label-IPv6 BGP route is advertised to an EBGP peer, next-hop-self applies unless the EBGP session has next-hop-unchanged enabled for the label-ipv6 address family. Next-hop-self results in one of the following outcomes:

  1. If the EBGP session uses IPv4 transport, then the BGP next-hop is taken from the value of the local address used to set up the session and encoded as an IPv4-mapped IPv6 address.
  2. If the EBGP peer opened an IPv6 transport session and it is associated with an advertise-ipv6-next-hops label-ipv6 command then the BGP next-hop is set to the value of the IPv6 local address used for set up of the session. Otherwise, the BGP next-hop is set to the IPv4 address of the system interface encoded as an IPv4-mapped IPv6 address.

When a label-IPv6 BGP route is received from an EBGP peer and re-advertised to an IBGP or confederation-EBGP peer, next-hop-self applies unless the IBGP or confederation-EBGP session has next-hop-unchanged enabled for the label-ipv6 address family. Next-hop-self results in one of the following outcomes:

  1. If the IBGP or confederation-EBGP session uses IPv4 transport, then the BGP next-hop is taken from the value of the local address used to setup the session and encoded as an IPv4-mapped IPv6 address.
  2. If the IBGP or confederation-EBGP peer opened an IPv6 transport session and it is associated with an advertise-ipv6-next-hops label-ipv6 command then the BGP next-hop is set to the value of the IPv6 local address used for set up of the session. Otherwise, the BGP next-hop is set to the IPv4 address of the system interface encoded as an IPv4-mapped IPv6 address.

5.5.3.7. Next-Hop Resolution

To use a BGP route for forwarding, a BGP router must know how to reach the BGP next-hop of the route. The process of determining the local interface or tunnel used to reach the BGP next-hop is called next-hop resolution. The BGP next-hop resolution process depends on the type of route (the AFI/SAFI) and various configuration settings.

  1. The following procedures apply to the next-hop resolution of IPv4 and IPv6 (unlabeled) unicast routes:
    1. When an IPv6 route is received with an IPv4-mapped IPv6 address as next-hop, BGP next-hop resolution is based on the embedded IPv4 address and this IPv4 address is matched to IPv4 routes or IPv4 tunnels, depending on configuration. The IPv4-mapped IPv6 address is never interpreted as an IPv6 address to be compared to IPv6 routes or IPv6 tunnels.
    2. BGP routes are eligible to resolve a BGP next-hop only if the use-bgp-routes command is configured. A maximum of four levels of recursion are supported.
    3. If there are multiple eligible routes that match the BGP next-hop, the longest prefix match (LPM) route is selected.
    4. If the LPM route is rejected by the user-configured next-hop-resolution policy or if there are no eligible matching routes, the BGP next-hop is unresolved and all the routes with that next-hop are considered invalid and not advertised to peers.
  2. The following procedures apply to the next-hop resolution of VPN-IPv4 and VPN-IPv6 routes:
    1. When a VPN-IPv4 or VPN-IPv6 route is received with an IPv4-mapped IPv6 address as next-hop, BGP next-hop resolution is based on the embedded IPv4 address and this IPv4 address is matched to IPv4 routes and/or IPv4 tunnels, depending on configuration. The IPv4-mapped IPv6 address is never interpreted as an IPv6 address to be compared to IPv6 routes or IPv6 tunnels.
    2. If the VPN-IP route is imported into a VPRN, the next-hop is resolved to a tunnel based on the auto-bind-tunnel configuration of the importing VPRN. Refer to the 7450 ESS, 7750 SR, 7950 XRS, and VSR Layer 3 Services Guide: IES and VPRN for more information.
    3. If the next-hop entry in the tunnel-table that resolves the VPN-IP route is rejected by the user-configured next-hop-resolution vpn-family-policy, the BGP next hop is unresolved and all the VPN-IP routes with that next hop are considered invalid.
    4. If the VPN-IP route is received from an IBGP or EBGP peer, and the router is a next-hop-self RR or a model-B ASBR, the order of resolution is as follows:
      1. local (interface) route
      2. longest prefix-match static route, excluding default static routes, but only if allow-static is configured
      3. tunnel, based on the transport-tunnel resolution-filter options for family VPN. See Next-Hop Resolution of BGP Labeled Routes to Tunnels for more information
  3. The following procedures apply to the next-hop resolution of EVPN routes:
    1. If an EVPN ES route is imported, the next hop is resolved to any tunnel in the tunnel-table. If there is no entry matching the next hop in the tunnel-table, a route matching the next hop in the route-table also resolves the next hop.
      1. If the resolving route in the tunnel-table or the route-table is rejected by a user-configured next-hop-resolution vpn-family-policy, the ES route's next hop is unresolved.
      2. If the ES route next hop is unresolved, the PE that advertised the route is not considered as candidate for Designated Forwarder (DF) election.
    2. The imported AD per-ES and AD per-EVI routes are always shown as resolved and valid, irrespective of the next-hop resolution or the configuration of a next-hop-resolution vpn-family-policy.
      1. With DF election, the router always considers the advertising PE a valid candidate, even if the AD routes next hops are unresolved.
      2. However, if the AD per-EVI next hop is unresolved, EVPN traffic is not sent to the advertising PE. This is true for EVPN VPWS or multi-homing aliasing or backup procedures.
      3. A matching tunnel-table entry can resolve the next-hop of an AD per-EVI route in an EVPN-MPLS service. A route-table entry (other than a shortcut) can resolve the AD per-EVI route next-hop in an EVPN-VXLAN service.
    3. For any other imported EVPN service route, including IP prefix, IPv6 prefix, MAC/IP, Inclusive Multicast (IMET) and Selective Multicast (SMET) routes, the next hop is resolved as follows:
      1. In EVPN-MPLS services, the next hop is resolved to a tunnel based on the auto-bind-tunnel configuration of the importing service.
      2. In EVPN-VXLAN services, the next hop is resolved to a route in the route-table. That route cannot be a shortcut route.
      3. If the next-hop entry in the tunnel-table or route-table that resolves the EVPN service route is rejected by the user-configured next-hop-resolution vpn-family-policy, the BGP next hop is unresolved and all the EVPN routes with that next hop are considered invalid.
  4. The following procedures apply to the next-hop resolution of label-unicast IPv4 and label-unicast IPv6 routes:
    1. When a BGP-LU route is received with an IPv4-mapped IPv6 address as the next-hop, BGP next-hop resolution is based on the embedded IPv4 address and this IPv4 address is matched to IPv4 routes or IPv4 tunnels, depending on the configuration. The IPv4-mapped IPv6 address is never interpreted as an IPv6 address to be compared to IPv6 routes or IPv6 tunnels.
    2. If the BGP-LU route is received by a control-plane-only RR with the disable-route-table-install and rr-use-route-table commands configured, the order of resolution is as follows:
      1. the local route
      2. the longest prefix-match static route, excluding default static routes
        Use this route to resolve the BGP next-hop if it is a static route with a blackhole next-hop. If it is another type of static route, use it to resolve the next-hop only if allow-static is configured.
      3. the longest prefix-match IGP route
    3. If a label-unicast IPv4 route or label-unicast IPv6 route with a label (other than explicit-null) is received from an IBGP or EBGP peer and the router is not an RR with the rr-use-route-table command configured, the order of resolution is as follows:
      1. the local route
      2. the longest prefix-match static route, excluding default static routes
        Use this route to resolve the BGP next-hop if it is a static route with a blackhole next-hop. If it is another type of static route, use it to resolve the next-hop only if allow-static is configured.
      3. a tunnel, based on the transport-tunnel resolution-filter options for family label-ipv4 or label-ipv6, depending on the situation. See Next-Hop Resolution of BGP Labeled Routes to Tunnels for more information.
    4. If a label-unicast IPv6 route with an IPv6 explicit-null label is received from an IBGP or EBGP peer and the router is not an RR with the rr-use-route-table command configured, the order of resolution is as follows:
      1. the local route
      2. the longest prefix-match static route, excluding default static routes
        Use this route to resolve the BGP next-hop if it is a static route with a blackhole next-hop. If it is another type of static route, use it to resolve the next-hop only if allow-static is configured.
      3. a tunnel, based on the transport-tunnel resolution-filter options for family label-ipv4; see Next-Hop Resolution of BGP Labeled Routes to Tunnels for more information
      4. if configure>router>bgp>next-hop-resolution>labeled-routes>use-bgp-routes>label-ipv6-explicit-null is enabled and if the longest prefix length route that matches the next-hop is a BGP IPv4 unlabeled, BGP IPv6 unlabeled, or other 6PE route with an explicit-null label, then use that route, subject to the following conditions:
        - the resolving route cannot be a leaked route.
        - an unlabeled IPv4 route or IPv6 route is ineligible to resolve the next-hop of a label-unicast IPv6 route if the unlabeled route has any of its own BGP next-hops resolved by an IGP route or a 6over4 route.
        - the label-unicast IPv6 route can be recursively resolved by other label-unicast IPv6 routes with explicit-null so that the final route has up to four levels of recursion.
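
The procedure for unlabeled IPv4 and IPv6 unicast routes (item 1 above) can be sketched as follows. This is a simplified model, not the SR OS implementation: the route table is a plain list of (prefix, protocol) pairs, the policy is a hypothetical callable, and recursion through resolving BGP routes is not modelled.

```python
import ipaddress


def resolve_next_hop(next_hop, routes, use_bgp_routes=False, policy=None):
    """Return the (prefix, protocol) that resolves next_hop, or None if unresolved."""
    addr = ipaddress.ip_address(next_hop)
    # An IPv4-mapped IPv6 next-hop is resolved using the embedded IPv4 address only.
    if addr.version == 6 and addr.ipv4_mapped is not None:
        addr = addr.ipv4_mapped

    best = None
    for prefix, protocol in routes:
        net = ipaddress.ip_network(prefix)
        if net.version != addr.version:
            continue
        # BGP routes are eligible only when use-bgp-routes is configured.
        if protocol == "bgp" and not use_bgp_routes:
            continue
        # Keep the longest prefix match among the eligible routes.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, protocol)

    if best is None:
        return None
    # If the LPM route is rejected by policy, the next-hop is unresolved;
    # there is no fallback to a shorter match.
    if policy is not None and not policy(str(best[0]), best[1]):
        return None
    return (str(best[0]), best[1])


ROUTES = [("10.0.0.0/8", "isis"), ("10.1.0.0/16", "static"), ("10.1.1.0/24", "bgp")]
print(resolve_next_hop("::ffff:10.1.1.1", ROUTES))
print(resolve_next_hop("10.1.1.1", ROUTES, use_bgp_routes=True))
```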

5.5.3.7.1. Next-Hop Resolution of BGP Unlabeled IPv4 Unicast Routes to Tunnels

To enable the next-hop resolution of unlabeled IPv4 routes using tunnels in the tunnel-table of the router, it is necessary to add family ipv4 under the config>router>bgp>next-hop-resolution>shortcut-tunnel context. In the family ipv4 context, there are commands to specify the resolution mode (any, disabled or filter) and the set of tunnel types that are eligible to be used if filter mode (resolution filter) is selected.

If the resolution mode is set to disabled, the next-hops of unlabeled IPv4 routes can only be resolved by route table lookup.

If there are multiple tunnels in tunnel-table that are allowed by resolution any or resolution filter mode and that can resolve the BGP next-hop, the selection of the resolving tunnel comes down to factors such as route color, admin-tag-policy, tunnel-table preference, and LDP FEC prefix length.

If disallow-igp is enabled, then there is no attempt to resolve the IPv4 BGP route using route table lookup if no resolving tunnel can be found in the tunnel-table.

The available tunneling options for IPv4 shortcuts are:

  1. bgp — This refers to IPv4 and IPv6 tunnels created by receiving BGP label-unicast IPv4 routes for /32 IPv4 prefixes and BGP label-unicast IPv6 routes for /128 IPv6 prefixes. The installation of BGP-LU IPv6 tunnels in TTM requires disable-explicit-null to be executed or explicit-null to be configured false in MD-CLI.
  2. ldp — This refers to LDP FEC prefixes imported into the tunnel table. For resolution purposes, BGP selects the LDP FEC that is the longest-prefix-match (LPM) of the BGP next-hop address.
  3. rsvp — This refers to RSVP tunnels in tunnel-table. This option allows BGP to use the best metric RSVP LSP to the address of the BGP next-hop. This address can correspond to the system interface or to another loopback interface of the remote BGP router. In the case of multiple RSVP LSPs with the same lowest metric, BGP selects the LSP with the lowest tunnel-id.
  4. sr-isis — This refers to segment routing tunnels (shortest path) to destinations reachable by the IS-IS protocol. This option allows BGP to use the segment routing tunnel in tunnel-table submitted by the lowest preference IS-IS instance or (in case of a tie) the lowest numbered IS-IS instance.
  5. sr-ospf — This refers to segment routing tunnels (shortest path) to destinations reachable by the OSPF protocol. This option allows BGP to use the segment routing tunnel in tunnel-table submitted by the lowest preference OSPF instance or (in case of a tie) the lowest numbered OSPF instance.
  6. sr-policy — This refers to segment routing policies that are statically configured in the local router or learned via BGP routes (AFI 1/SAFI 73). For BGP to resolve the next-hop of an IPv4 route using an sr-policy the highest numbered color extended community attached to the IPv4 route must match the color of the SR policy and if the CO bits of this color extended community have the value 00 the BGP next-hop of the route must exactly match the endpoint of the SR policy.
  7. sr-te — This refers to traffic engineered (TE) segment routing tunnels. This option allows BGP to use the best metric SR-TE tunnel to the address of the BGP next-hop. In the case of multiple SR-TE tunnels with the same lowest metric, BGP selects the tunnel with the lowest tunnel-id.
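
The tie-break rules stated in the list above (lowest metric then lowest tunnel-id for rsvp and sr-te; lowest preference then lowest instance number for sr-isis and sr-ospf) can be expressed as ordered sort keys. The tuple types below are illustrative assumptions, not tunnel-table structures.

```python
from typing import NamedTuple


class Lsp(NamedTuple):
    name: str
    metric: int
    tunnel_id: int


class SrTunnel(NamedTuple):
    preference: int   # preference of the IGP instance that submitted the tunnel
    instance: int     # IGP instance number


def pick_lsp(lsps):
    """RSVP or SR-TE: best (lowest) metric, then lowest tunnel-id."""
    return min(lsps, key=lambda l: (l.metric, l.tunnel_id))


def pick_sr_igp(tunnels):
    """SR-ISIS or SR-OSPF: lowest preference, then lowest numbered instance."""
    return min(tunnels, key=lambda t: (t.preference, t.instance))


lsps = [Lsp("a", 20, 3), Lsp("b", 10, 7), Lsp("c", 10, 5)]
print(pick_lsp(lsps).name)
print(pick_sr_igp([SrTunnel(11, 0), SrTunnel(10, 2), SrTunnel(10, 1)]))
```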

5.5.3.7.2. Next-Hop Resolution of BGP Unlabeled IPv6 Unicast Routes to Tunnels

To enable the next-hop resolution of unlabeled IPv6 routes using tunnels in the tunnel-table of the router, it is necessary to add family ipv6 under the config>router>bgp>next-hop-resolution>shortcut-tunnel context. If the next-hop of the IPv6 BGP route contains an IPv4-mapped IPv6 address, the shortcut-tunnel configuration applies to the use of IPv4 tunnels and IPv4 routes that match the embedded IPv4 address in the BGP next-hop. If the BGP next-hop is any other IPv6 address the shortcut-tunnel configuration applies to the use of IPv6 tunnels and IPv6 routes that match the full address of the BGP next-hop.

The family ipv6 context has commands to specify the resolution mode (any, disabled or filter) and the set of tunnel types that are eligible to be used if filter mode (resolution filter) is selected.

If the resolution mode is set to disabled, the next-hops of unlabeled IPv6 routes can only be resolved by route table lookup.

If there are multiple tunnels in tunnel-table that are allowed by resolution any or resolution filter mode and that can resolve the BGP next-hop, the selection of the resolving tunnel comes down to factors such as route color, admin-tag-policy, tunnel-table preference, and LDP FEC prefix length.

If disallow-igp is enabled, then there is no attempt to resolve the IPv6 BGP route using route table lookup if no resolving tunnel can be found in the tunnel-table.

The available tunneling options for IPv6 BGP routes with IPv4-mapped IPv6 next-hops are:

  1. bgp — This refers to IPv4 tunnels created by receiving BGP label-unicast IPv4 routes for /32 IPv4 prefixes.
  2. ldp — This refers to /32 and shorter length LDP FEC prefixes imported into the tunnel table. For resolution purposes, BGP selects the LDP FEC that is the longest-prefix-match (LPM) of the BGP next-hop address.
  3. rsvp — This refers to RSVP tunnels in tunnel-table. This option allows BGP to use the best metric RSVP LSP to the address of the BGP next-hop. This address can correspond to the system interface or to another loopback interface of the remote BGP router. In the case of multiple RSVP LSPs with the same lowest metric, BGP selects the LSP with the lowest tunnel-id.
  4. sr-isis — This refers to segment routing tunnels (shortest path) to destinations reachable by the IS-IS protocol. This option allows BGP to use the segment routing tunnel in tunnel-table submitted by the lowest preference IS-IS instance or (in case of a tie) the lowest numbered IS-IS instance.
  5. sr-ospf — This refers to segment routing tunnels (shortest path) to destinations reachable by the OSPF protocol. This option allows BGP to use the segment routing tunnel in tunnel-table submitted by the lowest preference OSPF instance or (in case of a tie) the lowest numbered OSPF instance.
  6. sr-policy — This refers to segment routing policies that are statically configured in the local router or learned via BGP routes (AFI 1/SAFI 73). For BGP to resolve the next-hop of an IPv4 route using an sr-policy the highest numbered color extended community attached to the IPv4 route must match the color of the SR policy and if the CO bits of this color extended community have the value 00 the BGP next-hop of the route must exactly match the endpoint of the SR policy.
  7. sr-te — This refers to traffic engineered (TE) segment routing tunnels. This option allows BGP to use the best metric SR-TE tunnel to the address of the BGP next-hop. In the case of multiple SR-TE tunnels with the same lowest metric, BGP selects the tunnel with the lowest tunnel-id.
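
The sr-policy matching rule in the list above can be sketched as a small predicate. The types and names are hypothetical. The text only states the endpoint requirement for CO bits 00, so treating other CO values as color-only matches is an assumption of this sketch.

```python
import ipaddress
from typing import NamedTuple, Optional, Sequence


class ColorCommunity(NamedTuple):
    color: int
    co_bits: int      # two-bit CO field of the color extended community


class SrPolicy(NamedTuple):
    color: int
    endpoint: str


def match_sr_policy(communities: Sequence[ColorCommunity],
                    next_hop: str,
                    policies: Sequence[SrPolicy]) -> Optional[SrPolicy]:
    """Return the SR policy that can resolve the route's next-hop, if any."""
    if not communities:
        return None
    # Only the highest numbered color attached to the route is considered.
    top = max(communities, key=lambda c: c.color)
    nh = ipaddress.ip_address(next_hop)
    for policy in policies:
        if policy.color != top.color:
            continue
        # With CO bits 00 the BGP next-hop must exactly match the endpoint.
        if top.co_bits == 0b00 and ipaddress.ip_address(policy.endpoint) != nh:
            continue
        return policy   # assumption: other CO values need only the color match
    return None


POLICIES = [SrPolicy(100, "192.0.2.1"), SrPolicy(200, "192.0.2.9")]
route_colors = [ColorCommunity(100, 0b00), ColorCommunity(200, 0b00)]
print(match_sr_policy(route_colors, "192.0.2.9", POLICIES))
```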

5.5.3.7.3. Next-Hop Resolution of BGP Labeled Routes to Tunnels

Use the following CLI syntax to configure next-hop resolution of BGP labeled routes.

CLI Syntax:
config>router>bgp>next-hop-res
    labeled-routes
        transport-tunnel
            [no] family {label-ipv4|label-ipv6|vpn}
                resolution {any | disabled | filter}
                resolution-filter
                    [no] bgp
                    [no] ldp
                    [no] rsvp
                    [no] sr-isis
                    [no] sr-ospf
                    [no] sr-te
                    [no] udp

The transport-tunnel context provides separate control for the different types of BGP label routes: label-IPv4, label-IPv6, and VPN routes (which includes both VPN-IPv4 and VPN-IPv6 routes). By default, all labeled routes resolve to LDP (even if the preceding CLI commands are not configured in the system).

If the resolution option is set to disabled, the default binding to LDP tunnels resumes. If resolution is set to any, the supported tunnel type selection is based on TTM preference. The order of preference of TTM tunnels is: RSVP, SR-TE, LDP, segment routing OSPF, segment routing IS-IS, and UDP.

The rsvp option instructs BGP to search for the best metric RSVP LSP to the address of the BGP next-hop. The address can correspond to the system interface or to another loopback used by the BGP instance on the remote node. The LSP metric is provided by MPLS in the tunnel table. In the case of multiple RSVP LSPs with the same lowest metric, BGP selects the LSP with the lowest tunnel ID.

The ldp option instructs BGP to search for an LDP LSP with a FEC prefix corresponding to the address of the BGP next-hop.

The bgp option instructs BGP to search for a BGP tunnel in TTM with a prefix matching the address of the BGP next-hop. A label-unicast IPv4 route cannot be resolved by another label-unicast IPv4 or IPv6 route. A label-unicast IPv6 route cannot be resolved by another label-unicast IPv6 route, but it can be resolved by a label-unicast IPv4 route.

When the sr-isis or sr-ospf option is enabled, an SR tunnel to the BGP next-hop is selected in the TTM from the lowest preference IS-IS or OSPF instance. If many instances have the same lowest preference, the lowest numbered IS-IS or OSPF instance is chosen.

The sr-te value launches a search for the best metric SR-TE LSP to the address of the BGP next-hop. The LSP metric is provided by MPLS in the tunnel table. In the case of multiple SR-TE LSPs with the same lowest metric, BGP selects the LSP with the lowest tunnel-id.

The udp value instructs BGP to look for an MPLSoUDP tunnel to the address of the BGP next-hop.

If one or more explicit tunnel types are specified using the resolution-filter option, then only these tunnel types are considered, again in TTM preference order. The resolution command must be set to filter to activate the list of tunnel types configured in resolution-filter.
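
The interaction between the resolution mode, the resolution-filter list, and the TTM preference order can be modeled roughly as follows. This is a simplified sketch under stated assumptions: the function name and arguments are hypothetical, the type keywords are written as plain strings, and the bgp tunnel type is left out because the preference list in this section does not rank it.

```python
from typing import Iterable, Optional, Set

# TTM preference order as listed in this section (most preferred first).
TTM_ORDER = ["rsvp", "sr-te", "ldp", "sr-ospf", "sr-isis", "udp"]


def pick_tunnel_type(resolution: str,
                     resolution_filter: Iterable[str],
                     available: Set[str]) -> Optional[str]:
    """Return the tunnel type used to resolve a BGP next-hop, or None.

    resolution: "disabled", "any", or "filter".
    resolution_filter: tunnel types listed under resolution-filter.
    available: tunnel types that currently have a tunnel to the next-hop.
    """
    if resolution == "disabled":
        allowed = {"ldp"}                 # default binding to LDP tunnels
    elif resolution == "any":
        allowed = set(TTM_ORDER)          # every supported tunnel type
    elif resolution == "filter":
        allowed = set(resolution_filter)  # only the explicitly listed types
    else:
        raise ValueError(f"unknown resolution mode: {resolution}")
    for tunnel_type in TTM_ORDER:         # walk in TTM preference order
        if tunnel_type in allowed and tunnel_type in available:
            return tunnel_type
    return None
```

For example, with resolution set to filter and a filter of ldp and udp, an available RSVP LSP is ignored and the LDP tunnel is chosen.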

5.5.3.8. Next-Hop Tracking

In SR OS, next-hop resolution is not a one-time event. If the IP route or tunnel that was used to resolve a BGP next-hop is withdrawn due to a failure or configuration change, an attempt is made to re-resolve the BGP next-hop using the next-best route or tunnel. If there are no more eligible routes or tunnels to resolve the BGP next-hop, then the BGP next-hop becomes unresolved. The continual process of monitoring and reacting to resolving route/tunnel changes is called next-hop tracking. In SR OS, next-hop tracking is completely event driven as opposed to timer driven; this provides the best possible convergence performance.

5.5.3.9. Next-Hop Indirection

SR OS supports next-hop indirection for most types of BGP routes. Next-hop indirection means BGP next-hops are logically separated from resolved next-hops in the forwarding plane (IOMs). This separation allows routes that share the same BGP next-hop to be grouped so that when there is a change to the way a BGP next-hop is resolved, only one forwarding plane update is needed, as opposed to one update for every route in the group. The convergence time after a next-hop resolution change is therefore uniform rather than proportional to the number of prefixes; in other words, next-hop indirection is a technology that supports prefix independent convergence (PIC). SR OS uses next-hop indirection whenever possible; there is no option to disable the functionality.
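
The benefit of indirection can be illustrated with a toy forwarding table in which many prefixes reference one shared next-hop object. The `BgpNextHop` and `Fib` classes below are hypothetical and exist only to show why a resolution change costs one update instead of one per prefix.

```python
class BgpNextHop:
    """Shared indirection object: many routes reference one instance."""

    def __init__(self, address: str, resolved_via: str) -> None:
        self.address = address
        self.resolved_via = resolved_via  # e.g. an interface or tunnel name


class Fib:
    """Toy forwarding table that counts forwarding-plane updates."""

    def __init__(self) -> None:
        self.routes = {}   # prefix -> BgpNextHop (shared reference)
        self.updates = 0   # number of forwarding-plane updates performed

    def install(self, prefix: str, nh: BgpNextHop) -> None:
        self.routes[prefix] = nh
        self.updates += 1

    def re_resolve(self, nh: BgpNextHop, new_path: str) -> None:
        """One update, independent of how many prefixes share nh."""
        nh.resolved_via = new_path
        self.updates += 1

    def lookup(self, prefix: str) -> str:
        return self.routes[prefix].resolved_via
```

After installing 1000 prefixes that share one `BgpNextHop`, a single `re_resolve` call changes the forwarding result for all of them.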

5.5.3.10. Entropy Label for RFC 3107 BGP Labeled Routes

The router supports the MPLS entropy label, as specified in RFC 6790, on RFC 3107 BGP labeled routes. LSR nodes in a network can load-balance labeled packets in a more granular way than by hashing on the standard label stack. Refer to the MPLS Guide for more information.

Entropy Label Capability (ELC) signaling is not supported for labeled routes representing BGP tunnels. Instead, ELC is configured at the head end LER using the configure router bgp override-tunnel-elc command. This command causes the router to ignore any advertisements for ELC that may or may not be received from the network, and instead to assume that the whole domain supports entropy labels.

5.5.4. MED

The Multi-Exit Discriminator (MED) attribute is an optional attribute that can be added to routes advertised to an EBGP peer to influence the flow of inbound traffic to the AS. The MED attribute carries a 32-bit metric value. A lower metric is better than a higher metric when MED is compared by the BGP decision process. Unless the always-compare-med command is configured MED is compared only if the routes come from the same neighbor AS. By default, if a route is received without a MED attribute it is evaluated by the BGP decision process as though it had a MED containing the value 0, but this can be changed so that a missing MED attribute is handled the same as a MED with the maximum value. SR OS always removes the received MED attribute when advertising the route to an EBGP peer.
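
The MED comparison rules in this paragraph can be sketched as a small function. The function and parameter names are hypothetical; `missing_as_worst` stands in for whatever configuration makes a missing MED count as the maximum value.

```python
from typing import Optional

MAX_MED = 2**32 - 1  # MED carries a 32-bit metric


def effective_med(med: Optional[int], missing_as_worst: bool = False) -> int:
    """Treat a missing MED as 0 (default) or as the maximum value."""
    if med is None:
        return MAX_MED if missing_as_worst else 0
    return med


def med_preference(med_a: Optional[int], as_a: int,
                   med_b: Optional[int], as_b: int,
                   always_compare_med: bool = False,
                   missing_as_worst: bool = False) -> int:
    """Return -1 if path A wins on MED, 1 if B wins, 0 if MED does not decide."""
    if not always_compare_med and as_a != as_b:
        return 0  # different neighbor AS: MED is not compared
    a = effective_med(med_a, missing_as_worst)
    b = effective_med(med_b, missing_as_worst)
    return (a > b) - (a < b)  # a lower MED is better
```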

5.5.4.1. Deterministic MED

Deterministic MED is an optional enhancement to the BGP decision process that causes BGP to group paths that are equal up to the MED comparison step based on the neighbor AS. BGP compares the best path from each group to arrive at the overall best path. This change to the BGP decision process makes best path selection completely deterministic in all cases. Without deterministic-med, the overall best path selection is sometimes dependent on the order of route arrival because of the rule that MED cannot be compared in routes from different neighbor AS.

Note:

When BGP routes are leaked into a target BGP RIB, they are not grouped (in a deterministic MED context) with routes learned by that target RIB, even if the neighbor AS happens to be the same.
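
The order dependence that deterministic MED removes can be shown with a deliberately simplified model. In the sketch below, `router_id` stands in for every tie-breaker after MED, and the sequential function assumes each newly arrived path is compared only against the current best; both are assumptions made for illustration, not a description of the SR OS implementation.

```python
from collections import defaultdict
from typing import List, NamedTuple


class Path(NamedTuple):
    name: str
    neighbor_as: int
    med: int
    router_id: int  # stand-in for all tie-breakers after MED


def better(a: Path, b: Path) -> Path:
    """Pairwise rule: MED only within one neighbor AS, else router ID."""
    if a.neighbor_as == b.neighbor_as and a.med != b.med:
        return a if a.med < b.med else b
    return a if a.router_id <= b.router_id else b


def best_sequential(paths: List[Path]) -> Path:
    """Compare each new path against the current best, in arrival order."""
    best = paths[0]
    for p in paths[1:]:
        best = better(best, p)
    return best


def best_deterministic(paths: List[Path]) -> Path:
    """Group by neighbor AS, pick each group's best, then compare winners."""
    groups = defaultdict(list)
    for p in paths:
        groups[p.neighbor_as].append(p)
    winners = [min(g, key=lambda p: (p.med, p.router_id))
               for g in groups.values()]
    # Winners come from different neighbor ASes, so MED no longer applies.
    return min(winners, key=lambda p: p.router_id)
```

With P1 (AS 100, MED 200, ID 1), P2 (AS 200, MED 100, ID 2) and P3 (AS 100, MED 100, ID 3), the sequential result depends on arrival order, while the grouped result is the same for every order.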

5.5.5. Local Preference

The LOCAL_PREF attribute is a well-known attribute that should be included in every route advertised to an IBGP or confederation-EBGP peer. It is used to influence the flow of outbound traffic from the AS. The local preference is a 32-bit value and higher values are more preferred by the BGP decision process. The LOCAL_PREF attribute is not included in routes advertised to EBGP peers (if the attribute is received from an EBGP peer it is ignored).

In SR OS the default local preference is 100 but this can be changed with the local-preference command or using route policies. When a LOCAL_PREF attribute needs to be added to a route because it does not have one (for example, because it was received from an EBGP peer) the value is the configured or default local-preference unless overridden by policy.

5.5.6. Route Aggregation Path Attributes

An aggregate route is a configured IP route that is activated and installed in the routing table when it has at least one contributing route. A route (R) contributes to an aggregate route (S1) if all of the following conditions are true:

  1. the prefix length of (R) is greater than the prefix length of (S1)
  2. the prefix bits of (R) match the prefix bits of (S1) up to the prefix length of (S1)
  3. there is no other active aggregate route (S2) with a longer prefix length than (S1) that meets the previous two conditions
  4. (R) is actively used for forwarding and is not an aggregate route
  5. (R) is accepted by the route policy that is associated with (S1); if there is no configured route policy then (R) is by default considered accepted
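
Conditions 1 to 3 above are purely about prefix containment and can be checked with Python's standard `ipaddress` module. The sketch below is illustrative; the function name is hypothetical, and conditions 4 and 5 (forwarding state and route policy) are not modeled.

```python
import ipaddress
from typing import Iterable


def contributes(route: str, aggregate: str,
                other_active_aggregates: Iterable[str] = ()) -> bool:
    """Check conditions 1 to 3 for route (R) against aggregate (S1)."""
    r = ipaddress.ip_network(route)
    s1 = ipaddress.ip_network(aggregate)
    if r.version != s1.version:
        return False
    # Conditions 1 and 2: R is strictly longer than S1 and falls inside it.
    if not (r.prefixlen > s1.prefixlen and r.subnet_of(s1)):
        return False
    # Condition 3: no longer active aggregate (S2) also covers R.
    for other in other_active_aggregates:
        s2 = ipaddress.ip_network(other)
        if (s2.version == r.version
                and s2.prefixlen > s1.prefixlen
                and r.prefixlen > s2.prefixlen
                and r.subnet_of(s2)):
            return False
    return True
```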

When an aggregate route is activated by a router, it is not installed in the forwarding table by default. In general though, it is advisable to specify the black-hole next-hop option for an aggregate route, so that when it is activated it is installed in the forwarding table with a black-hole next-hop; this avoids the possibility of creating a routing loop. SR OS also supports the option to program an aggregate route into the forwarding table with an indirect next-hop; in this case, packets matching the aggregate route but not a more-specific contributing route are forwarded towards the indirect next-hop rather than discarded.

An active aggregate route can be advertised to a BGP peer (by exporting it into BGP) and this can avoid the need to advertise the more-specific contributing routes to the peer, reducing the number of routes in the peer AS and improving overall scalability. When a router advertises an aggregate route to a BGP peer the attributes in the route are set as follows:

  1. The ATOMIC_AGGREGATE attribute is included in the route if at least one contributing route has the ATOMIC_AGGREGATE attribute or the aggregate route was formed without the as-set option and at least one contributing route has a non-empty AS_PATH. The ATOMIC_AGGREGATE attribute indicates that some of the AS numbers present in the AS paths of the contributing routes are missing from the advertised AS_PATH.
  2. The AGGREGATOR attribute is added to the route. This attribute encodes, by default, the global AS number (or confederation ID) and router ID (BGP identifier) of the router that formed the aggregate, but these values can be changed on a per aggregate route basis using the aggregator command option. The AS number in the AGGREGATOR attribute is either 2 bytes or 4 bytes (if the 4-octet ASN capability was announced by both peers). The router ID in the aggregate routes advertised to a particular set of peers can be set to 0.0.0.0 using the aggregator-id-zero command.
  3. The BGP next-hop is set to the local-address used with the peer receiving the route regardless of the BGP next-hops of the contributing routes.
  4. The ORIGIN attribute is based on the ORIGIN attributes of the contributing routes as described in RFC 4271.
  5. The information in the AS_PATH attribute depends on the as-set option of the aggregate route.
    1. If the as-set option is not specified the AS_PATH of the aggregate route starts as an empty AS path and has elements added per the description in AS Path.
    2. If the as-set option is specified and all the contributing routes have the same AS_PATH then the AS_PATH of the aggregate route starts with that common AS_PATH and has elements added per the description in AS Path.
    3. If the as-set option is specified and some of the contributing routes have different AS paths the AS_PATH of the aggregate route starts with an AS_SET and/or an AS_CONFED_SET and then adds elements per the description in AS Path.
  6. The COMMUNITY attribute contains all of the communities from all of the contributing routes unless the discard-component-communities option is configured for the aggregate route. It also contains the communities associated directly with the aggregate route itself (up to 12 per aggregate route).
  7. No MED attribute is included by default.
    Note:

    SR OS does not require all the contributing routes to have the same MED value.

5.5.7. Community Attributes

A BGP route can be associated with one or more communities. There are three kinds of BGP communities:

  1. standard communities (each 4 bytes in length, all packed into a path attribute with type code 8)
  2. extended communities (each 8 bytes in length, potentially many possible subtypes, all packed into a path attribute with type code 16)
  3. large communities (each 12 bytes in length, all packed into a path attribute with type code 32)

5.5.7.1. Standard Communities

In a standard community, the first two bytes usually encode the AS number of the administrative entity that assigned the value in the last two bytes. In SR OS, a standard community value is configured using the format <asnum:comm-value>; the colon is a required separator character. In route policy applications, multiple standard community values can be matched with a regular expression in the format <regex1>:<regex2>, where regex1 and regex2 are two regular expressions that are evaluated one numerical digit at a time.
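
As a concrete illustration of the two-byte/two-byte layout, the following sketch converts between the `asnum:comm-value` text form and the 32-bit value. The function names are hypothetical.

```python
def encode_standard_community(text: str) -> int:
    """Convert "asnum:comm-value" into the 32-bit community value."""
    asn_text, sep, value_text = text.partition(":")
    if sep != ":":
        raise ValueError("the colon separator is required")
    asn, value = int(asn_text), int(value_text)
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError("each part must fit in two bytes")
    return (asn << 16) | value  # AS number in the first two bytes


def decode_standard_community(raw: int) -> str:
    """Convert a 32-bit community value back to "asnum:comm-value"."""
    return f"{raw >> 16}:{raw & 0xFFFF}"
```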

The following well-known standard communities are understood and acted upon accordingly by SR OS routers.

  1. NO_EXPORT: When a route carries this community, it must not be advertised outside a confederation boundary (for example, to EBGP peers).
  2. NO_ADVERTISE: When a route carries this community, it must not be advertised to any other BGP peer.
  3. NO_EXPORT_SUBCONFED: When a route carries this community, it must not be advertised outside a member AS boundary (for example, to confed-EBGP peers or EBGP peers).
  4. LLGR_STALE: When a route carries this community, it indicates that the route was propagated by a router that is a long-lived graceful restart helper and normally (in the absence of LLGR) the route would have been withdrawn.
  5. NO_LLGR: When a route carries this community, it indicates that the route should not be retained and used past the normal graceful restart window of time.
  6. BLACKHOLE: When a route carries this community, it indicates that the route should be installed into the FIB with a blackhole next-hop.

Standard communities can be added to or removed from BGP routes using BGP import and export policies. When a BGP route is locally originated by exporting a static or aggregate route into BGP, and the static or aggregate route has one or more standard communities, these community values are automatically added to the BGP route. This may affect the advertisement of the locally originated route if one of the well-known communities is associated with the static or aggregate route.

To remove all the standard communities from all routes advertised to a BGP peer, use the disable-communities standard command.

5.5.7.2. Extended Communities

Extended communities serve specialized roles. Each extended community is eight bytes. The first one or two bytes identify the type or sub-type and the remaining six or seven bytes identify a value. Some of the more common extended communities supported by SR OS include:

  1. Transitive 2-octet AS-specific
    1. Route target (type 0x0002)
    2. Route origin (type 0x0003)
    3. OSPF domain ID (type 0x0005)
    4. Source AS (type 0x0009)
    5. L2VPN identifier (type 0x000A)
  2. Non-transitive 2-octet AS-specific
    1. Link bandwidth (type 0x4004)
  3. Transitive 4-octet AS-specific
    1. Route target (type 0x0202)
    2. Route origin (type 0x0203)
    3. OSPF domain ID (type 0x0205)
    4. Source AS (type 0x0209)
  4. Transitive IPv4-address-specific
    1. Route target (type 0x0102)
    2. Route origin (type 0x0103)
    3. OSPF domain ID (type 0x0105)
    4. L2VPN identifier (type 0x010A)
    5. VRF route import (type 0x010B)
  5. Transitive opaque
    1. OSPF route type (type 0x0306)
    2. Color extended community (type 0x030B)
  6. Non-transitive opaque
    1. BGP origin validation state (type 0x4300)
  7. Transitive experimental
    1. FlowSpec traffic rate (type 0x8006)
    2. FlowSpec traffic action (type 0x8007)
    3. FlowSpec redirect (type 0x8008)
    4. FlowSpec traffic-remarking (type 0x8009)
    5. Layer 2 info (type 0x800A)
  8. Transitive FlowSpec
    1. FlowSpec interface-set (type 0x0702)
  9. Non-transitive FlowSpec
    1. FlowSpec interface-set (type 0x4702)
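
To make the eight-byte layout concrete, the sketch below builds a transitive 2-octet AS-specific route target (type 0x0002 in the list above) and reads back the two-byte type code. It follows the standard extended community encoding (2-byte type and sub-type, 2-byte AS number, 4-byte value); the function names are hypothetical.

```python
import struct


def encode_route_target_2byte_as(asn: int, value: int) -> bytes:
    """Build an 8-byte transitive 2-octet AS-specific route target (0x0002)."""
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFFFFFF):
        raise ValueError("asn must fit in 2 bytes and value in 4 bytes")
    # Network byte order: type and sub-type (2), AS number (2), value (4).
    return struct.pack("!HHI", 0x0002, asn, value)


def type_code(ext_community: bytes) -> int:
    """Return the 2-byte type/sub-type code as written in the list above."""
    if len(ext_community) != 8:
        raise ValueError("an extended community is exactly 8 bytes")
    return struct.unpack("!H", ext_community[:2])[0]
```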

Extended communities can be added to or removed from BGP routes using BGP import and export policies. When a BGP route is locally originated by exporting a static or aggregate route into BGP, and the static or aggregate route has one or more extended communities, these community values are automatically added to the BGP route.

Note:

While it may not make sense to add certain types of extended communities to routes of certain address families, SR OS allows such actions.

To remove all the extended communities from all routes advertised to a BGP peer, use the disable-communities extended command.

5.5.7.3. Large Communities

Each large community is a 12-byte value, formed from the logical concatenation of three 4-octet values: a Global Administrator part, a Local Data Part 1, and a Local Data Part 2. The Global Administrator is a four-octet namespace identifier, which should be an Autonomous System Number assigned by IANA. The Global Administrator field is intended to allow different Autonomous Systems to define large communities without collision. Local Data Part 1 is a four-octet operator-defined value and Local Data Part 2 is another four-octet operator-defined value.

In SR OS, a large community value is configured using the format <ext-asnum>:<ext-comm-val>:<ext-comm-val>; the colon is a required separator character between each of the 4-byte values. In route policy applications, it is possible to match multiple large community values with a regular expression in the format <regex1>&<regex2>&<regex3>, where regex1, regex2 and regex3 are three regular expressions, each evaluated one numerical (decimal) digit at a time.
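
The colon-separated text form maps directly onto the three 4-octet fields. A minimal parsing and encoding sketch, with hypothetical function names, might look like this:

```python
import struct
from typing import Tuple


def parse_large_community(text: str) -> Tuple[int, int, int]:
    """Split "global-admin:local-data-1:local-data-2" into three integers."""
    parts = text.split(":")
    if len(parts) != 3:
        raise ValueError("expected three colon-separated values")
    ga, ld1, ld2 = (int(p) for p in parts)
    for v in (ga, ld1, ld2):
        if not 0 <= v <= 0xFFFFFFFF:
            raise ValueError("each value must fit in four bytes")
    return ga, ld1, ld2


def encode_large_community(text: str) -> bytes:
    """Concatenate the three 4-octet values in network byte order."""
    return struct.pack("!III", *parse_large_community(text))
```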

Large communities can be added to or removed from BGP routes using BGP import and export policies. When a BGP route is locally originated by exporting a static or aggregate route into BGP, and the static or aggregate route has one or more large communities, these community values are automatically added to the BGP route.

To remove all the large communities from all routes advertised to a BGP peer, use the disable-communities large command.

5.5.8. Route Reflection Attributes

The ORIGINATOR_ID and CLUSTER_LIST are optional non-transitive attributes that play a role in route reflection, as described in the section titled Route Reflection.

5.5.9. Multi-Protocol BGP Attributes

As discussed in the BGP chapter overview, the uses of BGP have increased well beyond Internet IPv4 routing due to its support for multi-protocol extensions, or more simply MP-BGP. MP-BGP allows BGP peers to exchange routes for NLRI other than IPv4 prefixes (for example, IPv6 prefixes, Layer 3 VPN routes, Layer 2 VPN routes, and FlowSpec rules). A BGP router that supports MP-BGP indicates the types of routes it wants to exchange with a peer by including the corresponding AFI (Address Family Identifier) and SAFI (Subsequent Address Family Identifier) values in the MP-BGP capability of its OPEN message. The two peers forming a session do not need to indicate support for the same address families. As long as there is one AFI/SAFI in common, the session establishes and routes associated with all the common AFI/SAFI can be exchanged between the peers.

The list of AFI/SAFI advertised in the MP-BGP capability is controlled entirely by the family commands. The AFI/SAFI supported by SR OS and the method of configuring the AFI/SAFI support are summarized in Table 5.

Table 5:  Multi-Protocol BGP Support in SR OS

Name                  AFI  SAFI  Configuration Commands
--------------------  ---  ----  ----------------------
IPv4 unicast          1    1     family ipv4
IPv4 multicast        1    2     family mcast-ipv4
IPv4 labeled unicast  1    4     family label-ipv4
NG-MVPN IPv4          1    5     family mvpn-ipv4
MDT-SAFI              1    66    family mdt-safi
VPN-IPv4              1    128   family vpn-ipv4
VPN-IPv4 multicast    1    129   family mcast-vpn-ipv4
RT constrain          1    132   family route-target
IPv4 flow-spec        1    133   family flow-ipv4
IPv6 unicast          2    1     family ipv6
IPv6 multicast        2    2     family mcast-ipv6
IPv6 labeled unicast  2    4     family label-ipv6
NG-MVPN IPv6          2    5     family mvpn-ipv6
VPN-IPv6              2    128   family vpn-ipv6
IPv6 flow-spec        2    133   family flow-ipv6
Multi-segment PW      25   6     family ms-pw
L2 VPN                25   65    family l2-vpn
EVPN                  25   70    family evpn
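
The capability exchange described before Table 5 amounts to a set intersection over AFI/SAFI pairs. The sketch below copies the values from Table 5 into a lookup table; the function name and data structures are hypothetical and are not an SR OS interface.

```python
from typing import Iterable, List, Set, Tuple

# (AFI, SAFI) per family keyword, copied from Table 5.
FAMILY_AFI_SAFI = {
    "ipv4": (1, 1), "mcast-ipv4": (1, 2), "label-ipv4": (1, 4),
    "mvpn-ipv4": (1, 5), "mdt-safi": (1, 66), "vpn-ipv4": (1, 128),
    "mcast-vpn-ipv4": (1, 129), "route-target": (1, 132),
    "flow-ipv4": (1, 133),
    "ipv6": (2, 1), "mcast-ipv6": (2, 2), "label-ipv6": (2, 4),
    "mvpn-ipv6": (2, 5), "vpn-ipv6": (2, 128), "flow-ipv6": (2, 133),
    "ms-pw": (25, 6), "l2-vpn": (25, 65), "evpn": (25, 70),
}


def negotiated_families(local_families: Iterable[str],
                        remote_afi_safi: Set[Tuple[int, int]]) -> List[str]:
    """Families whose AFI/SAFI appear in both peers' MP-BGP capabilities."""
    return [f for f in local_families
            if FAMILY_AFI_SAFI[f] in remote_afi_safi]
```

A non-empty result corresponds to the case where at least one AFI/SAFI is common and routes for the common families can be exchanged.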

To advertise reachable routes of a particular AFI/SAFI a BGP router includes a single MP_REACH_NLRI attribute in the UPDATE message. The MP_REACH_NLRI attribute encodes the AFI, the SAFI, the BGP next-hop and all the reachable NLRI. To withdraw routes of a particular AFI/SAFI a BGP router includes a single MP_UNREACH_NLRI attribute in the UPDATE message. The MP_UNREACH_NLRI attribute encodes the AFI, the SAFI and all the withdrawn NLRI. While it is valid to advertise and withdraw IPv4 unicast routes using the MP_REACH_NLRI and MP_UNREACH_NLRI attributes, SR OS always uses the IPv4 fields of the UPDATE message to convey reachable and unreachable IPv4 unicast routes.

5.5.10. 4-Octet AS Attributes

The AS4_PATH and AS4_AGGREGATOR path attributes are optional transitive attributes that support the gradual migration to routers that can understand and parse 4-octet AS numbers. The use of these attributes is discussed in the section titled 4-Octet Autonomous System Numbers.

5.5.11. AIGP Metric

The accumulated IGP (AIGP) metric is an optional non-transitive attribute that can be attached to selected routes (using route policies) to influence the BGP decision process to prefer BGP paths with a lower end-to-end IGP cost, even when the compared paths span more than one AS or IGP instance. AIGP is different from MED in several important ways:

  1. AIGP is not intended to be transitive between completely distinct autonomous systems (only across internal AS boundaries)
  2. AIGP is always compared in paths that have the attribute, regardless of whether or not they come from different neighbor AS
  3. AIGP is more important than MED in the BGP decision process (see the section titled BGP Decision Process)
  4. AIGP is automatically incremented every time there is a BGP next-hop change so that it can track the end-to-end IGP cost. All arithmetic operations on MED attributes must be done manually (for example, using route policies).

In the SR OS implementation, AIGP is supported only in the base router BGP instance and only for the following types of routes: IPv4, label-IPv4, IPv6 and label-IPv6. The AIGP attribute is only sent to peers configured with the aigp command. If the attribute is received from a peer that is not configured for aigp or if the attribute is received in a non-supported route type the attribute is discarded and not propagated to other peers (but it is still displayed in BGP show commands).

When a 7450, 7750, or 7950 router receives a route with an AIGP attribute and it re-advertises the route to an AIGP-enabled peer without any change to the BGP next-hop the AIGP metric value is unchanged by the advertisement (RIB-OUT) process. But if the route is re-advertised with a new BGP next-hop the AIGP metric value is automatically incremented by the route table (or tunnel table) cost to reach the received BGP next-hop and/or by a statically configured value (using route policies).
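
The re-advertisement rule above can be summarized in a few lines. This sketch assumes that the route-table (or tunnel-table) cost and the policy-configured value are simply added when both apply; the text says "and/or", so treat that as an assumption. All names are hypothetical.

```python
def advertised_aigp(received_aigp: int,
                    next_hop_changed: bool,
                    cost_to_received_next_hop: int = 0,
                    policy_increment: int = 0) -> int:
    """AIGP value placed in the RIB-OUT copy of a re-advertised route."""
    if not next_hop_changed:
        # BGP next-hop preserved: the metric is passed through unchanged.
        return received_aigp
    # New BGP next-hop: add the cost to reach the received next-hop and,
    # if configured, a static value from route policy (assumed additive).
    return received_aigp + cost_to_received_next_hop + policy_increment
```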

5.6. BGP Routing Information Base (RIB)

The entire set of BGP routes learned and advertised by a BGP router makes up its BGP Routing Information Base (RIB). Conceptually the BGP RIB can be divided into 3 parts:

  1. RIB-IN
  2. LOC-RIB
  3. RIB-OUT

The RIB-IN (or Adj-RIBs-In as defined in RFC 4271) holds the BGP routes that were received from peers and that the router decided to keep (store in its memory).

The LOC-RIB contains modified versions of the BGP routes in the RIB-IN. The path attributes of a RIB-IN route can be modified using BGP import policies. All of the LOC-RIB routes for the same NLRI are compared in a procedure called the BGP decision process that results in the selection of the best path for each NLRI. The best paths in the LOC-RIB are the ones that are actually ‘usable’ by the local router for forwarding, filtering, auto-discovery, and so on.

The RIB-OUT (or Adj-RIBs-Out as defined in RFC 4271) holds the BGP routes that were advertised to peers. Normally a BGP route is not advertised to a peer (in the RIB-OUT) unless it is ‘used’ locally but there are exceptions. BGP export policies modify the path attributes of a LOC-RIB route to create the path attributes of the RIB-OUT route. A particular LOC-RIB route can be advertised with different path attribute values to different peers so there can exist a 1:N relationship between LOC-RIB and RIB-OUT routes.

The following sections describe many important BGP features in the context of the RIB architecture outlined above.

5.6.1. RIB-IN Features

SR OS implements the following features related to RIB-IN processing:

  1. UPDATE message fault tolerance. This is described in the section titled UPDATE Message Error Handling.
  2. BGP import policies

5.6.1.1. BGP Import Policies

The import command is used to apply one or more policies (up to 15) to a neighbor, group or to the entire BGP context. The import command that is most-specific to a peer is the one that is applied. An import policy command applied at the neighbor level takes precedence over the same command applied at the group or global level. An import policy command applied at the group level takes precedence over the same command specified on the global level. The import policies applied at different levels are not cumulative. The policies listed in an import command are evaluated in the order in which they are specified.
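
The precedence rule (most specific level wins, no merging across levels) can be sketched as follows. The function is hypothetical; `None` represents a level where no import command is configured.

```python
from typing import List, Optional, Sequence

MAX_POLICIES = 15  # limit per import command, as stated above


def effective_import(neighbor: Optional[Sequence[str]],
                     group: Optional[Sequence[str]],
                     global_: Optional[Sequence[str]]) -> List[str]:
    """Return the single import policy list that applies to a peer."""
    for level in (neighbor, group, global_):  # most specific first
        if level is not None:
            if len(level) > MAX_POLICIES:
                raise ValueError("at most 15 policies per import command")
            return list(level)  # not merged with less specific levels
    return []  # no import command configured at any level
```

The returned list preserves the configured order, which is also the order in which the policies are evaluated.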

Note:

The import command can reference a policy before it has been created (as a policy-statement).

When an IP route is rejected by an import policy it is still maintained in the RIB-IN so that a policy change can be made later on without requiring the peer to re-send all its RIB-OUT routes. This is sometimes called soft reconfiguration inbound and requires no special configuration in SR OS.

When a VPN route is rejected by an import policy or not imported by any services it is deleted from the RIB-IN. For VPN-IPv4 and VPN-IPv6 routes this behavior can be changed by configuring the mp-bgp-keep command; this option maintains rejected VPN-IP routes in the RIB-IN so that a Route Refresh message does not have to be issued when there is an import policy change.

5.6.2. LOC-RIB Features

SR OS implements the following features related to LOC-RIB processing.

  1. BGP decision process
  2. BGP route installation in the route table
  3. BGP route installation in the tunnel table
  4. BGP fast reroute
  5. QoS Policy Propagation via BGP (QPPB)
  6. Policy accounting
  7. Route flap damping (RFD)

These features are discussed in the following sections.

5.6.2.1. BGP Decision Process

When a BGP router has multiple paths in its RIB for the same NLRI, its BGP decision process is responsible for deciding which path is the best. The best path can be used by the local router and advertised to other BGP peers.

On 7450 ESS, 7750 SR, and 7950 XRS routers, the BGP decision process orders received paths based on the following sequence of comparisons. If there is a tie between paths at any step, BGP proceeds to the next step.

  1. Select a valid route over an invalid route. If a BGP route is invalid because its next-hop is not resolved then it may still be advertisable if there are no valid routes. For example, an unresolved route can be reflected by a route reflector if it is not trying to set next-hop-self.
  2. Prefer a route for which disable-route-table-install does not apply over a route for which disable-route-table-install has been specified.
  3. Prefer a non-stale route over a stale route (in the context of long-lived graceful restart).
  4. A default route generated by the send-default command is less preferred than a default route programmed by other means.
  5. Select the route with the lowest origin validation state, where Valid<Not-Found<Invalid.
  6. Select the route with the numerically lowest route-table preference. For VPN-IP routes this also considers the number of VPRNs that imported the route.
  7. Select the route with the highest local preference.
  8. Select the route with an AIGP metric. If they both have an AIGP metric, select the route with the lowest sum of:
    1. AIGP metric value stored with the LOC-RIB copy of the route
    2. route-table or tunnel-table cost between the calculating router and the BGP next-hop in the received route
  9. Select the route with the shortest AS path. AS numbers in AS_CONFED_SEQ and AS_CONFED_SET elements do not count towards the AS path length. This step is skipped if as-path-ignore is configured for the address family.
  10. Select the route with the lowest Origin (IGP<EGP<Incomplete).
  11. Select the route with the lowest MED. Only compare MED for non-imported routes that have the same neighbor AS by default. A missing MED attribute is considered equivalent to a MED value 0 by default. Defaults can be changed by using the always-compare-med command.
  12. Select the route with the lowest owner type (BGP-label < BGP < BGP-VPN).
  13. Prefer routes learned from EBGP peers over routes learned from IBGP and confed-EBGP peers.
  14. Select the route with the lowest route-table or tunnel-table cost to the NEXT_HOP. This step is skipped if ignore-nh-metric is configured, or if the routes are associated with different RIBs. For VPN-IP routes received by a router without any configured VPRN services, next-hop cost is determined from the route-table cost.
  15. Select the route with the lowest next-hop type. Resolutions made in the route table are preferred to resolutions made in the tunnel-table. This step is skipped if ignore-nh-metric is configured, or if the routes are associated with different RIBs.
  16. Select the route received from the peer with the lowest router ID; this comes from the ORIGINATOR_ID attribute (if present) or the BGP identifier of the peer (received in its OPEN message). If ignore-router-id is configured, keep the current best path and skip the remaining steps.
  17. Select the route with the shortest CLUSTER_LIST length.
  18. Select the route received from the peer with the lowest IP address.
  19. For VPN-IP routes imported into a VPRN, select the route with the lowest route-distinguisher value.
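
A subset of the steps above can be expressed as a tuple sort key, where the first differing element decides. The sketch below covers only steps 1, 7, 9, 10, 13, 14, 16, 17 and 18; it omits the MED step (which cannot be expressed as a simple key because it applies only within one neighbor AS) and all SR OS-specific steps. The `Path` fields are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import List

ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}  # IGP < EGP < Incomplete


@dataclass
class Path:
    valid: bool
    local_pref: int
    as_path_len: int
    origin: str
    is_ebgp: bool
    nh_cost: int
    router_id: str
    cluster_list_len: int
    peer_ip: str


def sort_key(p: Path):
    """Lower tuple wins; elements follow the order of the listed steps."""
    return (
        not p.valid,                               # step 1: valid first
        -p.local_pref,                             # step 7: highest wins
        p.as_path_len,                             # step 9: shortest wins
        ORIGIN_RANK[p.origin],                     # step 10
        not p.is_ebgp,                             # step 13: EBGP preferred
        p.nh_cost,                                 # step 14: lowest cost
        int(ipaddress.IPv4Address(p.router_id)),   # step 16: lowest router ID
        p.cluster_list_len,                        # step 17
        int(ipaddress.ip_address(p.peer_ip)),      # step 18: lowest peer IP
    )


def best_path(paths: List[Path]) -> Path:
    return min(paths, key=sort_key)
```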

5.6.2.2. BGP Route Installation in the Route Table

Each BGP RIB with IP routes (unlabeled IPv4, labeled-unicast IPv4, unlabeled IPv6, and labeled-unicast IPv6) submits its best path for each prefix to the common IP route table, unless the disable-route-table-install command is configured or the selective-label-ipv4-install command has prevented the installation. The best path is selected by the BGP decision process. The default preference for BGP routes submitted by the label-IPv4 and label-IPv6 RIBs (these appear in the route table and FIB as having a BGP-LABEL protocol type) can be modified by using the label-preference command. The default preference for BGP routes submitted by the unlabeled IPv4 and IPv6 RIBs can be modified by using the preference command.

Note:

The BGP instance level disable-route-table-install command can be configured on control-plane route reflectors that are not involved in packet forwarding (that is, those that do not modify the BGP next hop). This command improves performance and scalability. The disable-route-table-install policy action can be applied to BGP routes matching a peer import policy to conserve FIB space on a router that is in the datapath, for example, a router that should advertise BGP routes with itself as next hop even though it has not installed those routes into its own forwarding table.

If a BGP RIB has multiple BGP paths for the same IPv4 or IPv6 prefix that qualify as the best path up to a certain point in the comparison process, then a certain number of these multi-paths can be submitted to the common IP route table. This is called BGP multi-path and must be explicitly enabled using one or more commands in the multi-path context. These commands specify the maximum number of BGP paths, including the overall best path, that each BGP RIB can submit to the route table for any particular IPv4 or IPv6 prefix. If ECMP, with a limit of n, is enabled in the base router instance, then up to n paths are selected for installation in the IP FIB. In the data-path, traffic matching the IP route is load-shared across the ECMP next hops based on a per-packet hash calculation.

By default, the hashing is not sticky, meaning that when one or more of the ECMP BGP next hops fail, all traffic flows matching the route are potentially moved to new BGP next hops. If required, a BGP route can be marked (using the sticky-ecmp action in route policies) for sticky ECMP behavior so that BGP next hop failures are handled by moving only the affected traffic flows to the remaining next hops as evenly as possible. If new ECMP BGP next hops become available for a marked BGP route, then flows are moved as evenly as possible onto the resultant set of next hops.

Each sticky ECMP route utilizes 64 distribution buckets in order to apportion flows onto the available next hops. Figure 20, Figure 21, and Figure 22 provide an example of the distribution of flows over multiple BGP next hops as next hops are removed.

Figure 20:  Sticky ECMP Flow Distribution as Next Hops are Removed Part 1 
Figure 21:  Sticky ECMP Flow Distribution as Next Hops are Removed Part 2 
Figure 22:  Sticky ECMP Flow Distribution as Next Hops are Removed Part 3 

Table 6 lists the sticky ECMP flow distribution as next hops are removed for 1.1.1.1/32.

Table 6:  Sticky ECMP Flow Distribution as Next Hops are Removed for 1.1.1.1/32   

| Bucket | NH, Initial Sticky ECMP Distribution (Figure 20) | NH if Next Hop 3 Fails (Figure 21) | NH if Next Hop 2 Subsequently Fails (Figure 22) |
|--------|--------------------------------------------------|------------------------------------|--------------------------------------------------|
| 00 | 1 | 1 | 1 |
| 01 | 2 | 2 | 1 |
| 02 | 3 | 1 | 1 |
| 03 | 1 | 1 | 1 |
| 04 | 2 | 2 | 1 |
| 05 | 3 | 2 | 1 |
| 06 | 1 | 1 | 1 |
| 07 | 2 | 2 | 1 |
| 08 | 3 | 1 | 1 |
| 09 | 1 | 1 | 1 |
| 10 | 2 | 2 | 1 |
| 11 | 3 | 2 | 1 |
| 12 | 1 | 1 | 1 |
| 13 | 2 | 2 | 1 |
| 14 | 3 | 1 | 1 |
| 15 | 1 | 1 | 1 |
| 16 | 2 | 2 | 1 |
| 17 | 3 | 2 | 1 |
| 18 | 1 | 1 | 1 |
| 19 | 2 | 2 | 1 |
| 20 | 3 | 1 | 1 |
| 21 | 1 | 1 | 1 |
| 22 | 2 | 2 | 1 |
| 23 | 3 | 2 | 1 |
| 24 | 1 | 1 | 1 |
| 25 | 2 | 2 | 1 |
| 26 | 3 | 1 | 1 |
| 27 | 1 | 1 | 1 |
| 28 | 2 | 2 | 1 |
| 29 | 3 | 2 | 1 |
| 30 | 1 | 1 | 1 |
| 31 | 2 | 2 | 1 |
| 32 | 3 | 1 | 1 |
| 33 | 1 | 1 | 1 |
| 34 | 2 | 2 | 1 |
| 35 | 3 | 2 | 1 |
| 36 | 1 | 1 | 1 |
| 37 | 2 | 2 | 1 |
| 38 | 3 | 1 | 1 |
| 39 | 1 | 1 | 1 |
| 40 | 2 | 2 | 1 |
| 41 | 3 | 2 | 1 |
| 42 | 1 | 1 | 1 |
| 43 | 2 | 2 | 1 |
| 44 | 3 | 1 | 1 |
| 45 | 1 | 1 | 1 |
| 46 | 2 | 2 | 1 |
| 47 | 3 | 2 | 1 |
| 48 | 1 | 1 | 1 |
| 49 | 2 | 2 | 1 |
| 50 | 3 | 1 | 1 |
| 51 | 1 | 1 | 1 |
| 52 | 2 | 2 | 1 |
| 53 | 3 | 2 | 1 |
| 54 | 1 | 1 | 1 |
| 55 | 2 | 2 | 1 |
| 56 | 3 | 1 | 1 |
| 57 | 1 | 1 | 1 |
| 58 | 2 | 2 | 1 |
| 59 | 3 | 2 | 1 |
| 60 | 1 | 1 | 1 |
| 61 | 2 | 2 | 1 |
| 62 | 3 | 1 | 1 |
| 63 | 1 | 1 | 1 |
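
The removal behavior in Table 6 can be illustrated with a short Python sketch. It assumes (this is not stated in the text) that buckets are initially assigned round-robin and that the buckets of a failed next hop are handed out round-robin to the surviving next hops. The function names are hypothetical and the sketch is not the router's actual implementation; under these assumptions it yields the same per-bucket values as Table 6.

```python
NUM_BUCKETS = 64  # each sticky ECMP route uses 64 distribution buckets


def initial_buckets(next_hops):
    """Assign buckets round-robin over the next hops (assumed initial layout)."""
    return [next_hops[i % len(next_hops)] for i in range(NUM_BUCKETS)]


def remove_next_hop(buckets, failed):
    """Move only the buckets of the failed next hop; all others stay put."""
    survivors = sorted(set(buckets) - {failed})
    if not survivors:
        raise ValueError("no next hop left")
    result, k = [], 0
    for nh in buckets:
        if nh == failed:
            # hand the orphaned bucket to the next survivor in turn
            result.append(survivors[k % len(survivors)])
            k += 1
        else:
            result.append(nh)
    return result


start = initial_buckets([1, 2, 3])
after_nh3 = remove_next_hop(start, 3)      # next hop 3 fails
after_nh2 = remove_next_hop(after_nh3, 2)  # next hop 2 subsequently fails
print(after_nh3[:6])  # buckets 00-05 after next hop 3 fails
```

Flows hashed to buckets that already pointed at next hop 1 or 2 are untouched by the first failure, which is the property that distinguishes sticky ECMP from the default rehashing.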

Figure 23, Figure 24, and Figure 25 provide an example of the distribution of flows over multiple BGP next hops as next hops are added.

Figure 23:  Sticky ECMP Flow Distribution as Next Hops are Added Part 1 
Figure 24:  Sticky ECMP Flow Distribution as Next Hops are Added Part 2 
Figure 25:  Sticky ECMP Flow Distribution as Next Hops are Added Part 3 

Table 7 lists the sticky ECMP flow distribution as next hops are added for 1.1.1.1/32.

Table 7:  Sticky ECMP Flow Distribution as Next Hops are Added for 1.1.1.1/32  

| Bucket | NH, Initial Sticky ECMP Distribution (Figure 23) | NH if Next Hop 2 Becomes Available (Figure 24) | NH if Next Hop 3 Additionally Becomes Available (Figure 25) |
|--------|--------------------------------------------------|------------------------------------------------|--------------------------------------------------------------|
| 00 | 1 | 1 | 1 |
| 01 | 1 | 2 | 2 |
| 02 | 1 | 1 | 3 |
| 03 | 1 | 2 | 2 |
| 04 | 1 | 1 | 1 |
| 05 | 1 | 2 | 3 |
| 06 | 1 | 1 | 1 |
| 07 | 1 | 2 | 2 |
| 08 | 1 | 1 | 3 |
| 09 | 1 | 2 | 2 |
| 10 | 1 | 1 | 1 |
| 11 | 1 | 2 | 3 |
| 12 | 1 | 1 | 1 |
| 13 | 1 | 2 | 2 |
| 14 | 1 | 1 | 3 |
| 15 | 1 | 2 | 2 |
| 16 | 1 | 1 | 1 |
| 17 | 1 | 2 | 3 |
| 18 | 1 | 1 | 1 |
| 19 | 1 | 2 | 2 |
| 20 | 1 | 1 | 3 |
| 21 | 1 | 2 | 2 |
| 22 | 1 | 1 | 1 |
| 23 | 1 | 2 | 3 |
| 24 | 1 | 1 | 1 |
| 25 | 1 | 2 | 2 |
| 26 | 1 | 1 | 3 |
| 27 | 1 | 2 | 2 |
| 28 | 1 | 1 | 1 |
| 29 | 1 | 2 | 3 |
| 30 | 1 | 1 | 1 |
| 31 | 1 | 2 | 2 |
| 32 | 1 | 1 | 3 |
| 33 | 1 | 2 | 2 |
| 34 | 1 | 1 | 1 |
| 35 | 1 | 2 | 3 |
| 36 | 1 | 1 | 1 |
| 37 | 1 | 2 | 2 |
| 38 | 1 | 1 | 3 |
| 39 | 1 | 2 | 2 |
| 40 | 1 | 1 | 1 |
| 41 | 1 | 2 | 3 |
| 42 | 1 | 1 | 1 |
| 43 | 1 | 2 | 2 |
| 44 | 1 | 1 | 3 |
| 45 | 1 | 2 | 2 |
| 46 | 1 | 1 | 1 |
| 47 | 1 | 2 | 3 |
| 48 | 1 | 1 | 1 |
| 49 | 1 | 2 | 2 |
| 50 | 1 | 1 | 3 |
| 51 | 1 | 2 | 2 |
| 52 | 1 | 1 | 1 |
| 53 | 1 | 2 | 3 |
| 54 | 1 | 1 | 1 |
| 55 | 1 | 2 | 2 |
| 56 | 1 | 1 | 3 |
| 57 | 1 | 2 | 2 |
| 58 | 1 | 1 | 1 |
| 59 | 1 | 2 | 3 |
| 60 | 1 | 1 | 1 |
| 61 | 1 | 2 | 2 |
| 62 | 1 | 1 | 3 |
| 63 | 1 | 2 | 2 |
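
The addition case can be sketched in a similar way. The Python below is a minimal illustration, assuming that a new next hop is given a fair share of the 64 buckets by taking one bucket at a time from whichever existing next hop currently holds the most. This donor rule and the function name are assumptions; the sketch yields the same number of buckets per next hop as Table 7 (32/32, then 21/22/21), but it does not try to reproduce which individual buckets move.

```python
from collections import Counter

NUM_BUCKETS = 64  # each sticky ECMP route uses 64 distribution buckets


def add_next_hop(buckets, new_nh):
    """Give the new next hop a fair share of buckets, moving nothing else."""
    if new_nh in buckets:
        raise ValueError("next hop already in use")
    result = list(buckets)
    counts = Counter(result)                    # buckets held by each donor
    share = NUM_BUCKETS // (len(counts) + 1)    # fair share for the newcomer
    for _ in range(share):
        donor = max(sorted(counts), key=counts.__getitem__)  # most loaded
        last = len(result) - 1 - result[::-1].index(donor)   # donor's last bucket
        result[last] = new_nh
        counts[donor] -= 1
    return result


one = [1] * NUM_BUCKETS        # only next hop 1 is available
two = add_next_hop(one, 2)     # next hop 2 becomes available
three = add_next_hop(two, 3)   # next hop 3 additionally becomes available
print(sorted(Counter(three).items()))  # bucket count per next hop
```

Every bucket that changes ownership moves to the new next hop, so flows on buckets that are not selected keep their existing next hop.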

A BGP route to an IPv4 or IPv6 prefix is a candidate for installation as an ECMP next hop only if it meets all of the following criteria:

  1. The multi-path route must be the same type of route as the best path (same AFI/SAFI and, in some cases, same next-hop resolution method).
  2. The multi-path route must be tied with the best path for all criteria of greater significance than next-hop cost, except for criteria that are configured to be ignored.
  3. If the best path selection reaches the next-hop cost comparison, the multi-path route must have the same next-hop cost as the best route unless the unequal-cost option is configured.
  4. The multi-path route must not have the same BGP next hop as the best path or any other multi-path route.
  5. The multi-path route must not cause the ECMP limit of the routing instance to be exceeded (configured using the ecmp command with a value in the range 1 to 64).
  6. The multi-path route must not cause the applicable max-paths limit to be exceeded. If the best path is an EBGP-learned route and the ebgp option is used, the ebgp-max-paths limit overrides the max-paths limit. If the best path is an IBGP-learned route and the ibgp option is used, the ibgp-max-paths limit overrides the max-paths limit. All path limits are configurable up to a maximum of 64. Multi-path is disabled if the value is set to 1.
  7. The multi-path route must have the same neighbor AS in its AS path as the best path if the restrict same-neighbor-as option is configured. By default, any path with the same AS path length as the best path (regardless of neighbor AS) is eligible for multi-path.
  8. The route must have the same AS path as the best path if the restrict exact-as-path option is configured. By default, any path with the same AS path length as the best path (regardless of the actual AS numbers) is eligible for multi-path.
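
The criteria above can be read as a filter applied to each candidate path. The following Python sketch is a simplified illustration, not the SR OS decision process: the Path fields and the select_multipaths function are hypothetical, criterion 2 is reduced to comparing Local Preference and AS path length, and the EBGP/IBGP-specific max-paths overrides are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    family: str      # for example "ipv4" or "label-ipv4"
    local_pref: int
    as_path: tuple   # AS numbers, neighbor AS first
    next_hop: str    # BGP next hop
    nh_cost: int     # cost to reach the BGP next hop


def select_multipaths(best, others, ecmp=64, max_paths=64, unequal_cost=False,
                      same_neighbor_as=False, exact_as_path=False):
    """Return the best path plus the paths eligible to share load with it."""
    limit = min(ecmp, max_paths)
    chosen = [best]
    for p in others:
        if len(chosen) >= limit:
            break                                    # criteria 5 and 6
        if p.family != best.family:
            continue                                 # criterion 1
        if (p.local_pref, len(p.as_path)) != (best.local_pref, len(best.as_path)):
            continue                                 # criterion 2 (simplified)
        if not unequal_cost and p.nh_cost != best.nh_cost:
            continue                                 # criterion 3
        if any(p.next_hop == c.next_hop for c in chosen):
            continue                                 # criterion 4
        if same_neighbor_as and p.as_path[:1] != best.as_path[:1]:
            continue                                 # criterion 7
        if exact_as_path and p.as_path != best.as_path:
            continue                                 # criterion 8
        chosen.append(p)
    return chosen


best = Path("ipv4", 100, (65001, 65010), "192.0.2.1", 10)
a = Path("ipv4", 100, (65002, 65010), "192.0.2.2", 10)  # different neighbor AS
b = Path("ipv4", 100, (65001, 65010), "192.0.2.1", 10)  # same BGP next hop
c = Path("ipv4", 100, (65001, 65020), "192.0.2.3", 20)  # higher next-hop cost
print([p.next_hop for p in select_multipaths(best, [a, b, c])])
```

With the default options only path a joins the best path; setting max_paths to 1 disables multi-path, as described in criterion 6.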

SR OS also supports IBGP multi-path. In some topologies, a BGP next hop is resolved by an IP route that has multiple ECMP next hops. When ibgp-multipath is not configured, only one of the ECMP next hops is programmed as the next hop of the BGP route in the IOM. When ibgp-multipath is configured, the IOM attempts to use all the ECMP next hops of the resolving route in the forwarding state. Although the name of the ibgp-multipath command implies that it is specific to IBGP-learned routes, this is not the case. It also applies to routes learned from any multi-hop BGP session including routes learned from multi-hop EBGP peers.

Be aware that multi-path and ibgp-multipath are not mutually exclusive and work together. The multi-path command enables ECMP load-sharing across different BGP next hops (corresponding to different BGP routes), while the ibgp-multipath command enables ECMP load-sharing across the next hops of the IP routes that resolve the BGP next hops.

Finally, ibgp-multipath does not control traffic load sharing toward a BGP next hop that is resolved by a tunnel, as when dealing with BGP shortcuts or labeled routes (VPN-IP, label-IPv4, or label-IPv6). When a BGP next hop is resolved by a tunnel that supports ECMP, the load-sharing of traffic across the ECMP next hops of the tunnel is automatic.

SR OS supports direct resolution of a BGP next hop to multiple RSVP-TE or SR-TE tunnels. In addition, a BGP next hop can be resolved by multiple LDP ECMP next hops that each correspond to a separate LDP-over-RSVP or LDP-over-SRTE tunnel. It is also possible for a BGP next hop to be resolved by an IGP shortcut route that has multiple RSVP-TE or SR-TE tunnels as its ECMP next hops.

5.6.2.3. Weighted ECMP for BGP Routes

In some cases, the ECMP BGP next-hops of an IP route correspond to paths with very different bandwidths and it makes sense for the ECMP load-balancing algorithm to distribute traffic across the BGP next-hops in proportion to their relative bandwidths. The bandwidth associated with a path can be signaled to other BGP routers by including a link-bandwidth extended community in the BGP route. The link-bandwidth extended community is optional and non-transitive and encodes an autonomous system (AS) number and a bandwidth.

The SR OS implementation supports the link-bandwidth extended community in routes associated with the following address families: IPv4, IPv6, label-IPv4, label-IPv6, VPN-IPv4, and VPN-IPv6. The router automatically performs weighted ECMP for an IP BGP route when all of the ECMP BGP next-hops of the route include a link-bandwidth extended community. The relative weight of traffic sent to each BGP next-hop is visible in the output of the show router route-table extensive and show router fib extensive commands.
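
As an illustration of the proportional split, the Python sketch below derives a traffic share per BGP next hop from the link-bandwidth values. It assumes integer bandwidth values in a common unit and falls back to equal shares when any next hop lacks a bandwidth, mirroring the rule that weighted ECMP applies only when all ECMP BGP next hops carry the community. The function name and data layout are hypothetical.

```python
from fractions import Fraction


def ecmp_weights(next_hops):
    """Map each BGP next hop to its share of traffic.

    next_hops maps a next-hop address to its link-bandwidth value (integer),
    or to None when the route carries no link-bandwidth extended community.
    """
    if not next_hops:
        return {}
    if any(not bw for bw in next_hops.values()):
        # missing (or zero) bandwidth on any next hop: plain ECMP
        equal = Fraction(1, len(next_hops))
        return {nh: equal for nh in next_hops}
    total = sum(next_hops.values())
    return {nh: Fraction(bw, total) for nh, bw in next_hops.items()}


weights = ecmp_weights({"192.0.2.1": 10_000, "192.0.2.2": 40_000})
print(weights["192.0.2.1"], weights["192.0.2.2"])
```

In this example the second next hop advertises four times the bandwidth of the first, so it would receive four fifths of the traffic.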

A route with a link-bandwidth extended community can be received from any IBGP peer. If such a route is received from an EBGP peer, the link-bandwidth extended community is stripped from the route unless an accept-from-ebgp command applies to that EBGP peer. However, a link-bandwidth extended community can be added to routes received from a directly connected (single hop) EBGP peer, potentially replacing the received Extended Community. This is accomplished using the add-to-received-ebgp command, which is available in group and neighbor configuration contexts.

When a route with a link-bandwidth extended community is advertised to an EBGP peer, the link-bandwidth extended community is removed by default. However, transitivity across an AS boundary can be allowed by configuring the send-to-ebgp command.

When a route with a link-bandwidth extended community is advertised to a peer using next-hop-self, the Extended Community is usually removed if it was not added locally (that is, by policy or add-to-received-ebgp command). However, in the special case that a route is readvertised (with next-hop-self) toward a peer covered by the scope of an aggregate-used-paths command, and the re-advertising router has installed multiple ECMP paths toward the destination each associated with a link-bandwidth extended community, the route is readvertised with a link-bandwidth extended community encoding the total bandwidth of all the used multi-paths.

The link-bandwidth extended community associated with a BGP route can be displayed using the show router bgp routes command. For the bandwidth value, the system automatically converts the binary value in the extended community to a decimal number in units of Mb/s (1 000 000 b/s).
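
The conversion to Mb/s can be sketched as follows. The text only states that the binary value is converted to a decimal number; the layout used here (sub-type 0x04, a 2-byte AS number, and a 4-byte IEEE floating-point bandwidth in bytes per second) is an assumption taken from the IETF link-bandwidth extended community draft, and the function name is hypothetical.

```python
import struct


def decode_link_bandwidth(ext_comm: bytes):
    """Return (as_number, bandwidth in Mb/s) from an 8-byte extended community."""
    if len(ext_comm) != 8:
        raise ValueError("extended communities are 8 bytes long")
    # assumed layout: type, sub-type, 2-byte AS, 4-byte float (bytes per second)
    _type, subtype, asn, bytes_per_sec = struct.unpack("!BBHf", ext_comm)
    if subtype != 0x04:
        raise ValueError("not a link-bandwidth extended community")
    return asn, bytes_per_sec * 8 / 1_000_000   # bytes/s to Mb/s


# 1.25e9 bytes per second corresponds to a 10 Gb/s link
raw = struct.pack("!BBHf", 0x40, 0x04, 65001, 1.25e9)
asn, mbps = decode_link_bandwidth(raw)
print(asn, mbps)
```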

Weighted ECMP across the BGP next-hops of an IP BGP route is supported in combination with ECMP at the level of the route or tunnel that resolves one or more of the ECMP BGP next-hops. This ECMP at the resolving level can also be weighted ECMP when the following conditions all apply:

  1. The BGP next-hop is resolved by an IP route (OSPF, IS-IS, or static) with MPLS LSP ECMP next-hops
  2. ibgp-multipath is configured under BGP
  3. config>router>weighted-ecmp is configured

5.6.2.4. BGP Route Installation in the Tunnel Table

Received label-unicast routes can be installed by BGP as tunnels in the tunnel table. In SR OS, the tunnel table is used to resolve a BGP next-hop to a tunnel when required by the configuration or the type of route (see Next-Hop Resolution). BGP tunnels play a key role in the following solutions:

  1. inter-AS model C
  2. Carrier Supporting Carrier (CSC)
  3. seamless MPLS

BGP tunnels have a preference of 10 in the tunnel table, compared to 9 for LDP tunnels and 7 for RSVP tunnels. If the router configuration allows all types of tunnels to resolve a BGP next-hop, an RSVP LSP is preferred over an LDP tunnel, and an LDP tunnel is preferred over a BGP tunnel.
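
The preference rule amounts to choosing the allowed tunnel type with the numerically lowest preference. A minimal Python sketch, using the preference values quoted above and a hypothetical pick_tunnel helper:

```python
TUNNEL_PREFERENCE = {"rsvp": 7, "ldp": 9, "bgp": 10}  # values from the text


def pick_tunnel(available, allowed=("rsvp", "ldp", "bgp")):
    """Return the preferred tunnel type among those available and allowed."""
    candidates = [t for t in available if t in allowed]
    if not candidates:
        return None
    return min(candidates, key=TUNNEL_PREFERENCE.__getitem__)


print(pick_tunnel(["bgp", "ldp", "rsvp"]))  # all tunnel types available
print(pick_tunnel(["bgp", "ldp"]))          # no RSVP LSP to the next hop
```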

Further details about BGP-LU tunnels, which depend on the address family, are described below.

5.6.2.4.1. Label-IPv4 Tunnels

A label-IPv4 route is automatically added as a BGP tunnel entry to the tunnel table if all of the following conditions are met:

  1. the label-IPv4 route is the best BGP path for the /32 IPv4 prefix
  2. the label-IPv4 route has the numerically lowest preference value among all routes (regardless of the protocol) for the /32 IPv4 prefix
  3. the disable-route-table-install command does not apply to the route
  4. the selective-label-ipv4-install command does not prevent the installation of the route

If multipath and ECMP are configured so that they apply to label IPv4 routes, then a BGP tunnel can be installed in the tunnel table with multiple ECMP next-hops, each one corresponding to a path through a different BGP next-hop. The multipath selection process outlined in BGP Route Installation in the Route Table also applies to this case.

5.6.2.4.2. Label-IPv6 Tunnels

A label-IPv6 route is automatically added as a BGP tunnel entry to the tunnel table if all of the following conditions are met:

  1. the best BGP path for the /128 IPv6 prefix is a label-IPv6 route (AFI 2, SAFI 4)
  2. the label-IPv6 route has the numerically lowest preference value among all routes (regardless of protocol) for the /128 IPv6 prefix
  3. the disable-route-table-install command does not apply to the route
  4. the disable-explicit-null command is configured

If multipath and ECMP are configured so that they apply to label IPv6 routes, a BGP tunnel can be installed in the tunnel table with multiple ECMP next-hops, each one corresponding to a path through a different BGP next-hop. However, when disable-explicit-null is configured, the label-IPv6 routes used for ECMP toward an IPv6 destination cannot be a mix of routes with regular label values and routes with special (IPv6 explicit null) label values.

5.6.2.5. BGP Fast Reroute

BGP fast reroute is a feature that brings together indirection techniques in the forwarding plane and pre-computation of BGP backup paths in the control plane to support fast reroute of BGP traffic around unreachable/failed BGP next-hops. BGP fast reroute is supported with IPv4, label-IPv4, IPv6, label-IPv6, VPN-IPv4 and VPN-IPv6 routes. The scenarios supported by the base router BGP context are outlined in Table 8.

Refer to the VPRN section of the 7450 ESS, 7750 SR, 7950 XRS, and VSR Layer 3 Services Guide: IES and VPRN for more information about BGP fast reroute specific to IP VPNs.

Table 8:  BGP Fast Reroute Scenarios (Base Context) 

| Ingress Packet | Primary Route | Backup Route | Prefix Independent Convergence |
|----------------|---------------|--------------|--------------------------------|
| IPv4 | IPv4 route with next-hop A resolved by an IPv4 route or any shortcut tunnel | IPv4 route with next-hop B resolved by an IPv4 route or any shortcut tunnel | Yes |
| IPv4 | Label-IPv4 route with next-hop A resolved by any transport tunnel | Label-IPv4 route with next-hop B resolved by any transport tunnel | Yes, but if the label-IPv4 routes are label-per-prefix, the ingress card must be IOM3 or better for PIC |
| IPv4 | Label-IPv4 route with next-hop A resolved by a local route | Label-IPv4 route with next-hop B resolved by a local route | Yes, but if the label-IPv4 routes are label-per-prefix, the ingress card must be IOM3 or better for PIC |
| IPv4 | Label-IPv4 route with next-hop A resolved by a static route | Label-IPv4 route with next-hop B resolved by a static route | Yes, but if the label-IPv4 routes are label-per-prefix, the ingress card must be IOM3 or better for PIC |
| IPv6 | IPv6 route with next-hop A resolved by an IPv6 route | IPv6 route with next-hop B resolved by an IPv6 route | Yes |
| IPv6 | Label-IPv6 route with next-hop A resolved by any transport tunnel | Label-IPv6 route with next-hop B resolved by any transport tunnel | Yes, but if the label-IPv6 routes are label-per-prefix, the ingress card must be IOM3 or better for PIC |
| IPv6 | Label-IPv6 route with next-hop A resolved by a local route | Label-IPv6 route with next-hop B resolved by a local route | Yes, but if the label-IPv6 routes are label-per-prefix, the ingress card must be IOM3 or better for PIC |
| IPv6 | Label-IPv6 route with next-hop A resolved by a static route | Label-IPv6 route with next-hop B resolved by a static route | Yes, but if the label-IPv6 routes are label-per-prefix, the ingress card must be IOM3 or better for PIC |

5.6.2.5.1. Calculating Backup Paths

In SR OS, fast reroute is optional and must be enabled by using either the BGP backup-path command or the route-policy install-backup-path command. Typically, only one approach is used.

The backup-path command in the base router context is used to control fast reroute on a per-RIB basis (IPv4, label-IPv4, IPv6, and label-IPv6). When the command specifies a particular family, BGP attempts to find a backup path for every prefix learned by the associated BGP RIB.

The install-backup-path command, available in route-policy-action contexts, marks a BGP route as requesting a backup path. It only takes effect in BGP import and VRF import policies. If only some prefixes should have backup paths, then the backup-path command should not be used, and instead the install-backup-path command should be used to mark only those prefixes that require extra protection.

In general, a prefix supports ECMP paths or a backup path, but not both. The backup path is the best path after the primary path and any paths with the same BGP next-hop as the primary path have been removed.
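
The backup path rule can be expressed compactly. The sketch below assumes the paths for a prefix have already been ranked by the BGP decision process (best first) and are represented as hypothetical (name, BGP next hop) pairs; it simply returns the first path whose next hop differs from that of the primary.

```python
def backup_path(ranked_paths):
    """ranked_paths: list of (name, bgp_next_hop) tuples, best path first."""
    if not ranked_paths:
        return None
    primary_nh = ranked_paths[0][1]
    for path in ranked_paths[1:]:
        if path[1] != primary_nh:   # skip paths sharing the primary's next hop
            return path
    return None                      # no usable backup exists


paths = [("p1", "192.0.2.1"), ("p2", "192.0.2.1"), ("p3", "192.0.2.2")]
print(backup_path(paths))
```

Path p2 is skipped because it would fail together with the primary path if next hop 192.0.2.1 became unreachable.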

5.6.2.5.2. Failure Detection and Switchover to the Backup Path

When BGP fast reroute is enabled, the IOM reroutes traffic onto a backup path based on input from BGP. When BGP decides that a primary path is no longer usable, it notifies the IOM and affected traffic is immediately switched to the backup path.

The following events trigger failure notifications to the IOM and reroute of traffic to backup paths:

  1. Peer IP address unreachable and peer-tracking is enabled
  2. BFD session associated with BGP peer goes down
  3. BGP session terminated with peer (for example, send/receive NOTIFICATION)
  4. There is no longer any route (allowed by the next-hop resolution policy, if configured) that can resolve the BGP next-hop address
  5. The LDP tunnel that resolves the next-hop goes down. This could happen because there is no longer any IP route that can resolve the FEC, or the LDP session goes down, or the LDP peer withdraws its label mapping.
  6. The RSVP tunnel that resolves the next-hop goes down. This could happen because a ResvTear message is received, or the RESV state times out, or the outgoing interface fails and is not protected by FRR or a secondary path.
  7. The BGP tunnel that resolves the next-hop goes down. This could happen because the BGP label-IPv4 route is withdrawn by the peer or else becomes invalid due to an unresolved next-hop.

5.6.2.6. QoS Policy Propagation via BGP (QPPB)

QPPB is a feature that allows a QoS treatment (forwarding class and optionally priority) to be associated with a BGP IPv4, label-IPv4, IPv6, or label-IPv6 route that is installed in the routing table. This is done so that when traffic arrives on a QPPB-enabled IP interface and its source or destination IP address matches a BGP route with QoS information, the packet is handled according to the QoS of the matching route. SR OS supports QPPB on the following types of interfaces:

  1. Base router network interfaces
  2. IES and VPRN SAP interfaces
  3. IES and VPRN spoke SDP interfaces
  4. IES and VPRN subscriber interfaces

QPPB is enabled on an interface using the qos-route-lookup command. There are separate commands for IPv4 and IPv6, so QPPB can be enabled in one mode (source, destination, or none) for IPv4 packets arriving on the interface and in a different mode (source, destination, or none) for IPv6 packets arriving on the interface.

Note:

Source-based QPPB is not supported on subscriber interfaces.

Different BGP routes for the same IP prefix can be associated with different QPPB information. If these BGP routes are combined in support of ECMP or BGP fast reroute, then the QPPB information becomes next-hop specific. This means that in destination QPPB mode the QoS assigned to a packet depends on the BGP next-hop that is selected for that particular packet by the ECMP hash or fast reroute algorithm. In source QPPB mode the QoS assigned to a packet comes from the first BGP next-hop of the IP route matching the source address.

5.6.2.7. BGP Policy Accounting and Policing

Policy accounting is a feature that allows “classes” to be associated with certain IPv4 or IPv6 routes, static or BGP learned, when they are installed in the routing table. This is done for the following reasons:

  1. To collect “per-interface, per-class” traffic statistics on policy accounting-enabled interfaces of the router. This is supported by all FP2 and later generation cards and systems.
  2. To implement “per-interface, per-FP, per-class” traffic policing on policy accounting-enabled interfaces of the router. This is only supported for destination classes and only by FP4 cards and systems. The rate limit is applied per-interface, per-FP, per-class, when the IP interface is a distributed interface such as R-VPLS, LAG, or spoke SDP, that spans multiple complexes. Otherwise, for a simple interface, the rate limit is applied per-interface per-class.

For both applications the following IP interface types are supported:

  1. Base router network interfaces
  2. IES and VPRN SAP interfaces
  3. IES and VPRN spoke SDP interfaces
  4. IES and VPRN subscriber interfaces (with some limitations)
  5. IES and VPRN R-VPLS interfaces

Policy accounting, and policing (if needed and supported), is enabled on an interface using the policy-accounting command. The name of a policy accounting template must be specified as an argument of this command. SR OS supports up to 1024 different templates. Each policy accounting template can have a list of source classes (up to 255), a list of destination classes (up to 255), and a list of policers (up to 63). Each source class, destination class, and policer, in their respective list, has an index number. Source class indices and destination class indices have a global meaning. In other words, destination-class index 5 in one template refers to the same set of routes as destination-class index 5 in another policy accounting template. Policer indices have a local scope to the enclosing template. In one template, destination-class index 5 could use policer index 2 and in another template destination-class index 5 could use policer index 62. If a destination class has an associated policer then incoming traffic on each IP interface on which the template is applied is rate-limited based on that policer if the destination IP address matches a route with that destination class.

Policy accounting templates containing one or more source class identifiers cannot be applied to subscriber interfaces.

The policy accounting template tells the IOM the number of statistics and policer resources to use for each interface. These resources are derived from two pools that are sized per-FP. The first pool consists of policer statistics indices. Every policy-accounting interface on a card or FP uses one of these resources for every source and destination class index listed in the template referenced by the interface. These are basic resources needed for statistics collection. The total reservation at the FP level is set using the configure card slot-number fp fp-number policy-accounting command.

The second pool (FP4 cards only) consists of policer index resources. Every policy-accounting interface on a complex uses one of these resources for every destination class associated with a policer in the template referenced by the interface. The total reservation of this second resource at the FP level is set using the configure card slot-number fp fp-number ingress policy-accounting policers command.

The total number of the above two resources, per FP, must be less than or equal to 128000. In addition, the second resource pool size must be less than or equal to the size of the first resource pool.
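
The two sizing constraints can be checked with a few lines of Python. This is only a sketch of the stated rules; the function and parameter names are hypothetical and do not correspond to CLI keywords.

```python
MAX_TOTAL_PER_FP = 128_000  # limit on the sum of both pools, per FP


def check_fp_reservation(stats_indices, policer_indices=0):
    """Return a list of violated constraints (empty when the sizes are valid)."""
    problems = []
    if stats_indices + policer_indices > MAX_TOTAL_PER_FP:
        problems.append("total exceeds 128000 per FP")
    if policer_indices > stats_indices:
        problems.append("policer pool larger than statistics pool")
    return problems


print(check_fp_reservation(100_000, 28_000))  # both constraints satisfied
print(check_fp_reservation(100_000, 30_000))  # total is 130000
```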

It is possible to increase or decrease the size of either resource sub-pool at any time. A decrease can cause some interfaces (randomly selected) to immediately lose their resources and stop counting or policing some traffic that was previously being counted or policed.

If policy accounting is enabled on a spoke SDP or R-VPLS interface, all FPs in the system should have a reservation for each of the above resources; otherwise, the show router interface policy-accounting command output reports that the statistics are possibly incomplete.

Through route policy or configuration mechanisms, a BGP or static route for an IP prefix can have a source class index (1 to 255), a destination class index (1 to 255) or both. When an ingress packet on a policy accounting-enabled interface [I1] is forwarded by the IOM and its destination address matches a BGP or static route with a destination class index [D], and [D] is listed in the relevant policy accounting template, then the packets-forwarded and IP-bytes-forwarded counters for [D] on interface [I1] are incremented accordingly. If [D] is also associated with a policer (FP4 only) the packet is also subjected to rate limiting as discussed above. The policer statistics displayed by the show router interface policy-accounting command include Layer 2 encapsulation and are therefore different from the destination-class byte-level statistics.

When an ingress packet on a policy accounting-enabled interface [I2] is forwarded by the IOM and its source address matches a BGP or static route with a source class index [S], and [S] is listed in the relevant policy accounting template, the packets-forwarded and IP-bytes-forwarded counters for [S] on interface [I2] are incremented accordingly. Policing based on the source class is unsupported.

It is possible that different BGP or static routes for the same IP prefix (through different next hops) are associated with different class information. If these routes are combined in support of ECMP or fast reroute then the destination class of a packet depends on the next hop that is selected for that particular packet by the ECMP hash or fast reroute algorithm. If the source address of a packet matches a route with multiple next hops its source class is derived from the first next hop of the matching route.

5.6.2.8. Route Flap Damping (RFD)

Route flap damping is a mechanism supported by 7450, 7750, and 7950 routers, as well as other BGP routers, that was designed to help improve the stability of Internet routing by mitigating the impact of route flaps. Route flaps describe a situation where a router alternately advertises a route as reachable and then unreachable, or as reachable through one path and then another path, in rapid succession. Route flaps can result from hardware errors, software errors, configuration errors, unreliable links, and so on. However, not all perceived route flaps represent a true problem; when a best path is withdrawn, the next-best path may not be immediately known and may trigger a number of intermediate best path selections (and corresponding advertisements) before it is found. These intermediate best path selections may travel at different speeds through different routers due to the effect of the min-route-advertisement interval (MRAI) and other factors. RFD does not handle this type of situation particularly well and, for this and other reasons, many Internet service providers do not use RFD.

In SR OS route flap damping is configurable; by default, it is disabled. It can be enabled on EBGP and confed-EBGP sessions by including the damping command in their group or neighbor configuration. The damping command has no effect on IBGP sessions. When a route of any type (any AFI/SAFI) is received on a non-IBGP session that has damping enabled:

  1. If the route changes from reachable to unreachable due to a withdrawal by the peer then damping history is created for the route (if it does not already exist) and in that history the Figure of Merit (FOM), an accumulated penalty value, is incremented by 1024.
  2. If a reachable route is updated by the peer with new path attribute values then the FOM is incremented by 1024.
  3. In SR OS the FOM has a hard upper limit of 21540 (not configurable).
  4. The FOM value is decayed exponentially as described in RFC 2439. The half-life of the decay is 15 minutes by default, however a BGP import policy can be used to apply a non-default damping profile to the route, and the half-life in the non-default damping profile can have any value between 1 and 45 minutes.
  5. The FOM value at the last time of update can be displayed using the show router bgp damping detail command. The time of last update can be up to 640 seconds ago; SR OS does not calculate the current FOM every time the show command is entered.
  6. When the FOM reaches the suppress limit, which is 3000 by default, but can be changed to any value between 1 and 20000 in a non-default damping profile, the route is suppressed, meaning it is not used locally and not advertised to peers. The route remains suppressed until either the FOM exponentially decays to a value less than or equal to the reuse threshold or the max-suppress time is reached. By default, the reuse threshold is 750 and the max-suppress time is 60 minutes, but these can be changed in a non-default damping profile: reuse can have a value between 1 and 20000 and max-suppress can have a value between 1 and 720 minutes.
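
The interaction of penalty, decay, and thresholds can be made concrete with a small calculation. The Python sketch below uses the default values quoted above and the standard half-life form of exponential decay, FOM(t) = FOM(0) x 2^(-t/half-life). The function names are hypothetical, and the sketch ignores the periodic (rather than continuous) FOM update mentioned in item 5.

```python
import math

PENALTY = 1024        # added per withdrawal or attribute change
FOM_CEILING = 21540   # hard upper limit in SR OS
HALF_LIFE_MIN = 15    # default half-life, in minutes
SUPPRESS = 3000       # default suppress limit
REUSE = 750           # default reuse threshold
MAX_SUPPRESS_MIN = 60 # default max-suppress time, in minutes


def decay(fom, minutes, half_life=HALF_LIFE_MIN):
    """Exponentially decay a Figure of Merit over the given time."""
    return fom * 0.5 ** (minutes / half_life)


def flap(fom):
    """Apply one flap penalty, respecting the ceiling."""
    return min(fom + PENALTY, FOM_CEILING)


def minutes_until_reuse(fom, half_life=HALF_LIFE_MIN, reuse=REUSE,
                        max_suppress=MAX_SUPPRESS_MIN):
    """Time until a suppressed route becomes usable again."""
    if fom <= reuse:
        return 0.0
    return min(half_life * math.log2(fom / reuse), max_suppress)


fom = 0.0
for _ in range(3):   # three flaps in quick succession, no decay in between
    fom = flap(fom)
print(fom >= SUPPRESS, round(minutes_until_reuse(fom), 1))  # prints: True 30.5
```

With the defaults, three back-to-back flaps raise the FOM to 3072, which exceeds the suppress limit, and the route stays suppressed for roughly two half-lives until the FOM decays to 750.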

5.6.3. RIB-OUT Features

SR OS implements the following features related to RIB-OUT processing:

  1. BGP export policies
  2. Outbound route filtering (ORF)
  3. RT constrained route distribution
  4. Configurable min-route-advertisement (MRAI)
  5. Advertise-inactive
  6. Best-external
  7. Add-path
  8. Split-horizon

These features are discussed in the following sections.

5.6.3.1. BGP Export Policies

The export command is used to apply one or more policies (up to 15) to a neighbor, group or to the entire BGP context. The export command that is most-specific to a peer is the one that is applied. An export policy command applied at the neighbor level takes precedence over the same command applied at the group or global level. An export policy command applied at the group level takes precedence over the same command specified on the global level. The export policies applied at different levels are not cumulative. The policies listed in an export command are evaluated in the order in which they are specified.

Note:

The export command can reference a policy before it has been created (as a policy-statement).
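
Because export policies at different levels are not cumulative, the policy list that applies to a peer is simply the one configured at the most specific level. A minimal Python sketch of that selection, with a hypothetical effective_export helper and made-up policy names:

```python
def effective_export(neighbor=None, group=None, bgp=None):
    """Return the export policy list that applies to a peer."""
    for policies in (neighbor, group, bgp):   # most specific level first
        if policies:
            if len(policies) > 15:
                raise ValueError("at most 15 policies per export command")
            return list(policies)             # evaluated in the listed order
    return []


# The group-level list replaces, rather than extends, the global list.
print(effective_export(group=["to-customers"], bgp=["default-out", "no-bogons"]))
```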

The most common uses for BGP export policies are as follows:

  1. To locally originate a BGP route by exporting (or redistributing) a non-BGP route that is installed in the route table and actively used for forwarding. The non-BGP route is most frequently a direct, static or aggregate route (exporting IGP routes into BGP is generally not recommended).
  2. To block the advertisement of certain BGP routes towards specific BGP peers. The routes may be blocked on the basis of IP prefix, communities, and so on.
  3. To modify the attributes of BGP routes advertised to specific BGP peers. The following path attribute modifications are possible using BGP export policies:
    1. Change the ORIGIN value.
    2. Add a sequence of AS numbers to the start of the AS_PATH. When a route is advertised to an EBGP peer the addition of the local-AS/global-AS numbers to the AS_PATH is always the final step (done after export policy).
    3. Replace the AS_PATH with a new AS_PATH. When a route is advertised to an EBGP peer the addition of the local-AS/global-AS numbers to the AS_PATH is always the final step (done after export policy).
    4. Prepend an AS number multiple times to the start of the AS_PATH. When a route is advertised to an EBGP peer the addition of the local-AS/global-AS numbers to the AS_PATH is always the final step (done after export policy). The add/replace action on the AS_PATH supersedes the prepend action if both are specified in the same policy entry.
    5. Change the NEXT_HOP to a specific IP address. When a route is advertised to an EBGP peer the next-hop cannot be changed from the local-address.
    6. Change the NEXT_HOP to the local-address used with the peer (next-hop-self).
    7. Add a value to the MED. If the MED attribute does not exist it is added.
    8. Subtract a value from the MED. If the MED attribute does not exist it is added with a value of 0. If the result of the subtraction is a negative number the MED metric is set to 0.
    9. Set the MED to a particular value.
    10. Set the MED to the cost of the IP route (or tunnel) used to resolve the BGP next-hop.
    11. Set LOCAL_PREF to a particular value when advertising to an IBGP peer.
    12. Add, remove or replace standard communities.
    13. Add, remove or replace extended communities.
    14. Add a static value to the AIGP metric when advertising the route to an AIGP-enabled peer with a modified BGP next-hop. The static value is incremental to the automatic adjustment of the LOC-RIB AIGP metric to reflect the distance between the local router and the received BGP next-hop.
    15. Increment the AIGP metric by a fixed amount when advertising the route to an AIGP-enabled peer with a modified BGP next-hop. The static value is a substitute for the dynamic value of the distance between the local router and the received BGP next-hop.
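
Two of the modifications above lend themselves to a short worked example: the MED arithmetic and the ordering of AS_PATH prepending relative to the addition of the local AS. The Python sketch below is illustrative only. The function names are hypothetical, a missing MED is treated as 0 before an add action (an assumption; the text states this explicitly only for subtraction), and a single local AS number stands in for the local-AS/global-AS handling.

```python
def adjust_med(med, add=0, subtract=0):
    """Return the MED after an add or subtract action; med may be None."""
    if add:
        return (med or 0) + add
    if subtract:
        return max((med or 0) - subtract, 0)   # never below 0
    return med


def advertise_as_path(as_path, local_as, prepend_as=None, times=0, ebgp=True):
    """Apply the export-policy prepend first, then the local AS for EBGP."""
    path = list(as_path)
    if prepend_as is not None:
        path = [prepend_as] * times + path     # export policy action
    if ebgp:
        path = [local_as] + path               # always the final step
    return path


print(adjust_med(10, subtract=25))             # negative result is set to 0
print(advertise_as_path([65010], 65000, prepend_as=65000, times=2))
```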

5.6.3.2. Outbound Route Filtering (ORF)

Outbound route filtering (ORF) is a mechanism that allows one router, the ORF-sending router, to signal to a peer, the ORF-receiving router, a set of route filtering rules (ORF entries) that the ORF-receiving router should apply to its route advertisements towards the ORF-sending router. The ORF entries are encoded in Route Refresh messages.

The use of ORF on a session must be negotiated — that is, both routers must advertise the ORF capability in their Open messages. The ORF capability describes the address families that support ORF, and for each address family, the ORF types that are supported and the ability to send/receive each type. 7450, 7750, and 7950 routers support ORF type 3, which is ORF based on Extended Communities. It is supported for only the following address families:

  1. VPN-IPv4
  2. VPN-IPv6
  3. MVPN-IPv4
  4. MVPN-IPv6

In SR OS the send/receive capability for ORF type 3 is configurable (with the send-orf and accept-orf commands) but the setting applies to all supported address families.

SR OS support for ORF type 3 allows a PE router that imports VPN routes with a particular set of Route Target Extended Communities to indicate to a peer (for example a route reflector) that it only wants to receive VPN routes that contain one or more of these Extended Communities. When the PE router wants to inform its peer about a new RT Extended Community it sends a Route Refresh message to the peer containing an ORF type 3 entry instructing the peer to add a permit entry for the 8-byte extended community value. When the PE router wants to inform its peer about a RT Extended Community that is no longer needed it sends a Route Refresh message to the peer containing an ORF type 3 entry instructing the peer to remove the permit entry for the 8-byte extended community value.

In SR OS the type-3 ORF entries that are sent to a peer can be generated dynamically (if no Route Target Extended Communities are specified with the send-orf command) or else specified statically. Dynamically generated ORF entries are based on the route targets that are imported by all locally-configured VPRNs.

A router that has installed ORF entries received from a peer can still apply BGP export policies to the session. If the evaluation of a BGP export policy results in a reject action for a VPN route that matches a permit ORF entry the route is not advertised — i.e. the export policy has the final word.
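
The interaction between installed ORF entries and export policy can be sketched in Python as follows. This is an illustrative model only, not the SR OS implementation: the class and function names are hypothetical, Route Targets are modeled as 8-byte values in the two-octet-AS format, and the behavior of a session with no installed entries is not modeled.

```python
import struct
from typing import Iterable


def route_target(asn: int, value: int) -> bytes:
    """Encode a two-octet-AS Route Target as an 8-byte extended community."""
    return struct.pack("!BBHI", 0x00, 0x02, asn, value)


class OrfType3Filter:
    """Permit entries installed from Route Refresh messages sent by the ORF-sending peer."""

    def __init__(self) -> None:
        self.permit = set()

    def add_permit(self, rt: bytes) -> None:
        self.permit.add(rt)

    def remove_permit(self, rt: bytes) -> None:
        self.permit.discard(rt)

    def allows(self, route_rts: Iterable[bytes]) -> bool:
        # A VPN route passes if it carries at least one permitted Route Target.
        return any(rt in self.permit for rt in route_rts)


def should_advertise(orf: OrfType3Filter, route_rts: Iterable[bytes],
                     export_policy_accepts: bool) -> bool:
    """A route must pass the ORF entries and must not be rejected by export policy."""
    return orf.allows(route_rts) and export_policy_accepts
```

A route that matches a permit entry is still withheld when `export_policy_accepts` is false, which mirrors the rule that the export policy has the final word.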

Note:

The SR OS implementation of ORF filtering is very efficient. It takes less time to filter a large number of VPN routes with ORF than it does to reject non-matching VPN routes using a conventional BGP export policy.

Despite the advantages of ORF compared to manually configured BGP export policies a better technology, when it comes to dynamic filtering based on Route Target Extended Communities, is RT Constraint. RT Constraint is discussed further in the next section.

5.6.3.3. RT Constrained Route Distribution

RT constrained route distribution, or RT-constrain for short, is a mechanism that allows a router to advertise to certain peers a special type of MP-BGP route called an RTC route; the associated AFI is 1 and the SAFI is 132. The NLRI of an RTC route encodes an Origin AS and a Route Target Extended Community with prefix-type encoding (that is, there is a prefix-length and the “host” bits after the prefix-length are set to zero). A peer receiving RTC routes does not advertise VPN routes to the RTC-sending router unless they contain a Route Target Extended Community that matches one of the received RTC routes. As with any other type of BGP route RTC routes are propagated loop-free throughout and between Autonomous Systems. If there are multiple RTC routes for the same NLRI the BGP decision process selects one as the best path. The propagation of the best path installs RIB-OUT filter rules as it travels from one router to the next and this process creates an optimal VPN route distribution tree rooted at the source of the RTC route.

Note:

RT-constrain and Extended Community-based ORF are similar to the extent that they both allow a router to signal to a peer the Route Target Extended Communities they want to receive in VPN routes from that peer. But RT-constrain has distinct advantages over Extended Community-based ORF: it is more widely supported, it is simpler to configure, and its distribution scope is not limited to a direct peer.

In SR OS the capability to exchange RTC routes is advertised when the route-target keyword is added to the relevant family command. RT-constrain is supported on EBGP and IBGP sessions of the base router instance. On any particular session either ORF or RT-constrain may be used but not both; if RT-constrain is configured the ORF capability is not announced to the peer.

When RT-constrain has been negotiated with one or more peers SR OS automatically originates and advertises to these peers one /96 RTC route (the origin AS and Route Target Extended Community are fully specified) for every route target imported by a locally-configured VPRN or BGP-based L2 VPN; this includes MVPN-specific route targets.

SR OS also supports a group/neighbor level default-route-target command that causes routers to generate and send a 0:0:0/0 default RTC route to one or more peers. Sending the default RTC route to a peer conveys a request to receive all VPN routes from that peer. The default-route-target command is typically configured on sessions that a route reflector has with its PE clients. A received default RTC route is never propagated to other routers.

The advertisement of RTC routes by a route reflector follows special rules that are described in RFC 4684. These rules are needed to ensure that RTC routes for the same NLRI that are originated by different PE routers in the same Autonomous System are properly distributed within the AS.

When a BGP session comes up, and RT-constrain is enabled on the session (both peers advertised the MP-BGP capability), routers delay sending any VPN-IPv4 and VPN-IPv6 routes until either the session has been up for 60 seconds or the End-of-RIB marker is received for the RT-constrain address family. When the VPN-IPv4 and VPN-IPv6 routes are sent they are filtered to include only those with a Route Target Extended Community that matches an RTC route from the peer. VPN-IP routes matching an RTC route originated in the local AS are advertised to any IBGP peer that advertises a valid path for the RTC NLRI — in other words, route distribution is not constrained to only the IBGP peer advertising the best path. On the other hand, VPN-IP routes matching an RTC route originated outside the local AS are only advertised to the EBGP or IBGP peer that advertises the best path.
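
The matching of VPN routes against received RTC routes can be illustrated with a short Python sketch. It assumes the RFC 4684 NLRI layout (a 4-byte origin AS followed by an 8-byte Route Target, with a prefix length counted in bits across both fields) and compares only the Route Target portion. The function names are hypothetical and the sketch does not model best-path selection or the IBGP/EBGP distribution rules described above.

```python
import struct
from typing import Iterable, Tuple


def route_target(asn: int, value: int) -> bytes:
    """Encode a two-octet-AS Route Target as an 8-byte extended community."""
    return struct.pack("!BBHI", 0x00, 0x02, asn, value)


def rt_matches_rtc(route_rt: bytes, rtc_prefix_len: int, rtc_rt: bytes) -> bool:
    """Return True if an 8-byte Route Target is covered by an RTC prefix.

    rtc_prefix_len is the NLRI prefix length in bits: 0 for the default RTC
    route, otherwise 32..96 (32 bits of origin AS plus up to 64 bits of RT).
    """
    if rtc_prefix_len == 0:
        return True  # default RTC route: request for all VPN routes
    if not 32 <= rtc_prefix_len <= 96:
        raise ValueError("RTC prefix length must be 0 or in the range 32..96")
    rt_bits = rtc_prefix_len - 32
    shift = 64 - rt_bits
    # Compare only the leading rt_bits of the two Route Targets.
    return (int.from_bytes(route_rt, "big") >> shift) == (int.from_bytes(rtc_rt, "big") >> shift)


def vpn_route_allowed(route_rts: Iterable[bytes],
                      received_rtc: Iterable[Tuple[int, bytes]]) -> bool:
    """A VPN route is sent only if one of its RTs matches a received RTC route."""
    rtc_list = list(received_rtc)
    return any(rt_matches_rtc(rt, plen, rtc_rt)
               for rt in route_rts for plen, rtc_rt in rtc_list)
```

A /96 entry matches exactly one Route Target, while a zero-length entry (the default RTC route) matches every VPN route.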

Note:

SR OS does not support an equivalent of BGP-Multipath for RT-Constrain routes. There is no way to distribute VPN routes across more than one ‘almost’ equal set of inter-AS paths.

5.6.3.4. Min Route Advertisement Interval (MRAI)

According to the BGP standard (RFC 4271), a BGP router should not send updated reachability information for an NLRI to a BGP peer until a certain period of time (Min Route Advertisement Interval) has elapsed since the last update. The RFC suggests the MRAI should be configurable per peer but does not propose a specific algorithm, and therefore, MRAI implementation details vary from one router operating system to another.

In SR OS, the MRAI is configurable, on a per-session basis, using the min-route-advertisement command. The min-route-advertisement command can be configured with any value between 1 and 255 seconds and the setting applies to all address families. The default value is 30 seconds, regardless of the session type (EBGP or IBGP). The MRAI timer is started at the configured value when the session is established and counts down continuously, resetting to the configured value whenever it reaches zero. Every time it reaches zero, all pending RIB-OUT routes are sent to the peer.

To send UPDATE messages that advertise new NLRI reachability information more frequently for some address families than others, SR OS offers a rapid-update command that overrides the remaining time on a peer's MRAI timer and immediately sends routes belonging to specified address families (and all other pending updates) to the peers receiving these routes. The address families that can be configured with rapid-update support are:

  1. EVPN
  2. L2-VPN
  3. Label-IPv4
  4. Label-IPv6
  5. MCAST-VPN-IPv4
  6. MCAST-VPN-IPv6
  7. MDT-SAFI
  8. MVPN-IPv4
  9. MVPN-IPv6
  10. VPN-IPv4
  11. VPN-IPv6

In many cases, the default MRAI is appropriate for all address families (or at least those not included in the preceding list) when it applies to UPDATE messages that advertise reachable NLRI, but it is not the best option for UPDATE messages that advertise unreachable NLRI (route withdrawals). Fast re-convergence after some types of failures requires route withdrawals to propagate to other routers as quickly as possible so that they can calculate and start using new best paths, which would be impeded by the effect of the MRAI timer at each router hop. This is facilitated by the rapid-withdrawal configuration command.

When rapid-withdrawal is configured, UPDATE messages containing withdrawn NLRI are sent immediately to a peer without waiting for the MRAI timer to expire. UPDATE messages containing reachable NLRI continue to wait for the MRAI timer to expire, or for a rapid-update trigger, if it applies. When rapid-withdrawal is enabled, it applies to all address families.
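
The timing effect of the MRAI timer, rapid-update and rapid-withdrawal can be approximated with the following Python sketch. It models the timer as free-running from session establishment, as described above. The function name and parameters are hypothetical, and the treatment of a change that arrives exactly at an expiry (it waits for the next one) is an assumption.

```python
from typing import FrozenSet, Optional


def update_send_time(change_time: float, session_up: float, mrai: float, *,
                     is_withdrawal: bool = False,
                     rapid_withdrawal: bool = False,
                     family: Optional[str] = None,
                     rapid_update_families: FrozenSet[str] = frozenset()) -> float:
    """Return the time at which a pending RIB-OUT change is sent to the peer."""
    if is_withdrawal and rapid_withdrawal:
        return change_time  # withdrawals bypass the MRAI timer
    if not is_withdrawal and family in rapid_update_families:
        return change_time  # rapid-update overrides the remaining MRAI time
    # Otherwise wait for the next expiry of the free-running timer.
    elapsed = change_time - session_up
    return session_up + (elapsed // mrai + 1) * mrai
```

With the default 30-second MRAI, a reachable-NLRI change made 45 seconds after the session came up is sent at 60 seconds, while the same change for a family configured with rapid-update is sent at 45 seconds.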

When there is a change to a labeled-unicast route that requires reprogramming of the label operations in the dataplane, these IOM updates are not made until the changed route is advertised to a peer, which depends on MRAI. Lowering the MRAI value or using rapid-update improves the speed of this operation.

5.6.3.5. Advertise-Inactive

BGP does not allow a route to be advertised unless it is the best path in the RIB and an export policy allows the advertisement.

In some cases, it may be useful to advertise the best BGP path to peers despite the fact that it is inactive, for example, because there are one or more preferred non-BGP routes to the same destination and one of these other routes is the active route. One way SR OS supports this flexibility is using the advertise-inactive command; other methods include Best-External and Add-Paths.

When the BGP advertise-inactive command is configured so that it applies to a BGP session it has the following effect on the IPv4, IPv6, mcast-ipv4, mcast-ipv6, label-IPv4 and label-IPv6 routes advertised to that peer:

  1. If the active route for the IP prefix is a BGP route then that route is advertised.
  2. If the active route for the IP prefix is a non-BGP route and there is at least one valid but inactive BGP route for the same destination then the best of the inactive and valid BGP routes is advertised unless the non-BGP active route is matched and accepted by an export policy applied to the session.
  3. If the active route for the IP prefix is a non-BGP route and there are no (valid) BGP routes for the same destination then no route is advertised for the prefix unless the non-BGP active route is matched and accepted by an export policy applied to the session.
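
The advertise-inactive rules above can be expressed as a small decision function. The following Python sketch is illustrative only: the Route type and function name are hypothetical, the list of BGP candidates is assumed to be ordered best-first by the BGP decision process, and export policy evaluation of the BGP routes themselves is not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Route:
    prefix: str
    protocol: str  # "bgp", "static", "ospf", ...
    valid: bool = True


def route_to_advertise(active: Route, bgp_candidates: Sequence[Route],
                       export_accepts_active: bool) -> Optional[Route]:
    """Pick the route advertised for a prefix when advertise-inactive applies."""
    if active.protocol == "bgp":
        return active  # the active BGP route is advertised
    if export_accepts_active:
        return active  # non-BGP active route exported by policy takes precedence
    for route in bgp_candidates:  # assumed best-first
        if route.valid:
            return route  # best valid but inactive BGP route
    return None  # nothing is advertised for this prefix
```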

5.6.3.6. Best-External

Best-External is a BGP enhancement that allows a BGP speaker to advertise to its IBGP peers its best “external” route for a prefix/NLRI when its best overall route for the prefix/NLRI is an “internal” route. This is not possible in a normal BGP configuration because the base BGP specification prevents a BGP speaker from advertising a non-best route for a destination.

In certain topologies Best-External can improve convergence times, reduce route oscillation and allow better loadsharing. This is achieved because routers internal to the AS have knowledge of more exit paths from the AS. Enabling Add-Paths on border routers of the AS can achieve a similar result but Add-Paths introduces NLRI format changes that must be supported by BGP peers of the border router and therefore has more interoperability constraints than Best-External (which requires no messaging changes).

Best-External is supported in the base router BGP context. (A related feature is also supported in VPRNs; consult the Services Guide for more details.) It is configured using the advertise-external command, which provides IPv4, label-IPv4, IPv6, and label-IPv6 as options.

The advertisement rules when advertise-external is enabled can be summarized as follows:

  1. If a router has advertise-external enabled and its best overall route is a route from an IBGP peer then this best route is advertised to EBGP and confed-EBGP peers, and the “best external” route is advertised to IBGP peers. The “best external” route is the one found by running the BGP path selection algorithm on all LOC-RIB paths except for those learned from the IBGP peers.
    Note:

    A route reflector with advertise-external enabled does not include IBGP routes learned from other clusters in its definition of ‘external’.

  2. If a router has advertise-external enabled and its best overall route is a route from an EBGP peer then this best route is advertised to EBGP, confed-EBGP, and IBGP peers.
  3. If a router has advertise-external enabled and its best overall route is a route from a confed-EBGP peer in member AS X then this best route is advertised to EBGP, IBGP peers and confed-EBGP peers in all member AS except X and the “best external” route is advertised to confed-EBGP peers in member AS X. In this case the “best external” route is the one found by running the BGP path selection algorithm on all RIB-IN paths except for those learned from member AS X.
    Note:

    If the best-external route is not the best overall route it is not installed in the forwarding table and in some cases this can lead to a short-duration traffic loop after failure of the overall best path.
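
The three advertisement rules can be summarized as a lookup on the source of the best overall route and the type of peer receiving the advertisement. The Python sketch below is a simplified illustration; the function name and string labels are hypothetical and the route reflector case from the note above is not modeled.

```python
from typing import Optional


def route_for_peer(best_source: str, peer_type: str,
                   best_member_as: Optional[int] = None,
                   peer_member_as: Optional[int] = None) -> str:
    """Return which route a peer receives when advertise-external is enabled.

    best_source and peer_type are one of "ibgp", "ebgp" or "confed-ebgp".
    The member AS arguments only matter in the confed-EBGP case.
    """
    if best_source == "ibgp":
        # IBGP peers get the best external route; everyone else gets the best route.
        return "best-external" if peer_type == "ibgp" else "best-overall"
    if best_source == "ebgp":
        return "best-overall"
    if best_source == "confed-ebgp":
        same_member_as = peer_type == "confed-ebgp" and peer_member_as == best_member_as
        return "best-external" if same_member_as else "best-overall"
    raise ValueError(f"unknown route source: {best_source}")
```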

5.6.3.7. Add-Paths

Add-Paths is a BGP enhancement that allows a BGP router to advertise multiple distinct paths for the same prefix/NLRI. Add-Paths provides a number of potential benefits, including reduced routing churn, faster convergence, and better loadsharing.

For a router to receive multiple paths per NLRI from a peer, for a particular address family, the peer must announce the BGP capability to send multiple paths for the address family and the local router must announce the BGP capability to receive multiple paths for the address family. When the Add-Path capability has been negotiated this way, all advertisements and withdrawals of NLRI by the peer must include a path identifier. The path identifier has no significance to the receiving router. If the combination of NLRI and path identifier in an advertisement from a peer is unique (does not match an existing route in the RIB-IN from that peer) then the route is added to the RIB-IN. If the combination of NLRI and path identifier in a received advertisement is the same as an existing route in the RIB-IN from the peer then the new route replaces the existing one. If the combination of NLRI and path identifier in a received withdrawal matches an existing route in the RIB-IN from the peer, then that route is removed from the RIB-IN.
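
The RIB-IN handling described above amounts to keying each route on the combination of NLRI and path identifier. A minimal Python sketch, with hypothetical class and method names and path attributes modeled as a plain dictionary:

```python
from typing import Dict, List, Tuple


class AddPathsRibIn:
    """Per-peer RIB-IN keyed on (NLRI, path identifier)."""

    def __init__(self) -> None:
        self._routes: Dict[Tuple[str, int], dict] = {}

    def advertise(self, nlri: str, path_id: int, attrs: dict) -> str:
        """Add a new path, or replace the existing path with the same key."""
        key = (nlri, path_id)
        outcome = "replaced" if key in self._routes else "added"
        self._routes[key] = attrs
        return outcome

    def withdraw(self, nlri: str, path_id: int) -> bool:
        """Remove the path with this key; return False if there was none."""
        return self._routes.pop((nlri, path_id), None) is not None

    def paths(self, nlri: str) -> List[Tuple[int, dict]]:
        """Return all paths currently held for an NLRI, ordered by path identifier."""
        return sorted(((pid, attrs) for (n, pid), attrs in self._routes.items() if n == nlri),
                      key=lambda item: item[0])
```

The path identifier is used purely as part of the key; as noted above, its value carries no meaning for the receiving router.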

An UPDATE message carrying an IPv4 NLRI with a path identifier is shown in Figure 26.

Figure 26:  BGP Update Message with Path Identifier for IPv4 NLRI 

Add-Paths is only supported by the base router BGP instance and the EBGP and IBGP sessions it forms with other peers capable of Add-Paths. The ability to send and receive multiple paths per prefix is configurable per family, with the supported options being:

  1. IPv4
  2. Label-IPv4
  3. VPN-IPv4
  4. IPv6
  5. Label-IPv6
  6. VPN-IPv6
  7. MCAST-VPN-IPv4
  8. MCAST-VPN-IPv6
  9. MVPN-IPv4
  10. MVPN-IPv6

5.6.3.7.1. Path Selection with Add-Paths

The local RIB may have multiple paths for a prefix. The path selection mode refers to the algorithm used to decide which paths to advertise to an add-paths peer. SR OS supports a send N path selection algorithm (refer to draft-ietf-idr-add-paths-guidelines) and a send multipaths selection algorithm.

The send N algorithm selects the N best advertisable paths that meet these constraints:

  1. The BGP next-hop of the route is unique.
  2. The BGP route is not rejected by an export policy.
  3. The BGP route is not blocked by a split-horizon rule.
  4. The number of advertised paths does not exceed N. N is derived from the send-limit of the best BGP RIB-IN as applied by BGP import policy action or the configuration of the add-paths command that applies to the neighbor and the address family.

The send multipaths algorithm selects the N best advertisable paths that meet these constraints:

  1. The BGP route is a multipath (in other words, it is tied with the best path up to and including the NH-cost comparison step of the decision process, skipping steps that do not apply).
  2. The BGP next-hop of the route is unique.
  3. The BGP route is not rejected by an export policy.
  4. The BGP route is not blocked by a split-horizon rule.
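
Both selection modes can be sketched as a filter over the candidate paths for a prefix. The Python example below is illustrative: the Path type and function names are hypothetical, the input is assumed to be ordered best-first by the BGP decision process, and the policy, split-horizon and multipath results are supplied as precomputed flags rather than evaluated.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Path:
    next_hop: str
    rejected_by_export: bool = False
    blocked_by_split_horizon: bool = False
    is_multipath: bool = False  # tied with the best path up to the NH-cost step


def select_send_n(paths_best_first: Sequence[Path], n: int) -> List[Path]:
    """Pick up to n advertisable paths, each with a unique BGP next-hop."""
    selected: List[Path] = []
    seen_next_hops = set()
    for path in paths_best_first:
        if len(selected) >= n:
            break
        if path.rejected_by_export or path.blocked_by_split_horizon:
            continue
        if path.next_hop in seen_next_hops:
            continue
        seen_next_hops.add(path.next_hop)
        selected.append(path)
    return selected


def select_send_multipaths(paths_best_first: Sequence[Path],
                           n: Optional[int] = None) -> List[Path]:
    """Apply the same constraints, restricted to paths that qualify as multipaths."""
    multipaths = [p for p in paths_best_first if p.is_multipath]
    limit = len(multipaths) if n is None else n
    return select_send_n(multipaths, limit)
```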

5.6.3.8. Split-Horizon

Split-horizon refers to the action taken by a router to avoid advertising a route back to the peer from which it was received. By default, SR OS applies split-horizon behavior only to routes received from IBGP non-client peers, and only to non-imported routes within a RIB. This split-horizon functionality, which can never be disabled, prevents a route learned from a non-client IBGP peer from being advertised to the sending peer or any other non-client peer.

To apply split-horizon behavior to routes learned from RR clients, confed-EBGP peers or (non-confed) EBGP peers the split-horizon command must be configured in the appropriate contexts; it is supported at the global BGP, group and neighbor levels. When split-horizon is enabled on these types of sessions, it only prevents the advertisement of a route back to its originating peer; for example, SR OS does not prevent the advertisement of a route learned from one EBGP peer back to a different EBGP peer in the same neighbor AS.

5.7. BGP Monitoring Protocol

The BGP Monitoring Protocol (BMP) provides a way for a monitoring station to obtain route updates and statistics from a BGP router. The BMP protocol is described in detail in RFC 7854, BGP Monitoring Protocol (BMP). A router communicates information about one or more BGP sessions to a BMP station. Specifically, BMP allows a BGP router to advertise the pre-policy or post-policy BGP RIB-In from specific BGP peers to a monitoring station. This allows the monitoring station to monitor the routing table size, identify issues, and monitor trends in the table size and in the frequency of updates and withdrawals. The BMP station is also sometimes called a BMP collector. A router sends information in BMP messages to a BMP station.

BMP is a unidirectional protocol. A BMP station never sends back any messages to a router.

BMP allows a router to report different types of information.

  1. A router can send BMP messages with notifications when neighbors go into or out of the established mode (for example, when the peer goes up or down). These notifications are called BMP peer-up and peer-down messages.
  2. A router can periodically send statistical information about one or more neighbors. This information consists of a number of counters, for example, the number of routes received from a particular neighbor or the number of rejected or accepted routes due to ingress policy parameters.
    Other counters report the number of errors that were encountered, for example, AS-path loops, duplicate prefixes, or withdrawals received.
  3. A router can also report the exact routes received from a particular neighbor. This action is called route monitoring. A router encapsulates a BGP route into the original BGP update message, then encapsulates that BGP update message in a BMP route monitoring message.

BMP on an SR OS router reports information about routes that were received from a neighbor. SR OS cannot report routes that were sent to a neighbor.

When periodic statistics are enabled, the router sends all the statistics as described in RFC 7854, section 4.8, with the exception of statistic number 13, “Number of duplicate update messages received”. The supported statistics are listed in Table 9.

Table 9:  Supported Statistics

Statistic                                                           Type
Number of Prefixes rejected by inbound policy                       0
Number of duplicate prefix advertisements received                  1
Number of duplicate withdraws received                              2
Number of invalidated prefixes due to Cluster_List loop detection   3
Number of invalidated prefixes due to AS_PATH loop detection        4
Number of invalidated prefixes due to Originator ID validation      5
Number of invalidated prefixes due to AS-Confed loop detection      6
Total number of routes in adj-rib-in (all families)                 7
Total number of routes in Local-RIB (all families)                  8
Number of routes per address-family in adj-rib-in                   9
Number of routes per address-family in loc-rib                      10
Number of updates subjected to treat-as-withdraw                    11
Number of prefixes subjected to treat-as-withdraw                   12

Note:

Statistics 9 and 10 are per address family. The address family is specified as an AFI/SAFI pair. Regardless of which families are configured for route-monitoring, a router reports the statistics of all address families that were negotiated with the neighbor.

The values in these counters are the same values that can be seen in the show>router>bgp>neighbor ip-address [detail] command in the CLI.
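
For readers building a BMP station, the statistics carried in a BMP statistics report can be decoded with a short parser. The Python sketch below assumes the TLV layout of RFC 7854, section 4.8 (a 2-byte type, a 2-byte length, then the value, with types 9 and 10 carrying a 2-byte AFI and 1-byte SAFI ahead of a 64-bit gauge). The function name is hypothetical, and the input is assumed to be the bytes that follow the Stats Count field.

```python
import struct
from typing import List, Optional, Tuple

PER_FAMILY_STATS = (9, 10)  # per AFI/SAFI gauges

Stat = Tuple[int, Optional[Tuple[int, int]], int]


def parse_stat_tlvs(data: bytes, count: int) -> List[Stat]:
    """Decode `count` statistic TLVs into (type, (afi, safi) or None, value) tuples."""
    stats: List[Stat] = []
    offset = 0
    for _ in range(count):
        if offset + 4 > len(data):
            raise ValueError("truncated statistic header")
        stat_type, stat_len = struct.unpack_from("!HH", data, offset)
        offset += 4
        value = data[offset:offset + stat_len]
        if len(value) != stat_len:
            raise ValueError("truncated statistic value")
        offset += stat_len
        if stat_type in PER_FAMILY_STATS and stat_len == 11:
            afi, safi, gauge = struct.unpack("!HBQ", value)
            stats.append((stat_type, (afi, safi), gauge))
        else:
            # 32-bit counters and 64-bit gauges are both unsigned big-endian integers.
            stats.append((stat_type, None, int.from_bytes(value, "big")))
    return stats
```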

5.8. BGP Applications

SR OS implements the following BGP applications:

5.8.1. BGP FlowSpec

FlowSpec is a standardized method for using BGP to distribute traffic flow specifications (flow routes) throughout a network. A flow route carries a description of a flow in terms of packet header fields such as source IP address, destination IP address, or TCP/UDP port number and indicates (through a community attribute) an action to take on packets matching the flow. The primary application for FlowSpec is DDoS mitigation.

FlowSpec is supported for both IPv4 and IPv6. To exchange IPv4 FlowSpec routes with a BGP peer the flow-ipv4 keyword must be part of the family command that applies to the session and to exchange IPv6 FlowSpec routes with a BGP peer flow-ipv6 must be present in the family configuration.

The NLRI of an IPv4 flow route can contain one or more of the subcomponents shown in Table 10.

Table 10:  Subcomponents of IPv4 Flow Route NLRI

Subcomponent Name [Type]      Value Encoding                          SR OS Support
Destination IPv4 Prefix [1]   Prefix length, prefix                   Yes
Source IPv4 Prefix [2]        Prefix length, prefix                   Yes
IP Protocol [3]               One or more (operator, value) pairs     Partial. No support for multiple values other than “TCP or UDP”.
Port [4] (Note 1)             One or more (operator, value) pairs     Yes
Destination Port [5]          One or more (operator, value) pairs     Yes
Source Port [6]               One or more (operator, value) pairs     Yes
ICMP Type [7]                 One or more (operator, value) pairs     Partial. Only a single value is supported.
ICMP Code [8]                 One or more (operator, value) pairs     Partial. Only a single value is supported.
TCP Flags [9] (Note 2)        One or more (operator, bitmask) pairs   Yes
Packet Length [10]            One or more (operator, value) pairs     Yes
DSCP [11]                     One or more (operator, value) pairs     Partial. Only a single value is supported.
Fragment [12]                 One or more (operator, bitmask) pairs   Partial. No support for matching DF bit, first-fragment or last-fragment.

    Notes:

  1. The Port [4] subcomponent specifies both source and destination ports.
  2. The following restrictions apply.
    1. FP4-based platforms support multiple (operator, bitmask) pairs, provided a single TCP flag bit is matched in each bitmask pair and the match bit is set to 0, resulting in an AND operation between the TCP flags.
    2. Multiple TCP flags can be set in the same (operator, bitmask) pair, provided there is a single pair in the NLRI component with the match bit set to 1 and the not bit set to 0.
    3. FP2- and FP3-based platforms support SYN and ACK only.
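
The match and not bits mentioned in the TCP flags restrictions belong to the bitmask operator byte defined in RFC 5575. The following Python sketch decodes that byte and evaluates a single (operator, bitmask) pair against a packet's TCP flags. It illustrates the encoding only and does not reflect which combinations a given FP generation accepts; the function names are hypothetical.

```python
def decode_bitmask_op(op: int) -> dict:
    """Split an RFC 5575 bitmask operator byte into its fields."""
    return {
        "end_of_list": bool(op & 0x80),
        "and": bool(op & 0x40),
        "value_len": 1 << ((op >> 4) & 0x3),  # length of the bitmask in bytes
        "not": bool(op & 0x02),
        "match": bool(op & 0x01),
    }


def bitmask_term_matches(flags: int, op: int, mask: int) -> bool:
    """Evaluate one (operator, bitmask) pair against the packet's flag bits."""
    fields = decode_bitmask_op(op)
    if fields["match"]:
        result = (flags & mask) == mask  # every bit in the mask must be set
    else:
        result = (flags & mask) != 0  # any bit in the mask may be set
    return not result if fields["not"] else result
```

For example, with SYN encoded as 0x02 and ACK as 0x10, a pair with the match bit set and a bitmask of 0x12 matches only packets that carry both flags.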

The NLRI of an IPv6 flow route can contain one or more of the subcomponents shown in Table 11.

Table 11:  Subcomponents of IPv6 Flow Route NLRI

Subcomponent Name [Type]      Value Encoding                          SR OS Support
Destination IPv6 Prefix [1]   Prefix length, prefix offset, prefix    Partial. No support for prefix offset.
Source IPv6 Prefix [2]        Prefix length, prefix offset, prefix    Partial. No support for prefix offset.
Next Header [3]               One or more (operator, value) pairs     Partial. Only a single value is supported.
Port [4] (Note 1)             One or more (operator, value) pairs     Yes
Destination Port [5]          One or more (operator, value) pairs     Yes
Source Port [6]               One or more (operator, value) pairs     Yes
ICMP Type [7]                 One or more (operator, value) pairs     Partial. Only a single value is supported.
ICMP Code [8]                 One or more (operator, value) pairs     Partial. Only a single value is supported.
TCP Flags [9]                 One or more (operator, bitmask) pairs   Partial. Only SYN and ACK flags can be matched.
Packet Length [10]            One or more (operator, value) pairs     Yes
Traffic Class [11]            One or more (operator, value) pairs     Partial. Only a single value is supported.
Fragment [12]                 One or more (operator, bitmask) pairs   Partial. No support for matching Last Fragment.
Flow Label [13]               One or more (operator, value) pairs     Partial. Only a single value is supported.

    Note:

  1. The Port [4] subcomponent specifies both source and destination ports.

Table 12 summarizes the actions that may be associated with IPv4 flow-spec routes. Table 13 summarizes the actions that may be associated with IPv6 flow-spec routes.

Table 12:  IPv4 FlowSpec Actions

Action              Encoding                               SR OS Support
rate limit          Extended community type 0x8006         Yes
sample/log          Extended community type 0x8007, S-bit  Yes
next entry          Extended community type 0x8007, T-bit
Redirect to VRF     Extended community type 0x8008         Yes
Mark traffic class  Extended community type 0x8009         Yes
Redirect to IPv4    Extended community type 0x010c         Yes
Redirect to IPv6    Extended community type 0x000c
Redirect to LSP     Extended community type 0x0900         Partial, only support for ID-type 0x00 (localized ID)

Table 13:  IPv6 FlowSpec Actions

Action              Encoding                               SR OS Support
rate limit          Extended community type 0x8006         Yes
sample/log          Extended community type 0x8007, S-bit  Yes
next entry          Extended community type 0x8007, T-bit
Redirect to VRF     Extended community type 0x8008         Yes
Mark traffic class  Extended community type 0x8009         Yes
Redirect to IPv4    Extended community type 0x010c
Redirect to IPv6    Extended community type 0x000c         Yes
Redirect to LSP     Extended community type 0x0900         Partial, only support for ID-type 0x00 (localized ID)

5.8.1.1. Validating Received FlowSpec Routes

Received FlowSpec-IPv4 and FlowSpec-IPv6 routes are validated following the procedures documented in RFC 5575 and draft-ietf-idr-bgp-flowspec-oid-03, Revised Validation Procedure for BGP Flow Specifications. Configure the validate-dest-prefix command in a routing instance for the validation checks based on destination prefix to be applied. By default, no checking is done. When the command is enabled, BGP determines whether a FlowSpec route is valid or invalid based on the following logic:

  1. If the FlowSpec route was originated in the same autonomous system (AS) as the receiving BGP router then it is automatically valid.
  2. If rule 1 does not apply and the FlowSpec route was originated in an external AS, and it does not contain a destination prefix subcomponent, then it is considered valid.
  3. If rule 1 does not apply and the FlowSpec route was originated in an external AS, and it does contain a destination prefix subcomponent, then it is considered valid if all of the following are true.
    1. The neighbor AS (last non-confed AS in the AS_PATH) of the FlowSpec route matches the neighbor AS of the unicast IP route that is the best match of the destination prefix. The best match unicast IP route must be a BGP route (that is, not static, IGP, or other routes).
    2. The neighbor AS of the FlowSpec route matches the neighbor AS of all unicast IP routes that are longer matches of the destination prefix. All longer match unicast IP routes must be BGP routes (that is, not static, IGP, or other routes).
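
The validation logic above can be sketched in Python using the standard ipaddress module. This is an illustration under stated assumptions, not the SR OS implementation: the UnicastRoute type and function name are hypothetical, the unicast table is a flat list, and a FlowSpec route whose destination prefix has no covering unicast route is treated as invalid.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class UnicastRoute:
    prefix: str
    protocol: str  # "bgp", "static", "ospf", ...
    neighbor_as: Optional[int] = None


def flowspec_dest_prefix_valid(local_as: int, origin_as: int, neighbor_as: int,
                               dest_prefix: Optional[str],
                               unicast_routes: Sequence[UnicastRoute]) -> bool:
    """Apply the validate-dest-prefix rules to one received FlowSpec route."""
    if origin_as == local_as:
        return True  # rule 1: originated in the local AS
    if dest_prefix is None:
        return True  # rule 2: no destination prefix subcomponent
    dest = ipaddress.ip_network(dest_prefix)
    table = [(ipaddress.ip_network(r.prefix), r) for r in unicast_routes]
    table = [(net, r) for net, r in table if net.version == dest.version]
    covering = [(net, r) for net, r in table if dest.subnet_of(net)]
    if not covering:
        return False  # assumption: no best-match unicast route means invalid
    best = max(covering, key=lambda item: item[0].prefixlen)[1]
    longer = [r for net, r in table if net != dest and net.subnet_of(dest)]
    # Rule 3: the best match and all longer matches must be BGP routes
    # with the same neighbor AS as the FlowSpec route.
    return all(r.protocol == "bgp" and r.neighbor_as == neighbor_as
               for r in [best, *longer])
```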

FlowSpec-IPv4 routes that are received with a redirect-to-IPv4 extended community action are also subject to a further set of validation checks. If the validate-redirect-ip command is enabled in the receiving BGP instance, then a FlowSpec-IPv4 route is considered invalid if it is deemed to have originated in a different AS than the IP route that resolves the redirection IPv4 address. The originating AS of a FlowSpec route is determined from its AS path.

A FlowSpec route that is determined to be invalid by any of the validation rules described earlier is retained in the BGP RIB, but not used for traffic filtering and not propagated to other BGP speakers.

FlowSpec routes received with a redirect-to-IPv4 or redirect-to-IPv6 extended community action are also subject to a further set of validation checks. If the config>router>bgp>flowspec>validate-redirect-ip command is enabled in the receiving BGP instance, then a FlowSpec route is considered invalid if it is deemed to have originated in a different AS than the IP route that resolves the redirection address. The originating AS of a FlowSpec route is determined from its AS path.

5.8.1.2. Using Flow Routes to Create Dynamic Filter Entries

When the base router BGP instance receives an IPv4 or IPv6 flow route and that route is valid/best, the system attempts to construct an IPv4 or IPv6 filter entry from the NLRI contents and the actions encoded in the UPDATE message. If successful, the filter entry is added to the system-created “fSpec-0” IPv4 embedded filter or to the “fSpec-0” IPv6 embedded filter. These embedded filters can be inserted into configured IPv4 and IPv6 filter policies that are applied to ingress traffic on a selected set of the base router IP interfaces. These interfaces can include network interfaces, IES SAP interfaces, and IES spoke SDP interfaces.

Similarly, filter entries can be added to system-created “fSpec-$vprnId” embedded filters for use with VPRN interfaces.

When FlowSpec rules are embedded into a user-defined filter policy, the insertion point of the rules is configurable through the offset parameter of the embed-filter command. The sum of the ip-filter-max-size and offset must not exceed the maximum filter entry-id range.

5.8.2. Configuration of TTL Propagation for BGP Label Routes

This feature allows the separate configuration of TTL propagation for in-transit and CPM-generated IP packets at the ingress LER within a BGP label route context.

5.8.2.1. TTL Propagation for RFC 3107 Label Route at Ingress LER

For IPv4 and IPv6 packets forwarded using an RFC 3107 label route in the global routing instance, including label-IPv6, the following commands specified with the all value enable TTL propagation from the IP header into all labels in the transport label stack:

  1. config router ttl-propagate label-route-local [none | all]
  2. config router ttl-propagate label-route-transit [none | all]

The none value reverts to the default mode which disables TTL propagation from the IP header to the labels in the transport label stack.

These commands do not have a no version.

Note:

  1. The TTL of the IP packet is always propagated into the RFC 3107 label itself. The commands only control the propagation into the transport labels, for example, the labels of the RSVP or LDP LSP which the BGP label route resolves to and which are pushed on top of the BGP label.
  2. If the BGP peer advertised the implicit-null label value for the BGP label route, the TTL propagation does not follow the configuration described, but follows the configuration of the LSP shortcut to which the BGP label route resolves:
    1. RSVP LSP shortcut:
      1. configure router mpls shortcut-transit-ttl-propagate
      2. configure router mpls shortcut-local-ttl-propagate
    2. LDP LSP shortcut:
      1. configure router ldp shortcut-transit-ttl-propagate
      2. configure router ldp shortcut-local-ttl-propagate

This feature does not impact packets forwarded over BGP shortcuts. The ingress LER operates in uniform mode by default and can be changed into pipe mode using the configuration of TTL propagation for RSVP or LDP LSP shortcut.

5.8.2.2. TTL Propagation for RFC 3107 Label Routes at LSR

This feature configures the TTL propagation for transit packets at a router acting as an LSR for a BGP label route.

When an LSR swaps the BGP label for an IPv4 prefix packet, thus acting as an ABR, ASBR, or data-path Route-Reflector (RR) in the base routing instance, or swaps the BGP label for a vpn-IPv4 or vpn-IPv6 prefix packet, thus acting as an inter-AS Option B VPRN ASBR or VPRN data path Route-Reflector (RR), the all value of the following command enables TTL propagation of the decremented TTL of the swapped BGP label into all LDP or RSVP transport labels.

  1. config router ttl-propagate lsr-label-route [none | all]

When an LSR swaps a label or stitches a label, it always writes the decremented TTL value into the outgoing swapped or stitched label. What the above CLI controls is whether this decremented TTL value is also propagated to the transport label stack pushed on top of the swapped or stitched label.

The none value reverts to the default mode, which disables TTL propagation. This changes the existing default behavior, which propagates the TTL to the transport label stack. When a customer upgrades, the new default takes effect. The above command does not have a no version.
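The swap behavior described above can be sketched as a small Python model. This is only an illustration: the function name, its parameters, and the TTL value of 255 written into transport labels when propagation is disabled are assumptions, not values stated in this guide.

```python
def lsr_swap_ttls(incoming_ttl, transport_labels, lsr_label_route="none",
                  unpropagated_ttl=255):
    """Return the TTLs of the outgoing label stack, top label first.

    incoming_ttl      TTL of the BGP label received by the LSR
    transport_labels  number of LDP/RSVP labels pushed on top of the BGP label
    lsr_label_route   "none" or "all", mirroring the CLI values
    unpropagated_ttl  ASSUMED TTL for transport labels when not propagating
    """
    if lsr_label_route not in ("none", "all"):
        raise ValueError("lsr_label_route must be 'none' or 'all'")

    # The swapped (or stitched) label always carries the decremented TTL.
    swapped_ttl = incoming_ttl - 1

    # The CLI only decides what is written into the transport labels.
    if lsr_label_route == "all":
        transport_ttl = swapped_ttl
    else:
        transport_ttl = unpropagated_ttl

    return [transport_ttl] * transport_labels + [swapped_ttl]
```

For example, `lsr_swap_ttls(64, 2, "all")` yields the same decremented TTL in all three labels, while the default `"none"` leaves only the bottom (swapped) label at 63.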

The following describes the behavior of LSR TTL propagation in a number of other use cases and indicates if the above CLI command applies or not:

  1. When an LSR stitches an LDP label to a BGP label, the decremented TTL of the stitched label is or is not propagated to the LDP or RSVP transport labels as per the above configuration.
  2. When an LSR stitches a BGP label to an LDP label, the decremented TTL of the stitched label is automatically propagated into the RSVP label if the outgoing LDP LSP is tunneled over RSVP. This behavior is not controlled by the above CLI.
  3. When an LSR pops a BGP label and forwards the packet using an IGP route (IGP route to destination of prefix wins over the BGP label route), it pushes an LDP label on the packet and the TTL behavior is as described in the previous bullet when stitching from a BGP label to an LDP label.
  4. An ingress Carrier Supporting Carrier (CsC) PE swaps the incoming EBGP label into a VPN-IPv4 label, and the egress CsC PE performs the reverse operation. In both cases, the decremented TTL of the swapped label is or is not passed on to the LDP or RSVP transport labels as per the above configuration.

5.8.3. BGP Prefix Origin Validation

BGP prefix origin validation is a solution developed by the IETF SIDR working group for reducing the vulnerability of BGP networks to prefix mis-announcements and certain man-in-the-middle attacks. BGP has traditionally relied on a trust model where it is assumed that when an AS originates a route it has the right to announce the associated prefix. BGP prefix origin validation takes extra steps to ensure that the origin AS of a route is valid for the advertised prefix.

7450, 7750, and 7950 routers support BGP prefix origin validation for IPv4 and IPv6 routes received from selected peers. When prefix origin validation is enabled on a base router BGP or VPRN BGP session using the enable-origin-validation command, every IPv4 and IPv6 route received from the peer is checked to determine whether the origin AS is valid for the received prefix. The origin AS is the first AS that was added to the AS_PATH attribute and indicates the autonomous system that originated the route.

For purposes of determining the origin validation state of received BGP routes, the router maintains an Origin Validation database consisting of static and dynamic entries. Each entry is called a VRP (Validated ROA Payload) and associates a prefix (range) with an origin AS.

Static VRP entries are configured using the static-entry command available in the config>router>origin-validation context of the base router. In SR OS, a static entry can express that a specific prefix and origin AS combination is either valid or invalid.

Dynamic VRP entries are learned from RPKI local cache servers and express valid origin AS and prefix combinations. The router communicates with RPKI local cache servers using the RPKI-RTR protocol. SR OS supports the RPKI-RTR protocol over TCP/IPv4 or TCP/IPv6 transport; TCP-MD5 and other forms of session security are not supported. 7450, 7750, and 7950 routers can set up an RPKI-RTR session using the base routing table (in-band) or the management router (out-of-band). For more information, refer to the origin-validation configuration command and show commands in the Router Configuration Guide.

An RPKI local cache server is one element of the larger RPKI system. The RPKI is a distributed database containing cryptographic objects relating to Internet Number resources. Local cache servers are deployed in the service provider network and retrieve digitally signed Route Origin Authorization (ROA) objects from Global RPKI servers. The local cache servers cryptographically validate the ROAs before passing the information along to the routers.

The algorithm used to determine the origin validation states of routes received over a session with enable-origin-validation configured uses the following definitions:

  1. A route is matched by a VRP entry if all of the following are true:
    1. the prefix bits in the route match the prefix bits in the VRP entry (up to its min prefix length)
    2. the route prefix length is greater than or equal to the VRP entry min prefix length
    3. the route prefix length is less than or equal to the VRP entry max prefix length
    4. the origin AS of the route matches the origin AS of the VRP entry
  2. A route is covered by a VRP entry if all of the following are true:
    1. the prefix bits in the route match the prefix bits in the VRP entry (up to its min prefix length)
    2. the route prefix length is greater than or equal to the VRP entry min prefix length
    3. the VRP entry type is static-valid or dynamic

Using the above definitions, the origin validation state of a route is based on the following rules.

  1. If a route is matched by at least one VRP entry, and the most specific of these matching entries includes a static-invalid entry then the origin validation state is Invalid (2).
  2. If a route is matched by at least one VRP entry, and the most specific of these matching entries does not include a static-invalid entry then the origin validation state is Valid (0).
  3. If a route is not matched by any VRP entry, but it is covered by at least one VRP entry then the origin validation state is Invalid (2).
  4. If a route is not covered by any VRP entry then the origin validation state is Not-Found (1).

Consider the following example. Suppose the Origin Validation database has the following entries:

10.1.0.0/16-32, origin AS=5, dynamic

10.1.1.0/24-32, origin AS=4, dynamic

10.0.0.0/8-32, origin AS=5, static invalid

10.1.1.0/24-32, origin AS=4, static invalid

In this case, the origin validation states of the following routes are as indicated:

10.1.0.0/16 with AS_PATH {…5}: Valid

10.1.1.0/24 with AS_PATH {…4}: Invalid

10.2.0.0/16 with AS_PATH {…5}: Invalid

10.2.0.0/16 with AS_PATH {…6}: Not-Found
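The matched/covered definitions and the four rules above can be sketched in Python and checked against this example. The `Vrp` class, the function name, and the reading of "most specific" as "longest min prefix length" are assumptions made for illustration; this is not the router's implementation.

```python
from dataclasses import dataclass
from ipaddress import ip_network

VALID, NOT_FOUND, INVALID = 0, 1, 2  # state codes as used in the rules above


@dataclass(frozen=True)
class Vrp:
    prefix: str     # e.g. "10.1.0.0/16"; the mask length is the min prefix length
    max_len: int    # max prefix length
    origin_as: int
    kind: str       # "dynamic", "static-valid", or "static-invalid"


def origin_validation_state(route_prefix, origin_as, vrps):
    route = ip_network(route_prefix)
    matched = []     # (min prefix length, VRP) for every matching entry
    covered = False
    for vrp in vrps:
        net = ip_network(vrp.prefix)
        # Prefix bits must agree up to the VRP min length, and the route
        # must be at least that long; subnet_of() checks both conditions.
        if route.version != net.version or not route.subnet_of(net):
            continue
        if vrp.kind in ("static-valid", "dynamic"):
            covered = True
        if route.prefixlen <= vrp.max_len and origin_as == vrp.origin_as:
            matched.append((net.prefixlen, vrp))
    if matched:
        # ASSUMPTION: "most specific" means longest min prefix length.
        longest = max(length for length, _ in matched)
        most_specific = [v for length, v in matched if length == longest]
        if any(v.kind == "static-invalid" for v in most_specific):
            return INVALID
        return VALID
    return INVALID if covered else NOT_FOUND


# The Origin Validation database from the example above.
EXAMPLE_DB = [
    Vrp("10.1.0.0/16", 32, 5, "dynamic"),
    Vrp("10.1.1.0/24", 32, 4, "dynamic"),
    Vrp("10.0.0.0/8", 32, 5, "static-invalid"),
    Vrp("10.1.1.0/24", 32, 4, "static-invalid"),
]
```

Under these assumptions, calling `origin_validation_state("10.2.0.0/16", 6, EXAMPLE_DB)` returns the Not-Found code, because the only entry whose prefix bits agree with the route is a static-invalid entry, which neither matches (wrong origin AS) nor covers.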

The origin validation state of a route can affect its ranking in the BGP decision process. When origin-invalid-unusable is configured, all routes that have an origin validation state of ‘Invalid’ are considered unusable by the best path selection algorithm, that is, they cannot be used for forwarding and cannot be advertised to peers.

If origin-invalid-unusable is not configured then routes with an origin validation state of ‘Invalid’ are compared to other “usable” routes for the same prefix according to the BGP decision process.

When compare-origin-validation-state is configured, a new step is added to the BGP decision process after removal of invalid routes and before the comparison of local preference. The new step compares the origin validation state, so that a route with a ‘Valid’ state is preferred over a route with a ‘Not-Found’ state, and a route with a ‘Not-Found’ state is preferred over a route with an ‘Invalid’ state, assuming that these routes are considered ‘usable’. The new step is skipped if the compare-origin-validation-state command is not configured.
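The effect of the two options on best path selection can be sketched as follows. This is a deliberately reduced model: only the origin validation state and local preference are compared, every later step of the BGP decision process is omitted, and the dictionary keys and function names are hypothetical.

```python
VALID, NOT_FOUND, INVALID = 0, 1, 2  # lower code = more preferred state


def best_route(routes, origin_invalid_unusable=False,
               compare_origin_validation_state=False):
    """Pick the best of several routes for the same prefix.

    Each route is a dict with "ov_state" and "local_pref" keys.
    Returns None when no usable route remains.
    """
    candidates = list(routes)
    if origin_invalid_unusable:
        # Invalid routes are unusable: not eligible for selection at all.
        candidates = [r for r in candidates if r["ov_state"] != INVALID]
    if not candidates:
        return None

    def rank(route):
        # The extra step is skipped unless the command is configured.
        ov = route["ov_state"] if compare_origin_validation_state else 0
        # The origin validation step comes before local preference
        # (higher local preference wins, hence the negation).
        return (ov, -route["local_pref"])

    return min(candidates, key=rank)
```

With an Invalid route at local preference 200 and a Valid route at local preference 100, this sketch returns the Invalid route by default and the Valid route when either option is enabled.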

Route policies can be used to attach an Origin Validation State extended community to a route received from an EBGP peer in order to convey its origin validation state to IBGP peers and save them the effort of repeating the Origin Validation database lookup. To add an Origin Validation State extended community encoding the ‘Valid’ result, the route policy should add a community list that contains a member in the format ext:4300:0. To add an Origin Validation State extended community encoding the ‘Not-Found’ result, the route policy should add a community list that contains a member in the format ext:4300:1. To add an Origin Validation State extended community encoding the ‘Invalid’ result, the route policy should add a community list that contains a member in the format ext:4300:2.

5.8.4. BGP Route Leaking

It is possible to leak a copy of a BGP route (including all its path attributes) from one routing instance RIB to another routing instance RIB of the same type (labeled or unlabeled) in the same router. Leaking is supported from the GRT to a VPRN, from one VPRN to another VPRN, and from a VPRN to the GRT. Any valid BGP route for an IPv4, IPv6, or label-IPv4 prefix can be leaked. A BGP route does not have to be the best path or used for forwarding in the source instance in order to be leaked.

An IPv4, IPv6, or label-IPv4 BGP route becomes a candidate for leaking to another instance when it is specially marked by a BGP import policy. This special marking is achieved by accepting the route with a bgp-leak action in the route policy. Routes that are candidates for leaking to other instances show a leakable flag in the output of various show router BGP commands. In order to copy a leakable BGP route received in a source instance S into the BGP RIB of a target instance T, the target instance must be configured with a leak-import policy that matches and accepts the leakable route. Different leak-import policies can be specified for each of the following RIBs: IPv4, label-IPv4, and IPv6. Up to 15 leak-import policies can be chained together for more complex use cases. The leak-import policies are configured in the config>router>bgp>rib-management>ipv4 context.

Note:

Using a leak-import policy to change the BGP attributes of leaked routes (compared to the original source copy) is not supported. The only attribute that can be changed is the RTM preference.

In the target instance, leaked BGP routes are compared to other (leaked and non-leaked) BGP routes for the same prefix based on the complete BGP decision process. Leaked routes do not have information about the router ID and peer IP address of the original peer and use all-zero values for these properties.

BGP always tries to resolve the BGP next hop of a leaked route using the route and tunnel table of the original (source) routing instance and this resolution information is carried with the leaked route, avoiding the need to leak the resolving routes as well. If there is no resolving route or tunnel in the source instance, then the unresolved route cannot be leaked unless allow-unresolved-leaking is configured and the source routing instance is the GRT. In this case, the importing VPRN tries to resolve the BGP next hop of the leaked route by using its own route table (and according to its own BGP next-hop-resolution configuration options).

If a target instance has BGP multipath and ECMP enabled and some of the equal-cost best paths for a prefix are leaked routes, they can be used along with non-leaked best paths as ECMP next hops of the route.

When BGP fast reroute is enabled in a target instance T (for a particular IP prefix), BGP attempts to find a qualifying backup path by considering both leaked and non-leaked BGP routes. The backup path criteria are unchanged by this feature, that is, the backup path is the best remaining path after the primary paths and all paths with the same BGP next hops as the primary paths have been removed.

A leaked BGP route can be advertised to direct BGP neighbors of the target routing instance. The BGP next hop of a leaked route is automatically reset to itself whenever it is advertised to a peer of the target instance. Normal route advertisement rules apply, meaning that by default, the leaked route is advertised only if (in the target instance) it is the overall best path and is used as the active route to the destination and is not blocked by the IBGP-to-IBGP split-horizon rule.

A BGP route resolved in the source routing instance and leaked into a VPRN can be exported from the VPRN as a VPN-IPv4 or VPN-IPv6 route if it matches the VRF export policy. In this case, normal VPN export rules apply, meaning that by default, the leaked route is exported only if (in the VPRN) it is the overall best path and is used as the active route to the destination.

A BGP route that is unresolved in the GRT, leaked into a VPRN, and resolved by a BGP-VPN route in the VPRN cannot be exported from the VPRN as a VPN-IPv4 or VPN-IPv6 route unless it matches the VRF export policy and the VPRN is configured with the allow-bgp-vpn-export command.

Note:

A leaked route cannot be exported as a VPN-IP route and then re-imported into another local VPRN.

5.8.5. BGP Optimal Route Reflection

BGP Route Reflectors (RRs) are used in networks to improve network scalability by eliminating or reducing the need for a full-mesh of IBGP sessions. When a BGP RR receives multiple paths for the same IP prefix, it typically selects a single best path to send to all clients. If the RR has multiple nearly-equal best paths and the tie-break is determined by the next-hop cost, the RR advertises the path based on its view of next-hop costs. The advertised route may differ from the path that a client would select if it had visibility of the same set of candidate paths and used its own view of next-hop costs.

Non-optimal advertisements by the RR can be a problem in hot-potato routing designs. Hot-potato routing aims to hand off traffic to the next AS using the closest possible exit point from the local AS. In this context, the closest exit point implies minimum IGP cost to reach the BGP next-hop. SR OS implements the hot-potato routing solution described in draft-ietf-idr-bgp-optimal-route-reflection.

Optimal Route Reflection (ORR) is supported in the base router BGP instance only. It applies to routes in the following address families: IPv4 unicast, label-IPv4, label-IPv6 (6PE), VPN-IPv4, and VPN-IPv6.

Note:

For the RR to compare two VPN routes (and therefore for ORR to apply), the routes must contain the same RD and IP prefix information.

ORR locations are created when config>router>bgp>orr>location is configured. The RR can maintain information for a maximum of 255 ORR locations. A primary IPv4 or primary IPv6 address is required for each location; optionally, secondary and tertiary IPv4 and IPv6 addresses can be specified for the location. The IP addresses are used to find a node in the network topology that can serve as the root for SPF calculations. The IP addresses must correspond to loopback or system IP addresses of routers that participate in IGP protocols. The secondary and tertiary IP address parameters provide redundancy in case the node selected to be root for the SPF calculations disappears.

The route reflector's TE database, populated with information from local IGP instances or BGP-LS NLRI, is used to compute the SPF cost from each ORR location to IPv4 and IPv6 BGP next-hops in the candidate set of best paths. The use of BGP-LS allows the route reflector to learn IGP topology information for OSPF areas, IS-IS levels, and others in which the route reflector is not a direct participant.

To configure an ORR client, configure the cluster command for the BGP session to reference one of the defined ORR locations. The association of a client with an ORR location is not automatic. Choose an ORR location as close as possible to the client that is being configured. The allow-local-fallback option of the cluster command affects RR behavior when no BGP routes are reachable from the ORR location of the client. When allow-local-fallback is configured, the RR is allowed, in this circumstance only, to advertise the best reachable BGP path from its own topology location. If allow-local-fallback is not configured and this situation applies, then no route is advertised to the client.
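The following Python sketch illustrates the idea of selecting a path by SPF cost from the client's ORR location rather than from the RR itself, including the allow-local-fallback case. It assumes the candidate paths are tied on every earlier decision step; the graph representation, node names, and function names are hypothetical.

```python
import heapq


def spf_costs(graph, root):
    """Dijkstra: cost from root to every reachable node.

    graph maps node -> {neighbor: metric}.
    """
    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, metric in graph.get(node, {}).items():
            nd = d + metric
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist


def orr_best_next_hop(graph, next_hops, orr_location, rr_node,
                      allow_local_fallback=False):
    """Return the BGP next hop to advertise to a client, or None."""
    costs = spf_costs(graph, orr_location)
    reachable = [nh for nh in next_hops if nh in costs]
    if not reachable and allow_local_fallback:
        # Nothing reachable from the ORR location: use the RR's own view.
        costs = spf_costs(graph, rr_node)
        reachable = [nh for nh in next_hops if nh in costs]
    if not reachable:
        return None  # no route is advertised to the client
    # Lowest cost wins; the name is only a deterministic tie-break here.
    return min(reachable, key=lambda nh: (costs[nh], nh))


# Hypothetical topology: the RR is close to PE1, the client location C is close to PE2.
EXAMPLE_GRAPH = {
    "RR": {"PE1": 10, "PE2": 30},
    "PE1": {"RR": 10, "C": 50},
    "PE2": {"RR": 30, "C": 5},
    "C": {"PE1": 50, "PE2": 5},
}
```

In this topology the RR's own view prefers PE1, whereas using location C as the SPF root selects PE2, which is the exit the client itself would choose.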

Note:

ORR is supported with Add-Paths; Add-Paths advertised to an ORR client are based on ORR location.

5.8.6. LSP Tagging for BGP Next-Hops or Prefixes and BGP-LU

The tunnels used by the system for resolution of BGP next-hops or prefixes and BGP labeled unicast routes can be constrained using LSP administrative tags. Refer to the “LSP Tagging and Auto-Bind Using Tag Information” section of the 7450 ESS, 7750 SR, 7950 XRS, and VSR MPLS Guide for more information.

5.8.7. BGP-LS

BGP-LS is a new BGP address family that is intended to distribute IGP topology information to external servers such as Application-Layer Traffic Optimization (ALTO) or Path Computation Element (PCE) servers. These external traffic engineering databases can then use this information when calculating optimal paths.

BGP-LS provides external ALTO and PCE servers with topology information for a multi-area or multi-level network. Through the use of one or two BGP-LS speakers per area or level, the external ALTO or PCE servers can receive full topology information for the entire network. The BGP-LS information can also be distributed through route reflectors that support BGP-LS to minimize the peering requirements.

Figure 27 shows a sample BGP-LS network.

Figure 27:  Sample BGP-LS Network 

5.8.7.1. Supported BGP-LS Components

The following BGP-LS components are currently supported.

Protocol-ID:

  1. IS-IS Level 1
  2. IS-IS Level 2
  3. OSPFv2

NLRI Types:

  1. Node NLRI
  2. Link NLRI
  3. IPv4 Topology Prefix NLRI

Node Descriptor TLVs:

  1. 512 — Autonomous System
  2. 513 — BGP-LS Identifier
  3. 514 — OSPF Area-ID
  4. 515 — IGP Router-ID

Node Attribute TLVs:

  1. 1024 — Node Flag Bits (O & B bits supported)
  2. 1028 — IPv4 Router-ID of Local Node (only supported for IS-IS)
  3. Segment Routing:
    1. 1034 — SR Capabilities (only supported for IS-IS)
    2. 1035 — SR Algorithm (only supported for IS-IS)

Link Descriptors TLVs:

  1. 258 — Link Local/Remote Identifiers
  2. 259 — IPv4 interface address
  3. 260 — IPv4 neighbor address

Link Attributes TLVs:

  1. 1028 — IPv4 Router-ID of Local Node (only supported for IS-IS)
  2. 1088 — Administrative group (color)
  3. 1089 — Maximum link bandwidth
  4. 1090 — Max. reservable link bandwidth
  5. 1091 — Unreserved bandwidth
  6. 1092 — TE Default Metric
  7. 1095 — IGP Metric
  8. 1096 — Shared Risk Link Group
  9. Segment Routing:
    1. 1099 — Adjacency Segment Identifier
    2. 1100 — LAN Adjacency Segment Identifier

Prefix Descriptors TLVs:

  1. 264 — OSPF Route Type (only Intra-Area and Inter-Area)
  2. 265 — IP Reachability Information

Prefix Attributes TLVs:

  1. 1152 — IGP Flags (only D flag supported)
  2. 1155 — Prefix Metric
  3. Segment Routing:
    1. 1158 — Prefix SID
    2. 1159 — Range (prefix-SID & sub-TLV Only)
    3. 1170 — IGP Prefix Attributes (only supported for OSPF)

5.8.8. BGP-LU Traffic Statistics

SR OS can collect BGP-LU traffic statistics.

Traffic statistics can be collected on egress data paths. This requires the use of the egress-statistics keyword when creating an import policy and that the BGP tunnel exists for the corresponding prefix. If multiple paths exist (for example, ECMP), a single statistical index is allocated and reflects the traffic sent over all paths.

Traffic statistics can also be collected on ingress data paths if the label is assigned and effectively advertised per prefix. This typically requires the use of advertise-label per-prefix when creating the import policy and applies whether or not the sr-label-index keyword is in use. However, there are cases where this may not result in a per-prefix label advertisement. When a non-BGP route (for example, a static route) is requested to be advertised (advertise-inactive) with a Label Per Prefix (LPP) policy but it exists as an active RTM route and as an inactive BGP route, the system does not use the LPP policy but instead uses the Label Per Next Hop (LPNH) policy. Statistics are not counted for this prefix. An imported (local loopback) SR label route can also be configured to use the ingress-statistics keyword by using a route table import policy under rib-management (either label-ipv4 or label-ipv6).

Overall, BGP-LU statistics apply at the:

  1. PE and forwarding RR on egress
  2. ASBR on both ingress and egress

Note:

  1. Only host prefixes (/32 and /128) are supported on egress statistics.
  2. Host and non-host prefixes are supported on ingress statistics.

Control messages sent over the BGP-LU tunnel are accounted for in traffic statistics.

BGP-LU statistics are not supported for imported LDP routes (ldp-bgp stitching) or for VPN labels (for example, inter-AS B or C).

5.8.9. BGP Egress Peer Engineering for Segment Routing

BGP Egress Peer Engineering (BGP EPE) extends the source-based segment routing (SR) capabilities beyond the AS boundary toward directly attached BGP peers. Operators can use a central controller to enforce more programmatic control of traffic distribution across these BGP peering links.

An SR SID can be allocated to a BGP peering segment and advertised in BGP Link State (BGP-LS) toward a controller, such as the Nokia NSP. The instantiation of the following BGP peering segments is supported:

  1. an eBGP or iBGP peer (peer node SIDs as defined in RFC 8402, Segment Routing Architecture)
  2. a link to such a peer (peer adjacency SIDs as defined in RFC 8402, Segment Routing Architecture)

The controller includes the specific SID in the path for an SR-TE LSP or SR policy, which it programs at the head-end LER. EPE enables a head-end router to steer traffic across a downstream peering link to a node, for traffic optimization, resiliency, or load-balancing purposes.

Figure 28 shows an example use case for BGP EPE.

Figure 28:  EPE Example Use Case 

In this example, there are two IGP domains with eBGP running between R5 and R6, and between R5 and R8. These adjacencies are not visible to R1, which is external to the IGP domain. At R5, separate peer node SIDs are allocated for R6 and R8. The peer node SIDs are advertised to the Nokia NSP in BGP-LS. This allows the NSP to compute a path across either the R5-R6 adjacency or the R5-R8 adjacency by including the appropriate peer node SID in the path. In the preceding use case, the R5-R6 adjacency is preferable. This peer node SID can be included in either the SR-ERO of an SR-TE LSP that is computed by PCEP or the segment list of BGP SR policy that is programmed at R1. Traffic on this LSP or SR policy is, therefore, steered across the required peering.

EPE is supported for BGP neighbors with either eBGP or iBGP sessions. SR peer node SIDs and peer adjacency SIDs are supported. The SID labels are dynamically allocated from the local label space on the node and advertised in BGP-LS using the encoding specified in section 4 of draft-ietf-idr-bgpls-segment-routing-epe-19.

The peering node can behave as both an LSR and an LER for steering traffic towards the peering segment. Both ILM and LTN entries are programmed for peer node SIDs and peer adjacency SIDs, with a label swap to or push of an implicit null label.

Note:

  1. BGP EPE only supports neighbor nodes that are directly connected to the egress router (for example, not indirectly through tunnels).
  2. BGP EPE SIDs cannot be used as the top SIDs on a tunnel originating at the ingress node to the peering segment. Only IS-IS or OSPF SIDs can be used as the transport SID.

ECMP is supported by default if there are multiple peer adjacency SIDs. BGP will only allocate peer adjacency SIDs to the ECMP set of next hops toward the peer node. For non-ECMP next hops, only a peer adjacency SID is allocated and it is advertised if all ECMP sets go down.

LSP ping and LSP trace echo requests are supported by including a label representing a peer node SID or peer adjacency SID in a NIL FEC of the target FEC stack. An EPE router can validate and respond to an LSP ping or trace echo request containing this FEC.

5.8.9.1. Configuring BGP EPE

In addition to enabling the BGP-LS route family for a BGP neighbor, the following CLI is required to send the Egress Peering Segments described in BGP Egress Peer Engineering for Segment Routing using the NLRI Type 2 with protocol ID set to BGP-EPE.

CLI Syntax:
configure>router>bgp
    egress-peer-engineering {
        admin-state {enable | disable}
    }

When egress-peer-engineering is administratively enabled, BGP registers with SR and the router starts advertising any peer node and peer adjacency SIDs in BGP-LS.

To allocate peer node and peer adjacency SIDs, use the following syntax to configure the egress-engineering command and enable BGP-EPE for a BGP neighbor or group.

CLI Syntax:
configure>router>bgp
    group
        neighbor <a.b.c.d> {
            egress-engineering {
                admin-state {enable | disable}
            }
        }
configure>router>bgp
    group
        egress-engineering {
            admin-state {enable | disable}
        }

The egress-engineering configuration at the neighbor level overrides the group-level configuration. When a neighbor does not have an egress-engineering configuration context, the group configuration is inherited as follows.

  1. If the group does not have an egress-engineering configuration, egress-engineering is disabled for the neighbor.
  2. If the group has an egress-engineering configuration in the default disabled state, egress-engineering is disabled for the neighbor.
  3. If the group has an enabled egress-engineering configuration, egress-engineering is enabled for the neighbor.

When a neighbor has egress-engineering configured and in the default disabled state, egress-engineering is disabled for the neighbor, irrespective of the disabled, enabled, or no-context configuration at the group level. When a neighbor has egress-engineering configured and enabled, egress-engineering is enabled for the neighbor, irrespective of the disabled, enabled, or no-context configuration at the group level.
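The inheritance rules above reduce to a short function. In this sketch, `None` stands for "no egress-engineering context configured", `False` for "configured but in the default disabled state", and `True` for "configured and enabled"; the function name and this three-valued encoding are illustrative assumptions.

```python
from typing import Optional


def effective_egress_engineering(neighbor: Optional[bool],
                                 group: Optional[bool]) -> bool:
    """Resolve whether egress-engineering is enabled for a neighbor.

    None  = no egress-engineering context at that level
    False = context present, admin-state disable (the default)
    True  = context present, admin-state enable
    """
    if neighbor is not None:
        # An explicit neighbor-level context always overrides the group.
        return neighbor
    # No neighbor context: inherit from the group; a missing group
    # context means disabled.
    return bool(group)
```

For example, a neighbor with a disabled context inside an enabled group resolves to disabled, while a neighbor with no context in the same group resolves to enabled.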

By default, enabling egress-engineering at the peer or group level causes SID values (MPLS labels) to be dynamically allocated for the peer node segment and the peer adjacency segments. Although the labels are assigned when the neighbor or group is configured, they are not programmed until the adjacency comes up. Peer node segments are derived from the BGP next hops used to reach a specific peer. If the node reboots, these dynamically allocated label values may change and are re-announced in BGP-LS.

If a BGP neighbor goes down, the router advertises a delete for all SIDs associated with the neighbor and deprograms them from the IOM. However, the label values for the SIDs are not released and the router re-advertises the same values when the BGP neighbor comes back up.

If a BGP neighbor is deleted from the configuration or is shut down, or egress-engineering is disabled, the router advertises a delete for all SIDs associated with the neighbor and deprograms them from the IOM. The router also releases the label values for the SIDs.
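The difference between a neighbor going down (label retained) and a neighbor being deleted or disabled (label released) can be modeled as below. This is a simplified sketch with one SID per neighbor; the class, its method names, and the sequential label allocator are hypothetical and do not reflect how the router assigns labels.

```python
class EpeSidTable:
    """Toy model of EPE SID label allocation and programming."""

    def __init__(self, first_label):
        self._next_label = first_label  # hypothetical dynamic label source
        self.labels = {}                # neighbor -> allocated label
        self.programmed = set()         # neighbors whose SID is programmed

    def enable(self, neighbor):
        # The label is assigned at configuration time, not yet programmed.
        if neighbor not in self.labels:
            self.labels[neighbor] = self._next_label
            self._next_label += 1
        return self.labels[neighbor]

    def session_up(self, neighbor):
        # Programmed and advertised only once the adjacency comes up.
        self.programmed.add(neighbor)
        return self.labels[neighbor]

    def session_down(self, neighbor):
        # Withdrawn and deprogrammed, but the label value is kept.
        self.programmed.discard(neighbor)

    def remove(self, neighbor):
        # Neighbor deleted, shut down, or egress-engineering disabled:
        # withdrawn, deprogrammed, and the label value is released.
        self.programmed.discard(neighbor)
        self.labels.pop(neighbor, None)
```

A session flap therefore re-advertises the same label, whereas removing and re-enabling the neighbor may produce a different one.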

5.9. BFD Support for BGP

This feature allows BGP peers to be associated with a BFD session. If the BFD session fails, BGP peering is also torn down.

5.10. BGP Configuration Process Overview

Figure 29 displays the process to provision basic BGP parameters.

Figure 29:  BGP Configuration and Implementation Flow 

5.11. Configuration Notes

This section describes BGP configuration caveats.

5.11.1. General

  1. Before BGP can be configured, the router ID and autonomous system should be configured.
  2. BGP must be added to the router configuration. There are no default BGP instances on a router.

5.11.1.1. BGP Defaults

The following list summarizes the BGP configuration defaults:

  1. By default, the router is not assigned to an AS.
  2. A BGP instance is created in the administratively enabled state.
  3. A BGP group is created in the administratively enabled state.
  4. A BGP neighbor is created in the administratively enabled state.
  5. No BGP router ID is specified. If no BGP router ID is specified, BGP uses the router system interface address.
  6. The router BGP timer defaults are generally the values recommended in IETF drafts and RFCs (see BGP MIB Notes).
  7. If no import route policy statements are specified, then all BGP routes are accepted.
  8. If no export route policy statements are specified, then all best and used BGP routes are advertised and non-BGP routes are not advertised.

5.11.1.2. BGP MIB Notes

The router implementation of the RFC 1657 MIB variables listed in Table 14 differs from the IETF MIB specification.

Table 14:  SR OS and IETF MIB Variations  

MIB Variable: bgpPeerMinRouteAdvertisementInterval

Description: Time interval in seconds for the MinRouteAdvertisementInterval timer. The suggested value for this timer is 30.

RFC 1657 Allowed Values: 1 to 65535

SR OS Allowed Values: 1 to 255. A value of 0 is supported when the rapid-update command is applied to an address family that supports it.

If SNMP is used to set a value of X to the MIB variable in Table 14, there are three possible results:

Table 15:  MIB Variable with SNMP  

Condition: X is within IETF MIB values and X is within SR OS values
Result: SNMP set operation does not return an error. MIB variable set to X.

Condition: X is within IETF MIB values and X is outside SR OS values
Result: SNMP set operation does not return an error. MIB variable set to “nearest” SR OS supported value (for example, SR OS range is 2 - 255 and X = 65535, MIB variable is set to 255). Log message generated.

Condition: X is outside IETF MIB values and X is outside SR OS values
Result: SNMP set operation returns an error.

When the value set using SNMP is within the IETF allowed values and outside the SR OS values as specified in Table 14 and Table 15, a log message is generated.

The log messages that display are similar to the following log messages:

Sample Log Message for setting bgpPeerMinRouteAdvertisementInterval to 256

535 2006/11/12 19:40:53 [Snmpd] BGP-4-bgpVariableRangeViolation: Trying to set 
bgpPeerMinRouteAdvInt to 256 - valid range is [2-255] - setting to 255

Sample Log Message for setting bgpPeerMinRouteAdvertisementInterval to 1

566 2006/11/12 19:44:41 [Snmpd] BGP-4-bgpVariableRangeViolation: Trying to set 
bgpPeerMinRouteAdvInt to 1 - valid range is [2-255] - setting to 2
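The three outcomes in Table 15 can be sketched as a small function. The default SR OS range of 2 to 255 is taken from the sample log messages above, and the IETF range from Table 14; the function name, its return shape, and the use of an exception to represent the SNMP error are assumptions for illustration only.

```python
def snmp_set_min_route_adv_interval(x, sros_range=(2, 255),
                                    ietf_range=(1, 65535)):
    """Model an SNMP set of bgpPeerMinRouteAdvertisementInterval.

    Returns (value_applied, log_message_or_None).
    Raises ValueError to represent an SNMP set error.
    """
    ietf_lo, ietf_hi = ietf_range
    sros_lo, sros_hi = sros_range

    if not ietf_lo <= x <= ietf_hi:
        # Outside the IETF MIB values: the set operation fails.
        raise ValueError(f"{x} is outside the IETF MIB range")

    # Inside the IETF range: clamp to the nearest SR OS supported value.
    value = min(max(x, sros_lo), sros_hi)
    log = None
    if value != x:
        log = (f"Trying to set bgpPeerMinRouteAdvInt to {x} - valid range "
               f"is [{sros_lo}-{sros_hi}] - setting to {value}")
    return value, log
```

Setting 256 or 1 reproduces the text of the two sample messages (without the log header), while a value such as 30 is applied unchanged and produces no log entry.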