MULTICAST DESIGN AND IMPLEMENTATION ON RUNET AND RUNET2000

Document Managed by Network Architecture

Overview

This document details the design considerations and implementation plan for the RUNet / RUNet 2000 IP multicast infrastructure. The reader is assumed to have a casual familiarity with multicast terms and concepts.

Introduction

In the last two years, an unprecedented focus has been placed on the RUNet as a teaching tool and productivity aid. Both the classroom and laboratory are moving into areas where collaborative applications with audio and visual content are the norm. Additionally, software programmers are finding ever more efficient ways of advertising and locating services, applications and content. The role of RUNet in such an environment is to provide ready access and efficient distribution of the data used by these enriched media. To satisfy the increasing bandwidth demands placed on the network infrastructure, the Network Operations Group has nearly completed a network-wide backbone upgrade of the legacy RUNet. However, bandwidth alone does not satisfy the service demands; there is a real need for a stable, reliable and robust multicast infrastructure. A major design goal of the legacy backbone upgrade is the support of such a multicast infrastructure. The ATM core provides an efficient transport solution for the near term, and advanced optical technologies to be employed in the RUNet 2000 design will satisfy demands in the future. The distribution infrastructure is built on switching technology that supports efficient multicasting, and has enough bandwidth to support today's applications. All routers and switches in service on the RUNet backbone today have the capability of supporting multicast traffic.

Technology Overview

Multicast is a special case of broadcasting where a subset of all possible receivers is the intended recipient. It is sometimes used as a distribution mechanism for audio and video data, typically streaming UDP-type data to a single group address, which all receivers share. Multicast differs from unicast data transport to multiple receivers in that the source does not need to replicate the packet for every receiver. Rather, the sender sends to a single multicast group address, and the network infrastructure handles the necessary replication of data and delivery to the receivers. IP Multicast group addresses are drawn from a pool of Class D addresses in the range 224.0.0.0 - 239.255.255.255, or 224.0.0.0/4. There is an extensive list of well-known IP multicast group addresses defined in RFC 1700.
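The Class D range described above can be checked mechanically. The following is a minimal sketch using Python's standard ipaddress module; the function name is_multicast_group is an illustrative assumption and is not part of any RUNet tooling.

```python
import ipaddress

# The Class D block reserved for IP multicast group addresses.
MULTICAST_RANGE = ipaddress.ip_network("224.0.0.0/4")


def is_multicast_group(address):
    """Return True if the IPv4 address lies in 224.0.0.0 - 239.255.255.255."""
    return ipaddress.ip_address(address) in MULTICAST_RANGE


print(is_multicast_group("224.0.1.35"))   # True
print(is_multicast_group("192.0.2.10"))   # False
```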

The process of routing multicast traffic is similar to a unicast routing process. For each router on the path between a multicast source and receiver, the router receives the packet, compares the destination address with a table of "next hops," and forwards the packet out the appropriate interface. The critical difference between unicast routing and multicast routing is that the "next hop" may be several routers on several different interfaces. In some implementations, multicast traffic is simply forwarded out all interfaces, except the incoming interface. There are several different protocols that govern the population of the multicast routing tables, such as Multiprotocol Border Gateway Protocol (MBGP), Multicast Open Shortest Path First (MOSPF), Protocol Independent Multicast (PIM) and Distance Vector Multicast Routing Protocol (DVMRP).

IETF RFC 2236 defines the Internet Group Management Protocol version 2 (IGMPv2), which provides a mechanism by which hosts can advertise their membership in IP multicast groups. Like ICMP, it is an integral part of the IP protocol suite, and is implemented in most hosts' IP stacks. A host wishing to join a multicast group sends an unsolicited IGMPv2 Membership Report message containing the address of the group it wishes to join. Any router connected to that IP subnetwork will receive this advertisement. If the group is not already part of the router's forwarding table, then the router will add the group, and arrange to begin receiving data for that group. The router periodically queries the group to see if there are still members. Members of the queried group respond to the query with an IGMPv2 Membership Report, though a host may suppress its own report if it hears another member answer for the same group first. To leave a group, a host should send an IGMPv2 Leave Group message, though it need only stop advertising its membership, and ignore IGMP queries. Once the group has no more members, the router removes the group from its forwarding table. To begin sourcing multicast traffic, a host only need begin transmitting to the group address.
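The router's side of this exchange can be pictured as a small table of groups with expiry timers. The sketch below is a simplified model, not an IGMPv2 implementation: the class name, method names and the two timer values are assumptions chosen for illustration, and query transmission is left out entirely.

```python
class IgmpInterface:
    """Toy model of group membership tracking on one router interface."""

    def __init__(self, membership_interval_s=260.0, last_member_interval_s=2.0):
        # Both intervals are illustrative assumptions, not RUNet settings.
        self.membership_interval_s = membership_interval_s
        self.last_member_interval_s = last_member_interval_s
        self.expiry = {}  # group address -> time at which membership lapses

    def on_report(self, group, now):
        """Handle a Membership Report; return True if the group is new here."""
        is_new = group not in self.expiry
        self.expiry[group] = now + self.membership_interval_s
        return is_new

    def on_leave(self, group, now):
        """Handle a Leave Group message by shortening the group's deadline."""
        if group in self.expiry:
            deadline = now + self.last_member_interval_s
            self.expiry[group] = min(self.expiry[group], deadline)

    def expire(self, now):
        """Remove and return the groups whose membership has lapsed."""
        lapsed = sorted(g for g, t in self.expiry.items() if t <= now)
        for group in lapsed:
            del self.expiry[group]
        return lapsed
```

A report refreshes the timer, a leave shortens it so that remaining members still have a moment to answer, and a group that nobody refreshes eventually drops out of the table.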

Multicast traffic sometimes requires bandwidths that would consume all network resources on a user LAN. To manage such traffic at the data-link layer, Cisco Systems defined a proprietary protocol called the Cisco Group Management Protocol (CGMP). CGMP is a protocol that works between Cisco routers and switches to establish a list of MAC addresses in the switch's forwarding table that are interested in receiving multicast traffic. The router populates the table on the switch's behalf: when it receives an IGMP report from a host, it sends the switch a CGMP message carrying the MAC address of that host and the MAC address of the group, and the switch maps the host's MAC address to a port. When this table is populated, multicast traffic from the router is only forwarded out ports which have group members attached. Thus, multicast traffic is constrained at the switch from being flooded out all ports, and consuming bandwidth with data that is thrown away by the host.
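Regardless of how the table is populated, its effect on forwarding is simple to illustrate. The sketch below models a switch that floods a group it knows nothing about and constrains a group once members are recorded. The class, its methods and the group MAC address used in the example are hypothetical.

```python
class ConstrainedSwitch:
    """Toy Layer 2 switch that limits multicast to ports with group members."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.members = {}  # group MAC address -> set of member ports

    def learn_member(self, group_mac, port):
        """Record that a member of the group is attached to the given port."""
        self.members.setdefault(group_mac, set()).add(port)

    def egress_ports(self, group_mac, ingress_port):
        """Return the ports a frame is sent out of, never its arrival port."""
        if group_mac not in self.members:
            # No table entry: the frame is flooded like a broadcast.
            return self.ports - {ingress_port}
        return self.members[group_mac] - {ingress_port}


switch = ConstrainedSwitch(ports=[1, 2, 3, 4])
group = "01:00:5e:02:02:02"  # hypothetical group MAC address
print(sorted(switch.egress_ports(group, ingress_port=1)))  # [2, 3, 4]
switch.learn_member(group, 3)
print(sorted(switch.egress_ports(group, ingress_port=1)))  # [3]
```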

Protocol Independent Multicast

This document focuses on the design and implementation of a multicast infrastructure using Protocol Independent Multicast (PIM). PIM's primary function as a routing protocol is to discover PIM-capable neighbors and to populate the multicast routing table. It has three modes of operation, dense mode, sparse mode, and a hybrid sparse-dense mode, which differ in their traffic flow characteristics. Generally, PIM maintains the lists of incoming and outgoing interfaces for sources and groups which constitute the multicast routing table. It operates within a multicast domain, which is best described as the set of routers configured to operate as PIM peers. The protocol uses several simple messages such as PIM-Hello, PIM-Join, PIM-Prune and PIM-Register to propagate membership and routing information.

PIM discovers its neighbors by periodically sending PIM Hello messages out all interfaces to a special "all PIM routers" multicast group at address 224.0.0.13. If other PIM-enabled routers are on the network segment, then they respond to the messages, and a Designated Router (DR) is elected, with the PIM router with the highest IP address winning. If there are no other routers on the segment, then the router assumes it is the DR. The DR is the router on the network segment that will source multicast information on that segment, and is responsible for sending PIM Join messages to upstream PIM routers.
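The election rule reduces to picking the numerically highest address on the segment. A minimal sketch follows, assuming the candidates are given as dotted-quad strings; the function name is illustrative. Note that the comparison must be numeric, since a plain string comparison would rank "10.0.0.9" above "10.0.0.10".

```python
import ipaddress


def elect_designated_router(router_addresses):
    """Return the address of the PIM router with the highest IP address."""
    return max(router_addresses, key=ipaddress.ip_address)


print(elect_designated_router(["10.0.0.2", "10.0.0.10", "10.0.0.9"]))  # 10.0.0.10
```

A router alone on its segment is handled by the same rule, since it is the only candidate.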

PIM populates a multicast routing table with multicast state information. Each routing entry is defined by source and group information, and includes information regarding the incoming and outgoing interfaces, as well as various timers, and flags. There are two states for a multicast entry, (*,G) state, which matches all senders for a particular group, and (S,G) state, which matches a specific source sending to a particular group. The difference in designations is necessary because any multicast group can have multiple senders, and the senders' "upstreams," or incoming interfaces, might be different router interfaces. Thus, for a multicast state (S1,G), the incoming and outgoing interfaces may be different than those for (S2,G). When making multicast routing decisions, PIM looks for a longest match, matching (S,G) first, and then (*,G) if there is no information for that sender. (*,G) state is maintained for all active groups, and is created automatically when a group is requested or sent to. (S,G) states are created as specific sources are learned. A (S,G) state does not exist without a "parent" (*,G) state. Flags are important elements of source-group state, and are commonly used to determine whether data is forwarded or pruned for a given (S,G) state, and whether the group is sparse or dense. PIM routing interaction with (S,G) and flags is detailed for each mode, below.
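The longest-match rule can be sketched as a two-step dictionary lookup. The table layout below (a dictionary keyed by source and group, holding an incoming interface and an outgoing interface list) is an assumption made for illustration and does not reflect any router's internal data structures.

```python
WILDCARD = "*"


def lookup(mroute_table, source, group):
    """Return the (S,G) entry if one exists, else the (*,G) entry, else None."""
    entry = mroute_table.get((source, group))
    if entry is None:
        entry = mroute_table.get((WILDCARD, group))
    return entry


# Hypothetical table: one shared entry and one source-specific entry.
table = {
    (WILDCARD, "224.2.2.2"): {"iif": "atm0", "oifs": ["eth1", "eth2"]},
    ("10.1.1.1", "224.2.2.2"): {"iif": "eth0", "oifs": ["eth1"]},
}

print(lookup(table, "10.1.1.1", "224.2.2.2")["iif"])  # eth0
print(lookup(table, "10.9.9.9", "224.2.2.2")["iif"])  # atm0
```

The two entries for the same group carry different incoming interfaces, which is the reason the (S,G) and (*,G) designations are kept separate.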

PIM Dense Mode (PIM-DM) is a PIM configuration that floods all multicast traffic throughout the network. It is used in networks where it is assumed that a majority of the hosts are multicast receivers of a large number of the known multicast groups. It is characterized by persistent (S,G) forwarding state throughout the multicast domain. PIM-DM follows an "implicit-join" model, which means that data flows as if all groups have receivers at the edge of the network. Since this is rarely the case, PIM-DM has potentially complex interactions when the effects of PIM join and prune messages are considered.

Under PIM-DM, a typical prune scenario might occur as follows. If, starting at the farthest-downstream router A, there are no receivers for a multicast group G, then router A will send a PIM-Prune message for that group back through the network to its upstream peer, router B. On router A, group G's outgoing interface list is set to Null, and a timer is started and the state will be removed when the timer expires. The function of the timer is to hold (S,G) state in case there are any late joiners for group G. When router B receives the PIM-Prune message from router A, it has several options. If there are other receivers on router A's network segment, then router B continues to forward traffic for group G out the interface. If there are no other receivers on the segment, then router B removes the interface from group G's outgoing interface list. If the outgoing interface list for group G's (S,G) state entry goes to Null, then the router sends a PIM-Prune message to its upstream peer, and the timer is started to remove the state. The procedure iterates for each router C, D, E, and so on, propagating back upstream toward the source. Thus, PIM-DM (S,G) state information is pruned and removed.
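The upstream propagation of a prune can be traced with a short loop. This sketch assumes a simple chain of routers, ignores the state-removal timers, and assumes no other receivers share a segment. The router names and interface labels are hypothetical.

```python
def propagate_prune(oifs, upstream, start):
    """Walk upstream from `start`, pruning while outgoing lists are empty.

    oifs:     router name -> set of outgoing interfaces for the (S,G) entry
    upstream: router name -> (upstream router, that router's interface
              facing downstream)
    Returns the routers whose outgoing interface list went to Null.
    """
    pruned = []
    router = start
    while router is not None and not oifs[router]:
        pruned.append(router)
        parent = upstream.get(router)
        if parent is None:
            break  # reached the router nearest the source
        parent_router, parent_iface = parent
        # The upstream peer removes the interface the prune arrived on.
        oifs[parent_router].discard(parent_iface)
        router = parent_router
    return pruned


oifs = {"A": set(), "B": {"to_A"}, "C": {"to_B", "to_X"}}
upstream = {"A": ("B", "to_A"), "B": ("C", "to_B")}
print(propagate_prune(oifs, upstream, "A"))  # ['A', 'B']
```

Router C keeps forwarding because it still has another outgoing interface, so the prune stops there.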

Recalling that the default mode for PIM-DM is to flood all multicast traffic, consider the behavior for the group information when the timer to remove the state expires. When a PIM-DM router receives multicast information for a group for which it does not have an entry, it adds (*,G) state and (S,G) state for that group, adds all eligible interfaces except the incoming interface to the outgoing interface list for that state and begins forwarding data. When an upstream router expires its state and removes it, it will immediately, upon receipt of the next multicast packet for group G, add state and forward out all interfaces. The downstream routers will also add state and forward until the farthest-downstream router times out and begins the prune cycle again. This interaction between flood and prune leads to complex and unexpected oscillations in the network.

PIM Sparse Mode (PIM-SM) is a PIM configuration that subscribes to an "explicit-join" model. It assumes that there are no downstream receivers, and initially forwards no data. Like PIM-DM, it employs PIM Join, Prune and Hello messages to build and maintain (S,G) state, though it makes slightly more rational decisions about when to forward and what to prune. The key difference is where PIM-DM wants to forward everything, PIM-SM wants to prune. Routers interact with hosts using IGMP reports to determine group memberships. When a router learns of a downstream receiver for a group G, it adds (*,G) state with the outgoing interface list populated with the interface in which the membership request entered, and sends a PIM-Join message upstream. (The determination of "upstream" and "downstream" interfaces under PIM-SM will be discussed shortly.) When the router begins receiving information for the group, it creates a (S,G) state for that particular source, and copies the outgoing interface list from the (*,G) entry. The data is forwarded, and a timer is started which will remove the group when it expires. The timer is reset whenever traffic is received that matches that (S,G) entry. (S,G) state is cleared when the timer expires, or when a PIM-Prune message is received, and the prune timer expires.
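The explicit-join behavior can be modeled with a few lines of state handling. The sketch below omits timers and prunes and only shows how (*,G) state is created by a membership report and how (S,G) state inherits its outgoing interface list. The class and method names are assumptions for illustration.

```python
class SparseModeRouter:
    """Toy model of state creation on a PIM-SM last-hop router."""

    def __init__(self):
        self.state = {}       # (source or "*", group) -> set of outgoing interfaces
        self.joins_sent = []  # groups for which a PIM-Join was sent upstream

    def on_igmp_report(self, group, interface):
        """A receiver appeared: create (*,G) if needed and add the interface."""
        key = ("*", group)
        if key not in self.state:
            self.state[key] = set()
            self.joins_sent.append(group)  # Join goes upstream once per group
        self.state[key].add(interface)

    def on_data(self, source, group):
        """Return the interfaces the packet is forwarded out of."""
        shared = self.state.get(("*", group))
        if shared is None:
            return set()  # explicit-join model: no receivers, nothing forwarded
        if (source, group) not in self.state:
            # First packet from this source: copy the list from (*,G).
            self.state[(source, group)] = set(shared)
        return set(self.state[(source, group)])
```

Before any report arrives, on_data returns an empty set, which is the opposite of the dense-mode default of forwarding everywhere.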

An important differentiation between PIM-SM and PIM-DM is in the propagation of (S,G) state. Since data is not flooded to the network, disseminating source-group information becomes problematic, and requires the services of a Rendezvous Point (RP). The RP is a router designated in the multicast infrastructure to manage and maintain all or part of the (S,G) state information for a domain. All PIM-SM routers must be configured with the address of a RP, which adds some configuration complexity. Additionally, only one RP address for a particular set of multicast groups may exist, though this shortcoming can be worked around. The RP acts as the root of a Shared, or RP tree (RPT). The RP tree is the initial data source for all new group receivers, until the routers can construct a shortest-path tree (SPT) between a receiver and its source. The construction of the RPT and subsequent construction of SPTs is an important and desirable feature of PIM-SM.

Consider two scenarios, one where a receiver wishes to receive multicast traffic for a group G1, and another where a sender S1 begins sending data to G1. In the first scenario, a receiver sends an IGMP membership report to group G1, stating that it wishes to join. The host's router, router A, receives the request and creates a (*,G1) state, and sends a PIM-Join request for group G1 along the unicast route towards the RP. Each router in-line between router A and the RP will add (*,G1) state and forward a request until either the request is received at the RP, or the request is received at a router where (*,G1) exists. In either case, a router with existing (*,G1) state will begin forwarding data with the RPT-bit set. Thus, a RP Tree is constructed.
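The hop-by-hop construction of the RP tree in the first scenario can be sketched as follows. The path is assumed to be already known from unicast routing, and the router names are hypothetical.

```python
def send_join_toward_rp(path_to_rp, has_star_g):
    """Propagate a PIM-Join along the unicast path toward the RP.

    path_to_rp: router names in order, from the receiver's router to the RP
    has_star_g: set of routers already holding (*,G) state (updated in place)
    Returns the routers that created new (*,G) state.
    """
    created = []
    for router in path_to_rp:
        if router in has_star_g:
            break  # existing state: this router already sits on the RP tree
        has_star_g.add(router)
        created.append(router)
    return created


on_tree = set()
print(send_join_toward_rp(["A", "B", "C", "RP"], on_tree))  # ['A', 'B', 'C', 'RP']
print(send_join_toward_rp(["D", "C", "RP"], on_tree))       # ['D']
```

The second receiver's join stops at router C, which already holds state, so the tree grows by a single branch rather than a second full path.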

When the last-hop router receives data from the RP tree, the data will have come from some source S1. The router adds (S1,G1) state and begins constructing a shortest path tree between itself and the source. The router will send a PIM-Join request out the interface the unicast routing table indicates, with the SPT-bit set. The next router upstream will add (S1,G1) state (if it doesn't already exist) and propagate the request along the unicast path to the source. When the request reaches either a router with an existing (S1,G1) state or the source's first-hop router, the router will add the interface in which the request was received to the outgoing interface list for state (S1,G1) and begin to forward data. When the last-hop router before the receiver receives the (S1,G1) data, it sends a PIM-Prune for state (S1,G1) with the RPT-bit set towards the RP. This request propagates along the path to the RP, pruning and ultimately removing the state. Thus, the Shortest Path Tree is constructed and the Shared Tree is torn down.

In the second scenario, a host wishing to source multicast data simply begins sending to the group address G1. The host's router receives the data, but has no (*,G1) state. It encapsulates the multicast packets into unicast PIM-Register packets, and sends them to the RP. The RP receives the PIM-Register messages, and does two things. First, if there are receivers for the (*,G1) state, it forwards the data out the shared tree. In addition, it begins constructing a SPT, sending PIM-Join requests toward the new source. Once the source's router receives the PIM-Join request, it stops sending the PIM-Register messages and switches over to the SPT, forwarding the multicast packets natively. If the RP receives no joiners for group G1 within the timeout, it will prune the group, but retain the group state for any future joiners.
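The switch from register encapsulation to native forwarding at the source's router can be sketched as a small state machine. The RP address below is a hypothetical value from a documentation range, and the dictionary returned for each packet is an illustrative stand-in for real packet formats.

```python
RP_ADDRESS = "192.0.2.1"  # hypothetical RP address, not a RUNet value


class FirstHopRouter:
    """Toy model of the source-side router in PIM-SM."""

    def __init__(self, rp_address=RP_ADDRESS):
        self.rp_address = rp_address
        self.joined = set()  # (source, group) pairs with a PIM-Join received

    def on_pim_join(self, source, group):
        """A Join for (S,G) arrived: native forwarding can begin."""
        self.joined.add((source, group))

    def handle_packet(self, source, group, payload):
        """Forward natively once joined, otherwise register with the RP."""
        if (source, group) in self.joined:
            return {"kind": "native", "dst": group, "payload": payload}
        # No Join yet: wrap the multicast packet in a unicast Register.
        return {
            "kind": "register",
            "dst": self.rp_address,
            "inner": (source, group, payload),
        }
```

The destination of the outgoing packet changes from the unicast RP address to the group address at the moment the Join arrives.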

A third, hybrid mode of PIM exists called PIM Sparse-Dense Mode (PIM-SDM). It is proprietary to Cisco routers, and provides a workaround to some of PIM-SM's shortfalls. PIM-SDM allows some groups to operate like PIM-DM groups, and other groups to act like PIM-SM groups. It typically is used in combination with Auto-RP to select a designated RP from a pool of candidates. The candidate RPs are configured to advertise their candidacy on a special PIM group, 224.0.1.39. A pool of routers called RP mapping agents listens to this group and elects the candidate RP with the highest IP address to be the RP. The mapping agents then multicast the designated RP's address on a special PIM group, 224.0.1.40. The groups 224.0.1.39 and 224.0.1.40 are forwarded according to PIM-DM rules, while virtually all other groups are forwarded according to PIM-SM rules. All PIM-SDM routers are members of the 224.0.1.40 group, and use the advertised RP as the RP for the domain. If that RP fails, then another RP is elected, and advertised. While more dynamic than PIM-SM, PIM-SDM is a new protocol, and has relatively few gains over PIM-SM in light of the other workarounds available for PIM-SM.

Multicast Design Considerations

Given that the multicast infrastructure will be designed around a PIM-SM scheme, which is preferred over other PIM configurations for its predictability and desirable performance characteristics, several points must be considered before implementation. Namely, these points involve the placement of the RP, rate limits, group filtering and border policies. Several of these considerations are highlighted below.

  • Ideally, the RP should be centrally located in the network, and have a high bandwidth connection to meet the demands of many multi-megabit groups. In the RUNet implementation, this demand can be met by any ATM-connected router.
     
  • Strictly speaking, only one RP can exist for any given group within a PIM-SM design, though this can be worked around given certain nuances in unicast routing protocols. Since RUNet is based on OSPF, multiple, uniquely identifiable RP devices can be configured with the same address on a separate loopback interface, and can synchronize (S,G) state information using Multicast Source Discovery Protocol (MSDP). The unicast routing protocol rules will prefer routes to the topologically closest RP, and will "fail over" to the next topologically closest RP should the closest one fail.
     
  • The unicast network infrastructure provides the multicast service sufficient robustness for any single device failure. PIM bases its routing decisions on the unicast routing table, and responds to topology changes within seconds of the unicast routing protocol's convergence.
     
  • The RP imports Multicast Backbone (MBONE) data into the University using PIM-SM and MBGP, peering with one of our service providers. Source-group information is exchanged with the provider using MSDP. Certain groups, detailed below, are constrained to within the University's Autonomous System boundary, per Best Common Practice recommendations.
     
  • The network design uses T1 links as backup links in the event of ATM failures. Since T1 links only have 1.5 Mbps of bandwidth, multicast must be treated very carefully across these links. According to the multicast design, multicast services such as SLP and multicast time are allowed to traverse such links, and are rate-limited to 384 kbps.
     
  • By default, most IP firewalls do not pass multicast traffic. This can be worked around by using a DVMRP tunnel through the firewall, with one end of the tunnel on the firewall's router, and the other end of the tunnel on a device running mrouted. Additionally, if the firewall is performing Network Address Translation or Port Address Translation for RFC 1918 address space, the DVMRP tunnel will prevent the translation of source addresses, causing unpredictable results for SPT construction. The security implications of such a configuration also need consideration, but are beyond the scope of this document. Firewalls, in general, will prevent the propagation of multicast data, and should be avoided on a network where multicast service is required.
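The shared-address RP arrangement listed above relies on unicast routing to pick the topologically closest RP and to fail over when it disappears. The selection can be sketched as a minimum-cost choice among the RPs still reachable. The RP names and costs below are hypothetical, and real OSPF path selection involves more than a single cost value.

```python
def closest_rp(costs_to_rp, failed=()):
    """Return the reachable RP with the lowest unicast cost, or None.

    costs_to_rp: RP name -> unicast routing cost from this router
    failed:      RP names that are currently unreachable
    """
    reachable = {rp: cost for rp, cost in costs_to_rp.items() if rp not in failed}
    if not reachable:
        return None
    return min(reachable, key=reachable.get)


costs = {"rp-1": 10, "rp-2": 30, "rp-3": 20}  # hypothetical costs
print(closest_rp(costs))                   # rp-1
print(closest_rp(costs, failed={"rp-1"}))  # rp-3
```

Because every RP answers to the same address, the routers need no reconfiguration when the choice changes.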

Implementation

IP Multicast traffic is managed at Layer 3 using PIM-SM. PIM-SM is enabled on all building router uplinks, Area Border Router downlinks and ATM core interfaces as of July 1, 2000. PIM-SM is enabled on most user LANs. Rate limits still need to be installed on 10 Mb shared media interfaces. By default, most multicast groups are rate-limited to 768 kbps per group by the routers, so there should be no immediate problem with users swamping their LANs. PIM-SM is enabled on the T1 backup links, and has been rate-limited to 384 kbps. Filters need to be installed restricting multicast traffic on these interfaces to several well-known groups.
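The arithmetic behind these limits is worth making explicit. The sketch below uses the figures quoted in this document (768 kbps per group, a 10 Mb shared segment, a 384 kbps cap on a 1.5 Mbps T1) and treats 1 Mb as 1000 kb, which is an assumption. The function names are illustrative.

```python
def groups_that_fit(per_group_kbps, link_kbps):
    """Number of groups at the per-group cap that fit within the link."""
    return link_kbps // per_group_kbps


def share_of_link(limit_kbps, link_kbps):
    """Fraction of the link's bandwidth that the rate limit allows."""
    return limit_kbps / link_kbps


print(groups_that_fit(768, 10_000))          # 13
print(round(share_of_link(384, 1_500), 3))   # 0.256
```

Under these assumptions, thirteen groups running at the per-group cap would fill a 10 Mb shared segment, and multicast can occupy roughly a quarter of a T1 backup link.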

Multicast traffic is managed at Layer 2 where CGMP-capable devices are installed and have CGMP enabled. By default, CGMP is enabled on routers when PIM is enabled. Network Operations has enabled CGMP on all campus distribution routers. CGMP is enabled by default on Cisco Catalyst 2900, 4000, 5000 and 6000 series switches.

Provisions have been made in the ATM core to allow for the efficient transport of multicast traffic. ATM natively supports point-to-multipoint traffic, allowing a special virtual circuit called a Multipoint Switched Virtual Circuit (MSVC) to be set up for each multicast group. Under ATM multipoint signaling, each sender sets up a MSVC to its switch. The switch sets up downstream MSVCs towards the endpoints, and replicates incoming data over the downstream MSVCs. The multipoint trees are overlaid on the ATM infrastructure, allowing each router to be either a sender or receiver in any given tree. Multicast over ATM is described in RFC 2337.

The RP is planned to reside on the transition router between the legacy RUNet and RUNet 2000. The transition gateway presents an ideal place for the RP because it is attached to both the ATM infrastructure and RUNet 2000, and because it is an equal distance from all users on either network. Ultimately, one router in each MAN will be configured as the RP for that MAN, with provisions to fail over to the next-closest RP.

Multicast traffic enters and leaves the University via rutgers-gw's connection to the vBNS, subject to multicast filters. BGP has been configured to allow both unicast and multicast routes to be exchanged with its vBNS peer. Rutgers-gw exchanges PIM traffic with the peer, and learns source-group information via MSDP. It also uses MSDP to share source-group information with the configured RPs. The following groups are filtered, and are not allowed to enter or leave the University:

  • 224.0.1.2 - SGI-Dogfight
  • 224.0.1.3 - Rwhod
  • 224.0.1.22 - SVRLOC
  • 224.0.1.24 - Microsoft Directory Service
  • 224.0.1.35 - Service Location Protocol (SLP), RFC 2165
  • 224.0.1.39 - PIM c-RP multicast group
  • 224.0.1.40 - PIM Auto-RP multicast group
  • 224.0.1.60 - (Unknown as of this writing)
  • 224.0.2.2 - SUN RPC PMAPPROC_CALLIT
  • 239.0.0.0/8 - Private, local multicast address space
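The border policy listed above amounts to a membership test against a short list of networks. The sketch below expresses it with Python's standard ipaddress module; the function name is an assumption, and the code is an illustration of the policy rather than the router configuration in use.

```python
import ipaddress

# Groups from the list above, written as networks (single groups as /32).
FILTERED_GROUPS = [
    ipaddress.ip_network(n)
    for n in (
        "224.0.1.2/32", "224.0.1.3/32", "224.0.1.22/32", "224.0.1.24/32",
        "224.0.1.35/32", "224.0.1.39/32", "224.0.1.40/32", "224.0.1.60/32",
        "224.0.2.2/32", "239.0.0.0/8",
    )
]


def may_cross_border(group):
    """Return True if the group is not on the border filter list."""
    address = ipaddress.ip_address(group)
    return not any(address in network for network in FILTERED_GROUPS)


print(may_cross_border("224.0.1.35"))     # False
print(may_cross_border("239.1.2.3"))      # False
print(may_cross_border("224.2.127.254"))  # True
```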

From the user's perspective, the only requirements to be met for multicast support are an IP stack supporting IGMPv2 and some flavor of multicast software. Multicast may not work under the following conditions:

  • Multicast not enabled on the router interface
  • The client software is configured with IGMPv1
  • There is a firewall in place
  • There is no source for the group
  • Your primary link is a T1 or other low-bandwidth facility
  • A group in the 239.0.0.0/8 range sourced from outside the University is requested

References

  1. Fenner, RFC 2236, "Internet Group Management Protocol, Version 2," IETF 1997
  2. Farinacci, et al., RFC 2337, "Intra-LIS IP multicast among routers over ATM using Sparse Mode PIM," IETF 1998
  3. Estrin, et al., RFC 2362, "Protocol Independent Multicast-Sparse Mode (PIM-SM): Protocol Specification," IETF 1998
  4. Meyer, RFC 2365, "Administratively Scoped IP Multicast," IETF 1998
  5. Finlayson, RFC 2588, "IP Multicast and Firewalls," IETF 1999