Re: deficiency in pbuf architecture?
Author: Van Jacobson
Date: 1997/03/27
Forum: comp.protocols.tcp-ip

In article <3339764B.41C67EA6@nyx.pr.mcs.net>, ccsanady@nyx.pr.mcs.net writes:
> I'm curious what happens when an incoming multicast packet is destined
> for more than one socket. Or an outgoing one for several interfaces.
> With pbufs belonging to a particular device, reference counts don't seem
> to make sense. So, how does Van Jacobson get around this without
> copying the entire packet and queuing it for each socket?

Chris,

The intent of pbufs is to allow a device to express & enforce its special requirements on packet buffer memory (such as use of on-board memory, dma or header alignment requirements, limited dma address range, etc.). But this still allows a "generic" device (one with no special buffer requirements) to use a (shared) generic allocator.

In my test implementation, devices that had special requirements filled in the if_pget & if_pmove slots of their ifnet struct; then if_attach() filled those slots with the addresses of "generic" pget & pmove routines if they were still zero (this had the side effect of allowing most BSD drivers to be ported to the new architecture without change, then incrementally upgraded to take advantage of the new features).

The if_pmove pbuf routine is responsible for `copying' packets (I left it out of my pbuf summary slide but there's an example of its use in the ip forwarding code on pg. 26). It takes a pbuf & returns a pbuf & its semantics are `convert the input pbuf into one that can be sent on this interface'. The `generic' pmove routine looks something like:

    struct pbuf*
    generic_pmove(struct ifnet* ifp, struct pbuf* p)
    {
        if (p->p_free != generic_pfree) {
            /* isn't a generic packet -- do a pget/bcopy/pfree */
            ...
        }
        return (p);
    }

so there's no copying in the "usual" case.

Multicast is slightly complicated by the fact that a pbuf includes the link-level header & that header is different on every outgoing copy.
This means that a reference-counting allocator still wouldn't help & you're stuck with either trying to use dma chaining (which greatly complicates the architecture & hardly ever works for things as short as link-level headers) or serializing the outputs. My test kernel more-or-less did the latter.

In detail, the normal routing lookups caused multicast to go to a special multicast pseudo-interface & the route struct passed to that interface's output routine contained the list of actual output interfaces &, for each one, a pointer to a route struct containing that interface's link-level header. The list was in two pieces -- all the interfaces with non-generic pget routines, then all the ones with generic pgets.

The pseudo-interface output routine spun through all the entries on the first list doing a pget/copy/output (since this list, by definition, requires a packet copy). Then it built the output packet for the first 'generic' interface & handed it a pbuf with the 'p_free' pointer set to a routine in the pseudo-interface driver. When this free routine was called (i.e., on the packet output completion interrupt) it called the output routine of the next interface on the list. For the last interface on the generic list, the original (generic) p_free pointer was restored in the pbuf so the final output completion resulted in the packet being freed. Obviously, the interfaces in the generic list were ordered by output bandwidth to minimize the average serialization delay.

Delivery to local sockets was handled by another pseudo-interface whose route struct contained a list of sockets subscribed to the group (this pseudo-interface could be included in the list of interfaces handed to the packet forwarder pseudo-interface in the case where there were both local members & multicast packet forwarding).
If there was more than one listening socket, this pseudo-interface would make up a new pbuf header pointing to the content of the original pbuf but with the free pointer aimed at a routine in the pseudo-interface that counted the p_free's from the socket-level receive routines & only freed the original pbuf when all the listening sockets had picked up its content.

Hope this answers the questions.

Cheers.

 - Van
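
As an illustration of the serialized multicast output described above, here is a minimal C sketch of the p_free chaining over the generic interface list. It is not the code from the test kernel: the struct layouts, the p_ctx back-pointer, struct mcast_chain, mcast_send_next(), toy_output() and run_demo() are all hypothetical names made up for the example, and the toy driver signals output completion synchronously rather than from an interrupt. The link-level header rewrite for each interface is only marked by a comment.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

struct pbuf;
typedef void (*pfree_fn)(struct pbuf *);

/* Hypothetical, heavily simplified pbuf: only what the sketch needs. */
struct pbuf {
    pfree_fn p_free;      /* called when output of this pbuf completes */
    void    *p_ctx;       /* assumption: back-pointer to the chain state */
    char     p_data[64];  /* link-level header + payload */
};

struct ifnet {
    const char *if_name;
    void (*if_output)(struct ifnet *, struct pbuf *);
};

/* Per-packet state kept by the multicast pseudo-interface (hypothetical). */
struct mcast_chain {
    struct ifnet **ifs;    /* generic interfaces, assumed already ordered */
    int            nifs;
    int            next;   /* index of the next interface to send on */
    pfree_fn       orig_free;  /* generic free routine, restored at the end */
};

static int sent_count;   /* outputs started, for the demo only */
static int freed_count;  /* real frees, for the demo only */

static void generic_pfree(struct pbuf *p) { freed_count++; free(p); }

static void mcast_send_next(struct pbuf *p);

/* Installed as p_free while interfaces remain: resend instead of freeing. */
static void mcast_pfree(struct pbuf *p) { mcast_send_next(p); }

static void mcast_send_next(struct pbuf *p)
{
    struct mcast_chain *c = p->p_ctx;
    assert(c->next < c->nifs);
    struct ifnet *ifp = c->ifs[c->next++];

    /* On the last interface, restore the original free routine so that
     * the final output completion really releases the packet. */
    p->p_free = (c->next == c->nifs) ? c->orig_free : mcast_pfree;

    /* A real implementation would write ifp's link-level header here. */
    ifp->if_output(ifp, p);
}

/* Toy driver: "transmits", then reports completion through p_free. */
static void toy_output(struct ifnet *ifp, struct pbuf *p)
{
    sent_count++;
    printf("sent on %s\n", ifp->if_name);
    p->p_free(p);
}

/* Sends one packet on three toy interfaces; returns the number of outputs. */
static int run_demo(void)
{
    struct ifnet fast = {"fast0", toy_output};
    struct ifnet mid  = {"mid0",  toy_output};
    struct ifnet slow = {"slow0", toy_output};
    struct ifnet *list[] = {&fast, &mid, &slow};
    struct mcast_chain chain = {list, 3, 0, generic_pfree};

    sent_count = freed_count = 0;
    struct pbuf *p = calloc(1, sizeof *p);
    if (p == NULL)
        return -1;
    p->p_free = generic_pfree;
    p->p_ctx = &chain;
    mcast_send_next(p);  /* each completion triggers the next output */
    return sent_count;
}
```

Calling run_demo() from a main() sends the same pbuf on fast0, mid0 and slow0 in turn and frees it exactly once, after the last completion. No reference count is involved: the only per-packet state is the position in the interface list.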