IP over IP tunnels & MTU problems

'MTU' stands for 'Maximum Transmission Unit', which is a rather unwieldy way of saying 'the maximum amount of data that can be fitted in a packet'.

Note that 'packet' in this case is the layer 2 packet, e.g. an Ethernet frame or a PPP frame, so the MTU is the total size of the largest IP packet that frame can carry, including the IP header and the UDP or TCP header.

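So a link with a 1500 byte MTU can carry 1500 byte IP packets, or 1460 bytes of user data in a TCP packet once the 20 byte IP header and the 20 byte TCP header are taken off (slightly less if header options are in use).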
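The arithmetic can be sketched in a few lines of Python. This assumes IPv4 with minimum-size headers and no options; the function name max_payload is made up for illustration.

```python
# Minimum header sizes in bytes, assuming IPv4 and no header options.
IPV4_HEADER = 20
TCP_HEADER = 20
UDP_HEADER = 8


def max_payload(mtu, transport_header, ip_header=IPV4_HEADER):
    """Return the user data that fits in one unfragmented IP packet."""
    return mtu - ip_header - transport_header


print(max_payload(1500, TCP_HEADER))  # 1460
print(max_payload(1500, UDP_HEADER))  # 1472
```

Header options eat into this further, which is why the figure seen on a real connection is often a little lower.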

This leads us to one immediate problem: with all the Internet's complexities it's possible for a packet to pass over links with different MTUs. If it starts out over a link with a large MTU (e.g. 1500 for Ethernet) and then passes over a link with a smaller MTU, the packet you're sending can be too big to fit.

Luckily there is a mechanism in IP to get around this: fragmentation. A router will recognize when a packet is too big to fit down whatever link it's trying to send it down, split it into pieces ('fragments') and send the pieces. The fragments are reassembled at the packet's destination, which for a tunnel is the router at the far end of the tunnel.

This is fine until you consider that sending two packets is not as efficient as sending one, and that the receiving end needs to keep hold of all the pieces of the packet received so far until it's got all of them and can reassemble the packet. This causes extra load on the CPU and memory of the devices doing the work, and all that fragmentation and reassembly adds extra latency.
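As a rough illustration of what fragmentation does to a packet, the sketch below works out the fragments a router would produce. It assumes IPv4 with a 20 byte header and no options; fragment_sizes is a hypothetical helper, not part of any library.

```python
def fragment_sizes(packet_len, mtu, ip_header=20):
    """Return a list of (payload_offset, fragment_length) pairs.

    Lengths include the IP header. Every fragment except the last must
    carry a multiple of 8 bytes of payload, because the IPv4 fragment
    offset field counts in units of 8 bytes.
    """
    if packet_len <= mtu:
        return [(0, packet_len)]  # fits as it is, nothing to split
    payload = packet_len - ip_header
    max_chunk = (mtu - ip_header) // 8 * 8
    if max_chunk <= 0:
        raise ValueError("MTU too small to carry any payload")
    fragments = []
    offset = 0
    while offset < payload:
        chunk = min(max_chunk, payload - offset)
        fragments.append((offset, chunk + ip_header))
        offset += chunk
    return fragments


# A 1500 byte packet meeting a link with a 1400 byte MTU.
print(fragment_sizes(1500, 1400))  # [(0, 1396), (1376, 124)]
```

Note that the second fragment is tiny but still pays for a full IP header, which is part of the inefficiency described above.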

Due to this performance hit the IETF created a mechanism whereby hosts on either side of a network can determine the largest packet size they can send through whatever network lies between them.

This mechanism is called Path MTU Discovery, normally abbreviated to PMTUD or just PMTU discovery, and it's described in RFC 1191.

It works by setting a bit in the IP header of every packet a host sends, called the Don't Fragment (DF) bit. Any router that would need to fragment such a packet to fit it down a smaller link will see that the DF bit has been set and drop the packet instead of sending it on.

To keep the host that sent the packet informed of what's going on, the router sends back an 'error message' saying that it dropped the packet.

This 'error message' is in the form of an Internet Control Message Protocol (ICMP) packet, type 3 (Destination Unreachable), code 4 (fragmentation needed and DF set).
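To make the message format concrete, here is a minimal sketch that picks such a message apart. It assumes the RFC 1191 layout, where the first 8 bytes are type, code, checksum, an unused field and the next-hop MTU. The function name parse_frag_needed is invented for this example, and the checksum is not verified.

```python
import struct

ICMP_DEST_UNREACHABLE = 3  # ICMP type
ICMP_FRAG_NEEDED = 4       # ICMP code: fragmentation needed and DF set


def parse_frag_needed(icmp_bytes):
    """Return the next-hop MTU from a 'fragmentation needed' message.

    Returns None for any other ICMP message. Routers that predate
    RFC 1191 leave the MTU field as 0, so 0 means 'not reported'.
    """
    if len(icmp_bytes) < 8:
        return None
    icmp_type, code, _checksum, _unused, next_hop_mtu = struct.unpack(
        "!BBHHH", icmp_bytes[:8]
    )
    if icmp_type == ICMP_DEST_UNREACHABLE and code == ICMP_FRAG_NEEDED:
        return next_hop_mtu
    return None


# A hand-built sample header (checksum left as 0 for simplicity).
sample = struct.pack("!BBHHH", 3, 4, 0, 0, 1400)
print(parse_frag_needed(sample))  # 1400
```

A real message also carries the IP header and first 8 bytes of the dropped packet after these fields, so the sender can tell which connection it relates to.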

The host that sent the original packet then sees the ICMP error message, which under RFC 1191 also carries the MTU of the link the packet would not fit down, and resends its data in smaller packets.
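The whole exchange can be simulated in a few lines. This is a toy model, not a network implementation: the path is just a list of link MTUs, every router is assumed to report its next-hop MTU as RFC 1191 asks, and discover_path_mtu is a made-up name.

```python
def discover_path_mtu(link_mtus, initial_size):
    """Simulate a sender probing a path with the DF bit set.

    link_mtus lists the MTU of each link in path order.
    Returns (path_mtu, packets_sent).
    """
    size = initial_size
    packets_sent = 0
    while True:
        packets_sent += 1
        for mtu in link_mtus:
            if size > mtu:
                # The router drops the packet and reports this link's
                # MTU in an ICMP message; the sender shrinks and retries.
                size = mtu
                break
        else:
            # The packet crossed every link without being dropped.
            return size, packets_sent


print(discover_path_mtu([1500, 1492, 1400, 1500], 1500))  # (1400, 3)
```

In this example the sender loses two packets before settling on 1400 bytes, one for each progressively smaller link along the path.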

All this is fine in theory, but in practice there is a fairly serious problem. ICMP messages are used for many things - ICMP Echo Request and ICMP Echo Reply are used by ping to see if a host exists, ICMP Time Exceeded In Transit is used by traceroute/tracert to see how many routers there are between two hosts, and there are a bunch of others, some of which do rather dodgy things like finding out the netmask of a remote host, or what a remote host thinks the time is.

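All these different types of ICMP worry network administrators, and so some of the more clueless ones have started blocking ALL ICMP packets at their firewalls, breaking Path MTU Discovery. The sender never hears that its packets were too big, so it keeps resending them at the same size and they keep being dropped.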

Now consider IP over IP tunnels - when trying to fit a 1500 byte packet inside another IP packet running over a link which already has a 1500 byte MTU, the original packet will have to be fragmented to leave room for the outer IP header.
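The squeeze is easy to put numbers on. The sketch below assumes plain IP over IP encapsulation, which adds one 20 byte IPv4 header with no options; other tunnel types add more. tunnel_mtu is a hypothetical helper name.

```python
IPV4_HEADER = 20  # bytes added by one layer of IP over IP encapsulation


def tunnel_mtu(link_mtu, overhead=IPV4_HEADER):
    """Return the largest inner packet that fits without fragmentation."""
    return link_mtu - overhead


inner = tunnel_mtu(1500)
print(inner)         # 1480
print(1500 - inner)  # 20 bytes too many for a full-size packet

# Each extra layer of tunnelling takes another bite.
print(tunnel_mtu(tunnel_mtu(1500)))  # 1460
```

So a host that believes it has a 1500 byte MTU will produce packets 20 bytes too large for the tunnel, and every full-size packet ends up as two fragments.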

When you consider that one popular (though rather cheesy) VPN method in the open source world is to run IP over PPP over SSH, you have multiple layers of encapsulation with associated fragmentation and reassembly. Microsoft's PPTP, which carries IP over PPP over GRE alongside a TCP control connection, has similar problems.

Given that consume will almost certainly make some use of IP over IP tunnels, we're going to have to be careful to keep MTUs as high as possible. We may also want to configure the DHCP servers on nodes to tell clients to use a lower MTU (with the 'option interface-mtu' option).

See also: http://sites.inka.de/sites/bigred/devel/tcp-tcp.html.