options are specified more than once on the same mount command line,
then the value of the rightmost instance of each of these options
takes effect.
.SS "Using NFS over UDP on high-speed links"
Using NFS over UDP on high-speed links such as Gigabit Ethernet
.BR "can cause silent data corruption" .
.P
The problem can be triggered at high loads, and is caused by problems in
IP fragment reassembly. NFS reads and writes typically transmit UDP packets
of 4 kilobytes or more, which have to be broken up into several fragments
in order to be sent over the Ethernet link, which limits packets to 1500
bytes by default. This process happens at the IP network layer and is
called fragmentation.
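.P
As a rough illustration of the fragmentation arithmetic above, the following
Python sketch estimates how many IP packets one UDP datagram needs.
The function name fragment_count is hypothetical, and the sketch assumes
IPv4 with a 20-byte header and no IP options; real NFS requests also carry
RPC headers on top of the data payload, which this estimate ignores.
.P
.nf
```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8    # bytes


def fragment_count(udp_payload, mtu=1500):
    """Number of IPv4 packets needed to carry one UDP datagram."""
    datagram = UDP_HEADER + udp_payload   # what IP has to carry
    room = mtu - IPV4_HEADER              # IP payload bytes per packet
    if datagram <= room:
        return 1                          # fits, no fragmentation
    # Every fragment except the last carries a multiple of 8 bytes.
    per_fragment = room // 8 * 8
    return math.ceil(datagram / per_fragment)


for size in (4096, 8192):
    print(size, "bytes:", fragment_count(size), "packets at MTU 1500")
```
.fi
.P
Under these assumptions, a 4096-byte payload is split into 3 packets and an
8192-byte payload into 6 packets at the default MTU of 1500.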
.P
In order to identify fragments that belong together, IP assigns a 16-bit
IP ID
value to each packet; fragments generated from the same UDP packet
will have the same IP ID. The receiving system will collect these
fragments and combine them to form the original UDP packet. This process
is called reassembly. The default timeout for packet reassembly is
30 seconds; if the network stack does not receive all fragments of
a given packet within this interval, it assumes the missing fragment(s)
got lost and discards those it already received.
.P
The problem this creates over high-speed links is that it is possible
to send more than 65536 packets within 30 seconds. In fact, with
heavy NFS traffic one can observe that the IP IDs repeat well within
that interval.
.P
This has serious effects on reassembly: if one fragment gets lost,
another fragment
.I from a different packet
but with the same IP ID
will arrive within the 30 second timeout, and the network stack will
combine these fragments to form a new packet. Most of the time, network
layers above IP will detect this mismatched reassembly. In the case
of UDP, the UDP checksum, which is a 16-bit checksum over the entire
packet payload, will usually not match, and UDP will discard the
packet.
.P
However, the UDP checksum is only 16 bits wide, so there is a chance of 1 in
65536 that it will match even if the packet payload is completely
random (which very often isn't the case). If the checksum does match,
silent data corruption will occur.
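.P
To illustrate why non-random payloads make a false match more likely, the
following Python sketch computes a 16-bit one's complement checksum of the
kind UDP uses. It is a simplification: the real UDP checksum also covers a
pseudo-header and the UDP header, and the function name internet_checksum
and the sample payloads are made up for illustration.
.P
.nf
```python
def internet_checksum(data):
    """16-bit one's complement of the one's complement sum of 16-bit words."""
    if len(data) % 2:
        data += bytes(1)                  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


original = b"AAAABBBBCCCCDDDD"
# The same 16-bit words in a different order: different data, same sum.
mixed = b"CCCCDDDDAAAABBBB"

print("payloads differ:", original != mixed)
print("checksums equal:", internet_checksum(original) == internet_checksum(mixed))
```
.fi
.P
Because the checksum is a plain sum of 16-bit words, any reordering of
aligned words leaves it unchanged, so both lines print True. Payloads built
from similar or repeated blocks are therefore less well protected than the
1 in 65536 figure suggests.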
.P
This risk should be taken seriously, at least on Gigabit
Ethernet.
Network speeds of 100Mbit/s should be considered less
problematic, because with most traffic patterns IP ID wrap-around
will take much longer than 30 seconds.
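.P
The difference between the two link speeds can be estimated with a short
calculation. The Python sketch below is a back-of-the-envelope model, not a
measurement: it assumes a fully saturated link, one IP ID consumed per
8192-byte datagram, and no header overhead. The function name wrap_seconds
is hypothetical.
.P
.nf
```python
IP_ID_SPACE = 2 ** 16  # number of distinct 16-bit IP ID values


def wrap_seconds(link_bits_per_second, datagram_bytes):
    """Seconds until the IP ID counter repeats on a saturated link."""
    datagrams_per_second = link_bits_per_second / 8 / datagram_bytes
    return IP_ID_SPACE / datagrams_per_second


for name, speed in (("100 Mbit/s", 100e6), ("1 Gbit/s", 1e9)):
    print(name, round(wrap_seconds(speed, 8192), 1), "seconds")
```
.fi
.P
With these assumptions the counter wraps after roughly 43 seconds at
100 Mbit/s but after roughly 4 seconds at 1 Gbit/s, well inside the default
30 second reassembly timeout. Smaller datagrams shorten both figures, and a
link that is not saturated lengthens them.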
.P
It is therefore strongly recommended to use
.BR "NFS over TCP where possible" ,
since TCP normally sizes its segments to fit the link MTU and so does not
rely on IP fragmentation.
.P
If you absolutely have to use NFS over UDP over Gigabit Ethernet,
some steps can be taken to mitigate the problem and reduce the
probability of corruption:
.TP
.I Jumbo frames:
Many Gigabit network cards are capable of transmitting
frames bigger than the 1500-byte limit of traditional Ethernet, typically
9000 bytes. Using jumbo frames of 9000 bytes will allow you to run NFS over
UDP at a page size of 8K without fragmentation. Of course, this is
only feasible if all involved stations support jumbo frames.
.IP
To enable a machine to send jumbo frames on cards that support them,
it is sufficient to configure the interface for an MTU value of 9000.
.TP
.I Lower reassembly timeout:
By lowering the reassembly timeout below the time it takes the IP ID counter
to wrap around, incorrect reassembly of fragments can be prevented
as well. To do so, simply write the new timeout value (in seconds)
to
.BR /proc/sys/net/ipv4/ipfrag_time .
.IP
A value of 2 seconds will greatly reduce the probability of IP ID clashes on
a single Gigabit link, while still allowing for a reasonable timeout
when receiving fragmented traffic from distant peers.
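.P
As a minimal sketch of the timeout adjustment described above, the following
Python helpers read and write the reassembly timeout. The function names are
hypothetical; the path parameter exists only so the helpers can be tried
against an ordinary file. Writing to the real file requires root privileges,
and the change lasts only until the next reboot.
.P
.nf
```python
IPFRAG_TIME = "/proc/sys/net/ipv4/ipfrag_time"


def get_reassembly_timeout(path=IPFRAG_TIME):
    """Return the current reassembly timeout in seconds."""
    with open(path) as f:
        return int(f.read().strip())


def set_reassembly_timeout(seconds, path=IPFRAG_TIME):
    """Write a new reassembly timeout, given in whole seconds."""
    with open(path, "w") as f:
        f.write(str(int(seconds)))


# Example (as root): set_reassembly_timeout(2)
```
.fi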
.SH "DATA AND METADATA COHERENCE"
Some modern cluster file systems provide
perfect cache coherence among their clients.