How both TCP and Ethernet checksums fail
At Twitter, a team had an unusual failure where corrupt data ended up in memcache. The root cause appears to have been a switch that was corrupting packets. Most packets were being dropped and the throughput was much lower than normal, but some were still making it through. The hypothesis is that occasionally the corrupt packets had valid TCP and Ethernet checksums. One "lucky" packet stored corrupt data in memcache. Even after the switch was replaced, the errors continued until the cache was cleared.
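To see why a corrupt packet can still carry a valid TCP checksum, here is a minimal sketch of the 16-bit ones' complement sum that TCP uses (RFC 1071 style). It is only the summing step: the real TCP checksum also covers a pseudo-header, and the byte strings and the word swap are made up for illustration, not taken from the Twitter incident. Because the sum is order-independent, swapping two aligned 16-bit words leaves the checksum unchanged.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Hypothetical payloads: the second has its first two 16-bit words swapped.
original = b"AABBCCDD"
corrupted = b"BBAACCDD"

same_checksum = internet_checksum(original) == internet_checksum(corrupted)
print(original != corrupted, same_checksum)
```

The payloads differ, yet the checksums match, so a receiver relying on this check alone would accept the corrupted bytes.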

Yet another occurrence of this bug. When it happens, it tends to _really_ screw things up, because it's so rare -- we had monitoring for this at Amazon, and when it occurred, it was overwhelmingly due to host-level kernel/libc/RAM issues rather than anything in the network. Amazon design principles were to add app-level checksumming throughout, which of course catches the lot.
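A rough sketch of what app-level checksumming for cached values could look like, assuming a CRC32 prefix on each value. The `wrap`/`unwrap` helpers and the framing are hypothetical, not Amazon's or Twitter's actual scheme. The idea is to treat a checksum mismatch as a cache miss, so a corrupted entry gets refetched instead of served.

```python
import struct
import zlib
from typing import Optional

_HEADER = struct.Struct("!I")  # 4-byte big-endian CRC32 prefix (assumed framing)


def wrap(value: bytes) -> bytes:
    """Prefix the value with its CRC32 before writing it to the cache."""
    return _HEADER.pack(zlib.crc32(value) & 0xFFFFFFFF) + value


def unwrap(blob: bytes) -> Optional[bytes]:
    """Return the value if its CRC32 matches, else None (treat as a miss)."""
    if len(blob) < _HEADER.size:
        return None
    (expected,) = _HEADER.unpack_from(blob)
    value = blob[_HEADER.size:]
    if zlib.crc32(value) & 0xFFFFFFFF != expected:
        return None
    return value


stored = wrap(b"hello")

# Simulate corruption somewhere between writer and reader: flip one bit.
damaged = bytearray(stored)
damaged[-1] ^= 0x01

print(unwrap(stored), unwrap(bytes(damaged)))
```

Since the checksum is computed by the writer and verified by the reader, it covers the whole path (NIC, switch, kernel, RAM, the cache itself), not just one network hop.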
networking  tcp  ip  twitter  ethernet  checksums  packets  memcached 
october 2015 by jm
