jm + keepalive

Deadlines, lies and videotape: The tale of a gRPC bug
HostedGraphite adopted gRPC as an internal inter-service protocol and ran into a basic protocol gotcha: gRPC does not enable an application-level keepalive on the TCP channel by default, so a call can block indefinitely if the sending side's buffers fill up. Always use application-level keepalives, and don't trust TCP.
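The fix is to switch on gRPC's own HTTP/2-level keepalive explicitly; a minimal grpc-go client sketch, assuming Go and illustrative 30s/10s values (neither the language nor the numbers come from the article):

    package main

    import (
        "time"

        "google.golang.org/grpc"
        "google.golang.org/grpc/keepalive"
    )

    // dialWithKeepalive opens a gRPC client connection that sends an HTTP/2
    // PING after 30s of inactivity and fails the connection if no ack arrives
    // within 10s, instead of blocking forever on a dead TCP channel.
    func dialWithKeepalive(addr string) (*grpc.ClientConn, error) {
        return grpc.Dial(addr,
            grpc.WithInsecure(), // plaintext for brevity; use credentials in real code
            grpc.WithKeepaliveParams(keepalive.ClientParameters{
                Time:                30 * time.Second, // ping after 30s idle
                Timeout:             10 * time.Second, // give up if no ack within 10s
                PermitWithoutStream: true,             // ping even with no active RPCs
            }),
        )
    }

The server side has matching keepalive.ServerParameters plus an enforcement policy, and the other gRPC language bindings expose equivalent channel arguments.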
tcp  protocols  keepalive  grpc  rpc  architecture  networking 
4 weeks ago by jm
good example of Application-Level Keepalive beating SO_KEEPALIVE
We now have about 100 salt-minions installed in remote areas with 3G and satellite connections.

We lose connectivity with all of those minions within about 1-2 days of installation, with test.ping reporting "minion did not return". Each time, the minions saw an ESTABLISHED TCP connection, while on the salt-master there were no connections listed at all. (Yes, that is correct.) Tighter keepalive settings were tried with no result. (OS is Linux.) Each time, restarting the salt-minion fixes the problem immediately.

Obviously the connections are transparently proxied somewhere (who knows what happens on those SAT networks), so the whole TCP keepalive mechanism of 0mq fails.
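This is exactly the failure an application-level heartbeat catches: the probe has to travel end to end and be answered by the peer, not by whatever middlebox is keeping the local socket in ESTABLISHED. A generic Go sketch of the pattern (not salt's or 0mq's actual mechanism; it assumes a dedicated connection whose peer echoes each ping byte):

    package heartbeat

    import (
        "net"
        "time"
    )

    // Run pings the peer over conn every interval and closes the connection
    // if no echo comes back within timeout, forcing an end-to-end liveness
    // check that a transparent proxy cannot answer on the peer's behalf.
    func Run(conn net.Conn, interval, timeout time.Duration) {
        buf := make([]byte, 1)
        for {
            time.Sleep(interval)
            conn.SetDeadline(time.Now().Add(timeout))
            if _, err := conn.Write([]byte{'?'}); err != nil {
                conn.Close()
                return
            }
            if _, err := conn.Read(buf); err != nil {
                // No echo: the path is dead even though the socket still
                // shows ESTABLISHED locally.
                conn.Close()
                return
            }
        }
    }

Real protocols multiplex the ping with normal traffic rather than using a dedicated connection; gRPC does it with HTTP/2 PING frames, and later ZeroMQ releases added their own heartbeat socket options (ZMQ_HEARTBEAT_IVL and friends).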


Also noted in the thread: the default idle TCP timeout for the Azure Load Balancer is 4 minutes ( https://azure.microsoft.com/en-us/blog/new-configurable-idle-timeout-for-azure-load-balancer/ ). The default Linux TCP keepalive doesn't send its first probe until 2 hours after the connection was last used, and that default is a system-wide sysctl (/proc/sys/net/ipv4/tcp_keepalive_time).
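The 2-hour default can at least be overridden per connection without touching the sysctl; a minimal Go sketch (the address is a placeholder), though the surrounding notes are exactly why even a tightened TCP keepalive shouldn't be the only liveness check:

    package main

    import (
        "log"
        "net"
        "time"
    )

    func main() {
        // Ask the kernel to send keepalive probes after 30 seconds of idle
        // time on this connection, instead of the system-wide 2-hour default.
        d := net.Dialer{KeepAlive: 30 * time.Second}
        conn, err := d.Dial("tcp", "example.com:80")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        // ... use conn; the kernel handles the probes from here ...
    }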

Further, http://networkengineering.stackexchange.com/questions/7207/why-bgp-implements-its-own-keepalive-instead-of-using-tcp-keepalive notes "some firewalls filter TCP keepalives".
tcp  keep-alive  keepalive  protocol  timeouts  zeromq  salt  firewalls  nat 
april 2016 by jm
