jm + tcp (70 bookmarks)

HN thread on the new Network Load Balancer AWS product
looks like @colmmacc works on it. Lots and lots of good details here
nlb  aws  load-balancing  ops  architecture  lbs  tcp  ip 
11 days ago by jm
Spammergate: The Fall of an Empire
Featuring this interesting reactive-block evasion tactic:
In that screenshot, an RCM co-conspirator describes a technique in which the spammer seeks to open as many connections as possible between themselves and a Gmail server. This is done by purposefully configuring your own machine to send response packets extremely slowly, and in a fragmented manner, while constantly requesting more connections.
Then, when the Gmail server is almost ready to give up and drop all connections, the spammer suddenly sends as many emails as possible through the pile of connection tunnels. The receiving side is then overwhelmed with data and will quickly block the sender, but not before processing a large load of emails.


(via Tony Finch)
via:fanf  spam  antispam  gmail  blocklists  packets  tcp  networking 
march 2017 by jm
Service discovery at Stripe
Writeup of their Consul-based service discovery system, a bit similar to smartstack. Good description of the production problems that they saw with Consul too, and also they figured out that strong consistency isn't actually what you want in a service discovery system ;)

HN comments are good too: https://news.ycombinator.com/item?id=12840803
consul  api  microservices  service-discovery  dns  load-balancing  l7  tcp  distcomp  smartstack  stripe  cap-theorem  scalability 
november 2016 by jm
[net-next,14/14] tcp_bbr: add BBR congestion control
This commit implements a new TCP congestion control algorithm: BBR
(Bottleneck Bandwidth and RTT). A detailed description of BBR will be
published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
"BBR: Congestion-Based Congestion Control".

BBR has significantly increased throughput and reduced latency for
connections on Google's internal backbone networks and google.com and
YouTube Web servers.

BBR requires only changes on the sender side, not in the network or
the receiver side. Thus it can be incrementally deployed on today's
Internet, or in datacenters. [....]

Signed-off-by: Van Jacobson <vanj@google.com>
google  linux  tcp  ip  congestion-control  bufferbloat  patches  algorithms  rtt  bandwidth  youtube  via:bradfitz 
september 2016 by jm
Push notifications delayed, Heartbeat Interval not reliable - Google Product Forums
Good thread on GCM notifications and their interactions with NAT -- they are delivered over a single TCP connection to port 5228 on the Google servers, kept alive, and NAT timeouts can hang the connection, resulting in delayed notifications.

Particularly useful is the *#*#426#*#* dial code, which displays a log screen on Android devices with GCM debugging info.
android  gcm  google  push-notifications  nat  tcp 
june 2016 by jm
Linux kernel bug delivers corrupt TCP/IP data to Mesos, Kubernetes, Docker containers — Vijay Pandurangan
Bug in the "veth" driver skips TCP checksums. Reminder: app-level checksums are important
checksums  tcp  veth  ethernet  drivers  linux  kernel  bugs  docker 
april 2016 by jm
Detecting the use of "curl | bash" server side
tl;dr:
The better solution is never to pipe untrusted data streams into bash. If you still want to run untrusted bash scripts, a better approach is to pipe the contents of the URL into a file, review the contents on disk, and only then execute it.
bash  security  shell  unix  curl  tcp  buffers 
april 2016 by jm
good example of Application-Level Keepalive beating SO_KEEPALIVE
we have now about 100 salt-minions which are installed in remote areas with 3G and satellite connections.

We lose connectivity with all of those minions in about 1-2 days after installation, with test.ping reporting "minion did not return". Each time, the state was that the minions saw an ESTABLISHED TCP connection, while on the salt-master there were no connections listed at all. (Yes, that is correct.) Tighter keepalive settings were tried with no result. (OS is Linux.) Each time, restarting the salt-minion fixes the problem immediately.

Obviously the connections are transparently proxied someplace, (who knows what happens with those SAT networks) so the whole tcp-keepalive mechanism of 0mq fails.


Also notes in the thread that the default TCP timeout for Azure Load Balancer is 4 minutes: https://azure.microsoft.com/en-us/blog/new-configurable-idle-timeout-for-azure-load-balancer/ . The default Linux TCP keepalive doesn't send until 2 hours after last connection use, and it's a system-wide sysctl (/proc/sys/net/ipv4/tcp_keepalive_time).

Further, http://networkengineering.stackexchange.com/questions/7207/why-bgp-implements-its-own-keepalive-instead-of-using-tcp-keepalive notes "some firewalls filter TCP keepalives".
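For reference, per-socket TCP keepalive can be tightened without touching the system-wide sysctl, though as the thread shows, an application-level heartbeat is what actually survives badly-behaved middleboxes. A rough sketch (Linux-only socket options; the PING/PONG wire format is made up for illustration):

```python
import socket

def tightened_keepalive_socket(host, port):
    s = socket.create_connection((host, port))
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific knobs: idle 60s before probing, probe every 10s,
    # give up after 5 unanswered probes -- per socket, not system-wide.
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
    return s

def app_level_ping(sock, timeout=5.0):
    # An application-level heartbeat is still needed when middleboxes or
    # transparent proxies swallow TCP keepalives: send a tiny request and
    # require a reply within a deadline. (Hypothetical PING/PONG protocol.)
    sock.settimeout(timeout)
    sock.sendall(b"PING\n")
    return sock.recv(16) == b"PONG\n"
```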
tcp  keep-alive  keepalive  protocol  timeouts  zeromq  salt  firewalls  nat 
april 2016 by jm
The revenge of the listening sockets
More adventures in debugging the Linux kernel:
You can't have a very large number of bound TCP sockets and we learned that the hard way. We learned a bit about the Linux networking stack: the fact that LHTABLE is fixed size and is hashed by destination port only. Once again we showed a couple of powerful System Tap scripts.
ops  linux  networking  tcp  network  lhtable  kernel 
april 2016 by jm
Observability at Twitter: technical overview, part II
Interesting to me mainly for this tidbit, which matches my own prejudices:
“Pull” vs “push” in metrics collection: At the time of our previous blog post, all our metrics were collected by “pulling” from our collection agents. We discovered two main issues:

* There is no easy way to differentiate service failures from collection agent failures. Service response timeouts and missed collection requests both manifest as empty time series.
* There is a lack of service quality insulation in our collection pipeline. It is very difficult to set an optimal collection timeout for various services. A long collection time from one single service can cause a delay for other services that share the same collection agent.

In light of these issues, we switched our collection model from “pull” to “push” and increased our service isolation. Our collection agent on each host only collects metrics from services running on that specific host. Additionally, each collection agent sends separate collection status tracking metrics in addition to the metrics emitted by the services.

We have seen a significant improvement in collection reliability with these changes. However, as we moved to a self-service push model, it became harder to project request growth. To solve this problem, we plan to implement service quotas to address unpredictable/unbounded growth.
pull  push  metrics  tcp  stacks  monitoring  agents  twitter  fault-tolerance 
march 2016 by jm
Topics in High-Performance Messaging
'We have worked together in the field of high-performance messaging for many years, and in that time, have seen some messaging systems that worked well and some that didn't. Successful deployment of a messaging system requires background information that is not easily available; most of what we know, we had to learn in the school of hard knocks. To save others a knock or two, we have collected here the essential background information and commentary on some of the issues involved in successful deployments. This information is organized as a series of topics around which there seems to be confusion or uncertainty. Please contact us if you have questions or comments.'
messaging  scalability  scaling  performance  udp  tcp  protocols  multicast  latency 
december 2015 by jm
How both TCP and Ethernet checksums fail
At Twitter, a team had an unusual failure where corrupt data ended up in memcache. The root cause appears to have been a switch that was corrupting packets. Most packets were being dropped and the throughput was much lower than normal, but some were still making it through. The hypothesis is that occasionally the corrupt packets had valid TCP and Ethernet checksums. One "lucky" packet stored corrupt data in memcache. Even after the switch was replaced, the errors continued until the cache was cleared.


Yet another occurrence of this bug. When it happens, it tends to _really_ screw things up, because it's so rare -- we had monitoring for this in Amazon, and when it occurred, it was overwhelmingly due to host-level kernel/libc/RAM issues rather than stuff in the network. Amazon design principles were to add app-level checksumming throughout, which of course catches the lot.
networking  tcp  ip  twitter  ethernet  checksums  packets  memcached 
october 2015 by jm
muxy
a proxy that mucks with your system and application context, operating at Layers 4 and 7, allowing you to simulate common failure scenarios from the perspective of an application under test; such as an API or a web application. If you are building a distributed system, Muxy can help you test your resilience and fault tolerance patterns.
proxy  distributed  testing  web  http  fault-tolerance  failure  injection  tcp  delay  resilience  error-handling 
september 2015 by jm
toxy
toxy is a fully programmatic and hackable HTTP proxy to simulate server failure scenarios and unexpected network conditions. It was mainly designed for fuzzing/evil testing purposes, when toxy becomes particularly useful to cover fault tolerance and resiliency capabilities of a system, especially in service-oriented architectures, where toxy may act as intermediate proxy among services.

toxy allows you to plug in poisons, optionally filtered by rules, which essentially can intercept and alter the HTTP flow as you need, performing multiple evil actions in the middle of that process, such as limiting the bandwidth, delaying TCP packets, injecting network jitter latency or replying with a custom error or status code.
toxy  proxies  proxy  http  mitm  node.js  soa  network  failures  latency  slowdown  jitter  bandwidth  tcp 
august 2015 by jm
Benchmarking GitHub Enterprise - GitHub Engineering
Walkthrough of debugging connection timeouts in a load test. Nice graphs (using matplotlib)
github  listen-backlog  tcp  debugging  timeouts  load-testing  benchmarking  testing  ops  linux 
july 2015 by jm
Apple now biases towards IPv6 with a 25ms delay on connections
Interestingly, they claim that IPv6 tends to be more reliable and has lower latency now:
Based on our testing, this makes our Happy Eyeballs implementation go from roughly 50/50 IPv4/IPv6 in iOS 8 and Yosemite to ~99% IPv6 in iOS 9 and El Capitan betas. While our previous implementation from four years ago was designed to select the connection with lowest latency no matter what, we agree that the Internet has changed since then and reports indicate that biasing towards IPv6 is now beneficial for our customers: IPv6 is now mainstream instead of being an exception, there are less broken IPv6 tunnels, IPv4 carrier-grade NATs are increasing in numbers, and throughput may even be better on average over IPv6.
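Not Apple's implementation, obviously, but the idea is easy to sketch: start the IPv6 attempt first, and only kick off IPv4 if IPv6 hasn't connected within the 25ms head start, then take whichever wins. A rough Python sketch:

```python
import queue
import socket
import threading

def happy_eyeballs_connect(host, port, v6_head_start=0.025, timeout=5.0):
    """Race IPv6 against IPv4, giving IPv6 a 25 ms head start (rough sketch)."""
    results = queue.Queue()

    def attempt(family):
        try:
            af, st, proto, _, addr = socket.getaddrinfo(
                host, port, family, socket.SOCK_STREAM)[0]
            s = socket.socket(af, st, proto)
            s.settimeout(timeout)
            s.connect(addr)
            results.put(s)
        except OSError:
            results.put(None)

    threading.Thread(target=attempt, args=(socket.AF_INET6,), daemon=True).start()
    outstanding = 1
    try:
        # Did IPv6 win (or fail) inside its head start?
        first = results.get(timeout=v6_head_start)
        outstanding -= 1
        if first is not None:
            return first
    except queue.Empty:
        pass

    threading.Thread(target=attempt, args=(socket.AF_INET,), daemon=True).start()
    outstanding += 1
    while outstanding:
        s = results.get(timeout=timeout)
        outstanding -= 1
        if s is not None:
            return s          # note: the slower attempt, if any, is leaked here
    raise OSError("connection failed over both IPv6 and IPv4")
```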
apple  ipv6  ip  tcp  networking  internet  happy-eyeballs  ios  osx 
july 2015 by jm
Improving testing by using real traffic from production
Gor, a very nice-looking tool to log and replay HTTP traffic, specifically designed to "tee" live traffic from production to staging for pre-release testing
gor  performance  testing  http  tcp  packet-capture  tests  staging  tee 
june 2015 by jm
David P. Reed on the history of UDP
'UDP was actually “designed” in 30 minutes on a blackboard when we decided to pull the original TCP protocol apart into TCP and IP, and created UDP on top of IP as an alternative for multiplexing and demultiplexing IP datagrams inside a host among the various host processes or tasks. But it was a placeholder that enabled all the non-virtual-circuit protocols since then to be invented, including encapsulation, RTP, DNS, …, without having to negotiate for permission either to define a new protocol or to extend TCP by adding “features”.'
udp  ip  tcp  networking  internet  dpr  history  protocols 
april 2015 by jm
Yelp Product & Engineering Blog | True Zero Downtime HAProxy Reloads
Using tc and qdisc to delay SYNs while haproxy restarts. Definitely feels like on-host NAT between 2 haproxy processes would be cleaner and easier though!
linux  networking  hacks  yelp  haproxy  uptime  reliability  tcp  tc  qdisc  ops 
april 2015 by jm
tcpcopy
"tees" all TCP traffic from one server to another. "widely used by companies in China"!
testing  benchmarking  performance  tcp  ip  tcpcopy  tee  china  regression-testing  stress-testing  ops 
march 2015 by jm
demonstration of the importance of server-side request timeouts
from MongoDB, but similar issues often apply in many other TCP/HTTP-based systems
tcp  http  requests  timeout  mongodb  reliability  safety 
march 2015 by jm
Vaurien, the Chaos TCP Proxy — Vaurien 1.8 documentation
Vaurien is basically a Chaos Monkey for your TCP connections. Vaurien acts as a proxy between your application and any backend. You can use it in your functional tests or even on a real deployment through the command-line.

Vaurien is a TCP proxy that simply reads data sent to it and passes it to a backend, and vice-versa. It has built-in protocols: TCP, HTTP, Redis & Memcache. The TCP protocol is the default one and just sucks data on both sides and passes it along.

Having higher-level protocols is mandatory in some cases, when Vaurien needs to read a specific amount of data in the sockets, or when you need to be aware of the kind of response you’re waiting for, and so on.

Vaurien also has behaviors. A behavior is a class that’s going to be invoked every time Vaurien proxies a request. That’s how you can impact the behavior of the proxy. For instance, adding a delay or degrading the response can be implemented in a behavior.

Both protocols and behaviors are plugins, allowing you to extend Vaurien by adding new ones.

Last (but not least), Vaurien provides a couple of APIs you can use to change the behavior of the proxy live. That’s handy when you are doing functional tests against your server: you can for instance start to add big delays and see how your web application reacts.
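To get a feel for what the default TCP protocol plus a delay behavior boils down to, here's a minimal delay-injecting TCP proxy. This is a generic sketch of the idea, not Vaurien's plugin API:

```python
import socket
import threading
import time

DELAY = 0.2  # seconds of artificial latency injected in each direction

def pump(src, dst, delay):
    # Read from one side, sleep to simulate a slow link, write to the other.
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            time.sleep(delay)
            dst.sendall(data)
    except OSError:
        pass
    finally:
        dst.close()

def serve(listen_port, backend_host, backend_port):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("127.0.0.1", listen_port))
    server.listen(128)
    while True:
        client, _ = server.accept()
        backend = socket.create_connection((backend_host, backend_port))
        threading.Thread(target=pump, args=(client, backend, DELAY), daemon=True).start()
        threading.Thread(target=pump, args=(backend, client, DELAY), daemon=True).start()

if __name__ == "__main__":
    serve(8888, "localhost", 6379)   # e.g. sit in front of a local Redis
```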
proxy  tcp  vaurien  chaos-monkey  testing  functional-testing  failures  sockets  redis  memcache  http 
february 2015 by jm
How TCP backlog works in Linux
good description of the process
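tl;dr in socket terms: listen()'s backlog argument only sizes the accept queue and is silently capped by net.core.somaxconn, while the SYN queue is sized separately by tcp_max_syn_backlog. A quick Linux-only sketch for checking the effective limits:

```python
import socket

def effective_backlog(requested):
    # listen(backlog) is silently capped at net.core.somaxconn on Linux.
    with open("/proc/sys/net/core/somaxconn") as f:
        somaxconn = int(f.read())
    with open("/proc/sys/net/ipv4/tcp_max_syn_backlog") as f:
        syn_backlog = int(f.read())
    print(f"requested accept-queue backlog:  {requested}")
    print(f"net.core.somaxconn cap:          {somaxconn}")
    print(f"tcp_max_syn_backlog (SYN queue): {syn_backlog}")
    return min(requested, somaxconn)

s = socket.socket()
s.bind(("127.0.0.1", 0))
s.listen(effective_backlog(1024))
```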
ip  linux  tcp  networking  backlog  ops 
january 2015 by jm
Why we don't use a CDN: A story about SPDY and SSL
All of our assets loaded via the CDN [to our client in Australia] in just under 5 seconds. It only took ~2.7s to get those same assets to our friends down under with SPDY. The performance with no CDN blew the CDN performance out of the water. It is just no comparison. In our case, it really seems that the advantages of SPDY greatly outweigh that of a CDN when it comes to speed.
cdn  spdy  nginx  performance  web  ssl  tls  optimization  multiplexing  tcp  ops 
january 2015 by jm
TCP incast
a catastrophic TCP throughput collapse that occurs as the number of storage servers sending data to a client increases past the ability of an Ethernet switch to buffer packets. In a clustered file system, for example, a client application requests a data block striped across several storage servers, issuing the next data block request only when all servers have responded with their portion (Figure 1). This synchronized request workload can result in packets overfilling the buffers on the client's port on the switch, resulting in many losses. Under severe packet loss, TCP can experience a timeout that lasts a minimum of 200ms, determined by the TCP minimum retransmission timeout (RTOmin).
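A common application-level mitigation (my sketch, not from the article) is to cap the fan-out of simultaneous block requests and jitter their start times, so the responses don't all land on the client's switch port in one synchronized burst:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_block(server, block_id):
    # placeholder for the real "read my stripe from this storage server" call
    time.sleep(0.01)
    return server, block_id

def fetch_striped_block(servers, block_id, max_in_flight=4, jitter_ms=5):
    # Limiting concurrency and adding jitter spreads the responses out in
    # time, reducing the synchronized burst that overflows the switch buffer.
    results = []
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = []
        for server in servers:
            time.sleep(random.uniform(0, jitter_ms / 1000.0))
            futures.append(pool.submit(fetch_block, server, block_id))
        for f in futures:
            results.append(f.result())
    return results

print(fetch_striped_block(["s%d" % i for i in range(12)], block_id=42))
```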
incast  networking  performance  tcp  bandwidth  buffering  switch  ethernet  capacity 
november 2014 by jm
Axel
A nice curl/wget replacement which supports multi-TCP-connection downloads of HTTP/FTP resources. Packaged for most Linux variants, and for OS X via brew.
axel  curl  wget  via:johnke  downloading  tcp  http  ftp  ubuntu  debian  unix  linux 
september 2014 by jm
Chris Baus: TCP_CORK: More than you ever wanted to know
Even with buffered streams the application must be able to instruct the OS to forward all pending data when the stream has been flushed for optimal performance. The application does not know where packet boundaries reside, hence buffer flushes might not align on packet boundaries. TCP_CORK can pack data more effectively, because it has direct access to the TCP/IP layer. [..]

If you do use an application buffering and streaming mechanism (as does Apache), I highly recommend applying the TCP_NODELAY socket option which disables Nagle's algorithm. All calls to write() will then result in immediate transfer of data.
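In socket terms, the two options under discussion look like this (Linux; TCP_CORK is Linux-specific):

```python
import socket

def send_with_nodelay(sock, data):
    # Disable Nagle's algorithm: each write is transmitted immediately instead
    # of waiting for outstanding ACKs to coalesce small segments.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.sendall(data)

def send_with_cork(sock, header, body):
    # TCP_CORK (Linux-only): hold partial segments until uncorked (or ~200ms
    # passes), so header and body get packed into full-sized segments.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
    sock.sendall(header)
    sock.sendall(body)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)  # uncork: flush now

if __name__ == "__main__":
    s = socket.create_connection(("example.com", 80))
    send_with_cork(s, b"GET / HTTP/1.1\r\nHost: example.com\r\n", b"\r\n")
    print(s.recv(200))
```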
networking  tcp  via:nmaurer  performance  ip  tcp_cork  linux  syscalls  writev  tcp_nodelay  nagle  packets 
september 2014 by jm
'TCP And The Lower Bound of Web Performance' [pdf, slides]
John Rauser, Velocity, June 2010. Good data on real-world web perf based on the limitations which TCP and the speed of light impose
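The core arithmetic is worth redoing now and again: even at the speed of light in fibre (~200,000 km/s), a transatlantic round trip has a hard floor, and every extra TCP round trip (handshake, slow-start ramp) multiplies it. Rough numbers, assuming an idealized straight-line path:

```python
# Rough lower-bound arithmetic on RTT and page load, assuming ~200,000 km/s
# propagation in fibre (about 2/3 of c) and an idealized straight-line path.
distance_km = 5_500                      # e.g. Dublin -> New York, roughly
speed_km_per_s = 200_000
min_rtt = 2 * distance_km / speed_km_per_s
print(f"minimum RTT: {min_rtt * 1000:.0f} ms")          # ~55 ms

# A fresh HTTP fetch costs at least: TCP handshake (1 RTT) + request/response
# (1 RTT), plus more RTTs while slow start ramps up for larger responses.
round_trips = 4
print(f"lower bound for a small page: {round_trips * min_rtt * 1000:.0f} ms")
```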
tcp  speed-of-light  performance  web  optimization  john-rauser 
august 2014 by jm
Boundary's new server monitoring free offering
'High resolution, 1 second intervals for all metrics; Fluid analytics, drag any graph to any point in time; Smart alarms to cut down on false positives; Embedded graphs and customizable dashboards; Up to 10 servers for free'

Pre-registration is open now. Could be interesting, although the limit of 10 machines is pretty small for any production usage
boundary  monitoring  network  ops  metrics  alarms  tcp  ip  netstat 
july 2014 by jm
Netflix/ribbon
a client side IPC library that is battle-tested in cloud. It provides the following features:

Load balancing;
Fault tolerance;
Multiple protocol (HTTP, TCP, UDP) support in an asynchronous and reactive model;
Caching and batching.

I like the integration of Eureka and Hystrix in particular, although I would really like to read more about Eureka's approach to availability during network partitions and CAP.

https://groups.google.com/d/msg/eureka_netflix/LXKWoD14RFY/-5nElGl1OQ0J has some interesting discussion on the topic. It actually sounds like the Eureka approach is more correct than using ZK: 'Eureka is available. ZooKeeper, while tolerant against single node failures, doesn't react well to long partitioning events. For us, it's vastly more important that we maintain an available registry than a necessarily consistent registry. If us-east-1d sees 23 nodes, and us-east-1c sees 22 nodes for a little bit, that's OK with us.'

See also http://ispyker.blogspot.ie/2013/12/zookeeper-as-cloud-native-service.html which corroborates this:

I went into one of the instances and quickly did an iptables DROP on all packets coming from the other two instances. This would simulate an availability zone continuing to function, but that zone losing network connectivity to the other availability zones. What I saw was that the two other instances noticed the first server “going away”, but they continued to function as they still saw a majority (66%). More interestingly, the first instance noticed the other two servers “going away”, dropping the ensemble availability to 33%. This caused the first server to stop serving requests to clients (not only writes, but also reads). [...]

To me this seems like a concern, as network partitions should be considered an event that should be survived. In this case (with this specific configuration of zookeeper) no new clients in that availability zone would be able to register themselves with consumers within the same availability zone. Adding more zookeeper instances to the ensemble wouldn’t help considering a balanced deployment as in this case the availability would always be majority (66%) and non-majority (33%).
netflix  ribbon  availability  libraries  java  hystrix  eureka  aws  ec2  load-balancing  networking  http  tcp  architecture  clients  ipc 
july 2014 by jm
Tracedump
a single application IP packet sniffer that captures all TCP and UDP packets of a single Linux process. It consists of the following elements:

* ptrace monitor - tracks bind(), connect() and sendto() syscalls and extracts local port numbers that the traced application uses;
* pcap sniffer - using information from the previous module, it captures IP packets on an AF_PACKET socket (with an appropriate BPF filter attached);
* garbage collector - periodically reads /proc/net/{tcp,udp} files in order to detect the sockets that the application no longer uses.

As the output, tracedump generates a PCAP file with SLL-encapsulated IP packets - readable by eg. Wireshark. This file can be later used for detailed analysis of the networking operations made by the application. For instance, it might be useful for IP traffic classification systems.
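The "garbage collector" step is easy to approximate by hand; a rough sketch of parsing /proc/net/tcp (Linux-only, IPv4 table only):

```python
# Parse /proc/net/tcp and report each socket's local address, remote address
# and state. Field layout per proc(5); addresses are little-endian hex.
import socket
import struct

TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}

def parse_addr(hex_addr):
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return f"{ip}:{int(port_hex, 16)}"

with open("/proc/net/tcp") as f:
    next(f)                               # skip the header line
    for line in f:
        fields = line.split()
        local, remote, state = fields[1], fields[2], fields[3]
        print(parse_addr(local), "->", parse_addr(remote),
              TCP_STATES.get(state, state))
```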
debugging  networking  linux  strace  ptrace  tracedump  tracing  tcp  udp  sniffer  ip  tcpdump 
may 2014 by jm
Uplink Latency of WiFi and 4G Networks
It's high. Wifi in particular shows high variability and long latency tails
wifi  3g  4g  mobile  networking  internet  latency  tcp 
april 2014 by jm
Stalled SCP and Hanging TCP Connections
a Cisco fail.
It looks like there’s a firewall in the middle that’s doing additional TCP sequence randomisation which was a good thing, but has been fixed in all current operating systems. Unfortunately, it seems that firewall doesn’t understand TCP SACK, which when coupled with a small amount of packet loss and a stateful host firewall that blocks invalid packets results in TCP connections that stall randomly. A little digging revealed that firewall to be the Cisco Firewall Services Module on our Canterbury network border.


(via Tony Finch)
via:fanf  cisco  networking  firewalls  scp  tcp  hangs  sack  tcpdump 
april 2014 by jm
The little ssh that (sometimes) couldn't - Mina Naguib
A good demonstration of what it looks like when network-level packet corruption occurs on a TCP connection
ssh  sysadmin  networking  tcp  bugs  bit-flips  cosmic-rays  corruption  packet 
april 2014 by jm
Game servers: UDP vs TCP
this HN thread on the age-old UDP vs TCP question is way better than the original post -- lots of salient comments
udp  tcp  games  protocols  networking  latency  internet  gaming  hackernews 
april 2014 by jm
TCP incast vs Riak
An extremely congested local network segment causes the "TCP incast" throughput collapse problem -- packet loss occurs, and TCP throughput collapses as a side effect. So far, this is pretty unsurprising, and anyone designing a service needs to keep bandwidth requirements in mind.

However it gets worse with Riak. Due to a bug, this becomes a serious issue for all clients: the Erlang network distribution port buffers fill up in turn, and the Riak KV vnode process (in its entirety) will be descheduled and 'cannot answer any more queries until the A-to-B network link becomes uncongested.'

This is where EC2's fully-uncontended-1:1-network compute cluster instances come in handy, btw. ;)
incast  tcp  networking  bandwidth  riak  architecture  erlang  buffering  queueing 
february 2014 by jm
Packet Flight: Facebook News Feed @8X
good dataviz of a HTTP page load: 'this is a visualization of a Facebook News Feed load from the perspective of the client, over a 3G wireless connection. Different packet types have different shapes and colors.' (via John Harrington)
via:johnharrington  visualization  facebook  dataviz  networking  tcp  3g 
january 2014 by jm
Chartbeat's Lessons learned tuning TCP and Nginx in EC2
a good writeup of basic sysctl tuning for an internet-facing HTTP proxy fleet running in EC2. Nothing groundbreaking here, but it's well-written
nginx  amazon  ec2  tcp  ip  tuning  sysctl  linux  c10k  ssl  http 
january 2014 by jm
An Empirical Evaluation of TCP Performance in Online Games
In this paper, we have analyzed the performance of TCP in ShenZhou Online, a commercial, mid-sized MMORPG. Our study indicates that, though TCP is full-fledged and robust, simply transmitting game data over TCP could cause unexpected performance problems. This is due to the following distinctive characteristics of game traffic: 1) tiny packets, 2) low packet rate, 3) application-limited traffic generation, and 4) bi-directional traffic.

We have shown that because TCP was originally designed for unidirectional and network-limited bulk data transfers, it cannot adapt well to MMORPG traffic. In particular, the window-based congestion control mechanism and the fast retransmit algorithm for loss recovery are ineffective. This suggests that the selective acknowledgement option should be enabled whenever TCP is used, as it significantly enhances the loss recovery process. Furthermore, TCP is overkill, as not every game packet needs to be transmitted reliably and processed in an orderly manner. We have also shown that the degraded network performance did impact users' willingness to continue a game. Finally, a number of design guidelines have been proposed by exploiting the unique characteristics of game traffic.


via Nelson
tcp  games  udp  protocols  networking  internet  mmos  retransmit  mmorpgs 
november 2013 by jm
High Performance Browser Networking
slides from Ilya Grigorik's tutorial on the topic at O'Reilly's Velocity conference. lots of good data and tips for internet protocol optimization
slides  presentations  ilya-grigorik  performance  http  https  tcp  tutorials  networking  internet 
november 2013 by jm
Making Storm fly with Netty | Yahoo Engineering
Y! engineer doubles the speed of Storm's messaging layer by replacing the zeromq implementation with Netty
netty  async  zeromq  storm  messaging  tcp  benchmarks  yahoo  clusters 
october 2013 by jm
Apple iOS 7 surprises as first with new multipath TCP connections - Network World
iOS 7 includes -- and uses -- multipath TCP, right now for device-to-Siri communications.
MPTCP is a TCP extension that enables the simultaneous use of several IP addresses or interfaces. Existing applications – completely unmodified -- see what appears to be a standard TCP interface. But under the covers, MPTCP is spreading the connection’s data across several subflows, sending it over the least congested paths.
ios7  ios  networking  apple  mptcp  tcp  protocols  fault-tolerance 
september 2013 by jm
The ultimate SO_LINGER page, or: why is my tcp not reliable
If we look at the HTTP protocol, data there is usually sent with length information included, either at the beginning of an HTTP response, or in the course of transmitting information (so-called ‘chunked’ mode). And they do this for a reason. Only in this way can the receiving end be sure it received all information that it was sent. Using the shutdown() technique above really only tells us that the remote closed the connection. It does not actually guarantee that all data was received correctly by program B. The best advice is to send length information, and to have the remote program actively acknowledge that all data was received.
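That advice, sketched out: length-prefix every message and require an explicit application-level acknowledgement, rather than treating close()/shutdown() as proof of delivery. (The single-byte ACK framing here is made up for illustration.)

```python
import socket
import struct

def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before full message arrived")
        buf += chunk
    return buf

def send_message(sock, payload: bytes):
    # Length-prefix the payload, then block until the peer explicitly
    # acknowledges it -- close() alone never guarantees delivery.
    sock.sendall(struct.pack("!I", len(payload)) + payload)
    if sock.recv(1) != b"\x06":           # ACK byte from the application
        raise IOError("peer did not acknowledge the message")

def recv_message(sock) -> bytes:
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    payload = recv_exact(sock, length)
    sock.sendall(b"\x06")                 # tell the sender we really got it
    return payload
```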
SO_LINGER  sockets  tcp  ip  networking  linux  protocols  shutdown  FIN  RST 
august 2013 by jm
TCP is UNreliable
Great account from Cliff Click describing an interesting edge-case risk of using TCP without application-level acking, and how it caused a messy intermittent bug in production.
In all these failures the common theme is that the receiver is very heavily loaded, with many hundreds of short-lived TCP connections being opened/read/closed every second from many other machines.  The sender sends a ‘SYN’ packet, requesting a connection. The sender (optimistically) sends 1 data packet; optimistic because the receiver has yet to acknowledge the SYN packet.  The receiver, being much overloaded, is very slow.  Eventually the receiver returns a ‘SYN-ACK’ packet, acknowledging both the open and the data packet.  At this point the receiver’s JVM has not been told about the open connection; this work is all opening at the OS layer alone.  The sender, being done, sends a ‘FIN’ which it does NOT wait for acknowledgement (all data has already been acknowledged).  The receiver, being heavily overloaded, eventually times-out internally (probably waiting for the JVM to accept the open-call, and the JVM being overloaded is too slow to get around to it) – and sends a RST (reset) packet back…. wiping out the connection and the data.  The sender, however, has moved on – it already sent a FIN & closed the socket, so the RST is for a closed connection.  Net result: sender sent, but the receiver reset the connection without informing either the JVM process or the sender.
tcp  protocols  SO_LINGER  FIN  RST  connections  cliff-click  ip 
august 2013 by jm
Ivan Ristić: Defending against the BREACH attack
One interesting response to this HTTPS compression-based MITM attack:
The award for least-intrusive and entirely painless mitigation proposal goes to Paul Querna who, on the httpd-dev mailing list, proposed to use the HTTP chunked encoding to randomize response length. Chunked encoding is a HTTP feature that is typically used when the size of the response body is not known in advance; only the size of the next chunk is known. Because chunks carry some additional information, they affect the size of the response, but not the content. By forcing more chunks than necessary, for example, you can increase the length of the response. To the attacker, who can see only the size of the response body, but not anything else, the chunks are invisible. (Assuming they're not sent in individual TCP packets or TLS records, of course.) This mitigation technique is very easy to implement at the web server level, which makes it the least expensive option. There is only a question about its effectiveness. No one has done the maths yet, but most seem to agree that response length randomization slows down the attacker, but does not prevent the attack entirely. But, if the attack can be slowed down significantly, perhaps it will be as good as prevented.
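A sketch of the Querna-style mitigation: serve the same body, but split it into a random number of randomly-sized chunks so the on-the-wire response length varies between otherwise-identical responses. (Generic sketch, not the httpd patch.)

```python
import random

def chunked_with_random_lengths(body: bytes, max_extra_chunks: int = 8) -> bytes:
    # Split the body at random offsets; the content is unchanged, but the
    # variable number of chunk-size lines changes the total response length.
    chunks, rest = [], body
    for _ in range(random.randint(1, max_extra_chunks)):
        if len(rest) <= 1:
            break
        cut = random.randint(1, len(rest) - 1)
        chunks.append(rest[:cut])
        rest = rest[cut:]
    chunks.append(rest)
    wire = b""
    for c in chunks:
        wire += b"%x\r\n" % len(c) + c + b"\r\n"
    return wire + b"0\r\n\r\n"            # terminating zero-length chunk

body = b"secret-bearing response body " * 10
print(len(chunked_with_random_lengths(body)), len(chunked_with_random_lengths(body)))
```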
mitm  attacks  hacking  security  compression  http  https  protocols  tls  ssl  tcp  chunked-encoding  apache 
august 2013 by jm
Machine Learning Speeds TCP
Cool. A machine-learning-generated TCP congestion control algorithm which handily beats sfqCoDel, Vegas, Reno et al. But:
"Although the [computer-generated congestion control algorithms] appear to work well on networks whose parameters fall within or near the limits of what they were prepared for -- even beating in-network schemes at their own game and even when the design range spans an order of magnitude variation in network parameters -- we do not yet understand clearly why they work, other than the observation that they seem to optimize their intended objective well.

We have attempted to make algorithms ourselves that surpass the generated RemyCCs, without success. That suggests to us that Remy may have accomplished something substantive. But digging through the dozens of rules in a RemyCC and figuring out their purpose and function is a challenging job in reverse-engineering. RemyCCs designed for broader classes of networks will likely be even more complex, compounding the problem."

So are network engineers willing to trust an algorithm that seems to work but has no explanation as to why it works other than optimizing a specific objective function? As AI becomes increasingly successful the question could also be asked in a wider context.  


(via Bill de hOra)
via-dehora  machine-learning  tcp  networking  hmm  mit  algorithms  remycc  congestion 
july 2013 by jm
Traditional AQM is not enough!
Jim Gettys on modern web design, HTTP, buffering, and FIFO queues in the network.
Web surfing is putting impulses of packets, without congestion avoidance, into FIFO queues where they do severe collateral damage to anything sharing the link (including itself!). So today’s web behavior incurs huge collateral damage on itself, data centers, the edge of the network, and in particular any application that hopes to have real time behavior. How do we solve this problem?


tl;dr: fq_codel. Now I want it!
buffering  networking  internet  web  http  protocols  tcp  bufferbloat  jim-gettys  codel  fq_codel 
july 2013 by jm
An excellent writeup of the TCP bounded-buffer deadlock problem
on pages 146-149 of 'TCP/IP Sockets in C: Practical Guide for Programmers' by Michael J. Donahoo and Kenneth L. Calvert.
tcp  ip  bounded-buffer  deadlock  bugs  buffering  connections  distributed-systems 
july 2013 by jm
the TCP bounded buffer deadlock problem
I've wound up mentioning this twice in the past week, so it's worth digging up and bookmarking!
Under certain circumstances a TCP connection can end up in a "deadlock", where neither the client nor the server is able to write data out or read data in. This is caused by two factors. First, a client or server cannot perform two transactions at once; a read cannot be performed if a write transaction is in progress, and vice versa. Second, the buffers that exist at either end of the TCP connection are of limited size. The deadlock occurs when both the client and server are trying to send an amount of data that is larger than the combined input and output buffer size.
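The usual fix, sketched: never block in a large send() while ignoring the peer's output; interleave reads and writes (select/poll, or a separate reader thread) so the combined buffers can't fill up on both sides at once.

```python
import select
import socket

def send_and_receive(sock: socket.socket, request: bytes) -> bytes:
    """Write a large request while simultaneously draining the response,
    avoiding the deadlock where both ends block in send() on full buffers."""
    sock.setblocking(False)
    response = b""
    outgoing = memoryview(request)
    while True:
        want_write = [sock] if outgoing else []
        readable, writable, _ = select.select([sock], want_write, [], 30.0)
        if readable:
            chunk = sock.recv(65536)
            if not chunk:                  # peer closed: response is complete
                return response
            response += chunk
        if writable:
            sent = sock.send(outgoing[:65536])
            outgoing = outgoing[sent:]
            if not outgoing:
                sock.shutdown(socket.SHUT_WR)   # we're done writing
```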
tcp  ip  bounded-buffer  deadlock  bugs  buffering  connections  distributed-systems 
july 2013 by jm
packetdrill - network stack testing tool
[Google's] packetdrill scripting tool enables quick, precise tests for entire TCP/UDP/IPv4/IPv6 network stacks, from the system call layer down to the NIC hardware. packetdrill currently works on Linux, FreeBSD, OpenBSD, and NetBSD. It can test network stack behavior over physical NICs on a LAN, or on a single machine using a tun virtual network device.
testing  networking  tun  google  linux  papers  tcp  ip  udp  freebsd  openbsd  netbsd 
july 2013 by jm
SSL/TLS overhead
'The TLS handshake has multiple variations, but let’s pick the most common one – anonymous client and authenticated server (the connections browsers use most of the time).' Works out to 4 packets, in addition to the TCP handshake's 3, and about 6.5k bytes on average.
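Worth doing the arithmetic: the ~6.5k bytes are noise on any modern link; it's the extra round trips that cost. Rough numbers, assuming a full (non-resumed) handshake:

```python
# Back-of-the-envelope cost of connection setup before the first HTTP byte,
# assuming a 1-RTT TCP handshake plus a 2-RTT full TLS handshake.
rtt_ms = 50
tcp_handshake_rtts = 1
tls_handshake_rtts = 2        # full handshake; session resumption cuts this down
setup_ms = (tcp_handshake_rtts + tls_handshake_rtts) * rtt_ms
print(f"time before the HTTP request can even be sent: {setup_ms} ms")
print(f"handshake bytes (~6.5 kB) at 10 Mbit/s: {6500 * 8 / 10_000_000 * 1000:.1f} ms")
```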
network  tls  ssl  performance  latency  speed  networking  internet  security  packets  tcp  handshake 
june 2013 by jm
Building a Modern Website for Scale (QCon NY 2013) [slides]
some great scalability ideas from LinkedIn. Particularly interesting are the best practices suggested for scaling web services:

1. store client-call timeouts and SLAs in Zookeeper for each REST endpoint;
2. isolate backend calls using async/threadpools;
3. cancel work on failures;
4. avoid sending requests to GC'ing hosts;
5. rate limits on the server.

#4 is particularly cool. They do this using a "GC scout" request before every "real" request; a cheap TCP request to a dedicated "scout" Netty port, which replies near-instantly. If it comes back with a 1-packet response within 1 millisecond, send the real request, else fail over immediately to the next host in the failover set.

There's still a potential race condition where the "GC scout" check completes quickly, then a GC starts just before the "real" request is issued. But the incidence of GC-blocking-request is probably massively reduced.

It also helps against packet loss on the rack or server host, since packet loss will cause the drop of one of the TCP packets, and the TCP retransmit timeout will certainly be higher than 1ms, causing the deadline to be missed. (UDP would probably work just as well, for this reason.) However, in the case of packet loss in the client's network vicinity, it will be vital to still attempt to send the request to the final host in the failover set regardless of a GC-scout failure, otherwise all requests may be skipped.

The GC-scout system also helps balance request load off heavily-loaded hosts, or hosts with poor performance for other reasons; they'll fail to achieve their 1 msec deadline and the request will be shunted off elsewhere.

For service APIs with real low-latency requirements, this is a great idea.
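Not LinkedIn's code, but the scout-then-request pattern is simple to sketch. The scout port and wire format here are made up; in a real implementation the scout connection would presumably be kept open rather than re-established per check:

```python
import socket
import time

SCOUT_DEADLINE = 0.001    # 1 ms: a GC pause or a lossy path will blow this

def scout_ok(host, scout_port=7999):
    """Hypothetical scout check: ping the scout port, expect a 1-packet reply
    within the deadline. (A real client would keep this connection open.)"""
    try:
        with socket.create_connection((host, scout_port), timeout=SCOUT_DEADLINE) as s:
            s.settimeout(SCOUT_DEADLINE)
            start = time.monotonic()
            s.sendall(b"?")
            s.recv(1)
            return (time.monotonic() - start) <= SCOUT_DEADLINE
    except OSError:
        return False

def call_with_scout(failover_set, do_request):
    for i, host in enumerate(failover_set):
        last = (i == len(failover_set) - 1)
        # Always try the final host even if its scout misses -- otherwise
        # client-side packet loss could cause every host to be skipped.
        if last or scout_ok(host):
            return do_request(host)
```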
gc-scout  gc  java  scaling  scalability  linkedin  qcon  async  threadpools  rest  slas  timeouts  networking  distcomp  netty  tcp  udp  failover  fault-tolerance  packet-loss 
june 2013 by jm
Martin Thompson, Luke "Snabb Switch" Gorrie etc. review the C10M presentation from Schmoocon
on the mechanical-sympathy mailing list. Some really interesting discussion on handling insane quantities of TCP connections using low volumes of hardware:
This talk has some good points and I think the subject is really interesting.  I would take the suggested approach with serious caution.  For starters, the Linux kernel is nowhere near as bad as it is made out to be.  Last year I worked with a client and we scaled a single server to 1 million concurrent connections with async programming in Java and some sensible kernel tuning.  I've heard they have since taken this to over 5 million concurrent connections.

BTW Open Onload is an open source implementation.  Writing a network stack is a serious undertaking.  In a previous life I wrote a network probe and had to reassemble TCP streams and kept getting tripped up by edge cases.  It is a great exercise in data structures and lock-free programming.  If you need very high-end performance I'd talk to the Solarflare or Mellanox guys before writing my own.

There are some errors and omissions in this talk.  For example, his range of ephemeral ports is not quite right, and atomic operations are only 15 cycles on Sandy Bridge when hitting local cache.  A big issue for me is when he defined C10M he did not mention the TIME_WAIT issue with closing connections.  Creating and destroying 1 million connections per second is a major issue.  A protocol like HTTP is very broken in that the server closes the socket and therefore has to retain the TCB until the specified timeout occurs to ensure no older packet is delivered to a new socket connection.
mechanical-sympathy  hardware  scaling  c10m  tcp  http  scalability  snabb-switch  martin-thompson 
may 2013 by jm
TCP Tune
These notes are intended to help users and system administrators maximize TCP/IP performance on their computer systems. They summarize all of the end-system (computer system) network tuning issues including a tutorial on TCP tuning, easy configuration checks for non-experts, and a repository of operating system specific instructions for getting the best possible network performance on these platforms.


Some tips for maximizing HPC network performance for the intra-DC case; recommended by the LinkedIn Kafka operations page.
tuning  network  tcp  sysadmin  performance  ops  kafka  ec2 
april 2013 by jm
iOS TCP wifi optimizer
Basically, tweaking a few suboptimal sysctls to optimize for 802.11b/n; requires a jailbroken iOS device. I'm surprised that Apple defaulted segment size to 512, to be honest, and disabling delayed ACKs sounds like it might be useful (see also http://www.stuartcheshire.org/papers/NagleDelayedAck/).
TCP optimizer modifies a few settings inside iOS, including increasing the TCP receive buffer from 131072 to 292000, disabling TCP delayed ACK’s, allowing a maximum of 16 un-ACK’d packets instead of 8 and set the default packet size to 1460 instead of 512. These changes won’t only speed up your YouTube videos, they’ll also improve your internet connection’s performance overall, including Wi-Fi network connectivity.
tcp  performance  tuning  ios  apple  wifi  wireless  802.11n  sysctl  ip 
february 2013 by jm
Passively Monitoring Network Round-Trip Times - Boundary
'how Boundary uses [TCP timestamps] to calculate round-trip times (RTTs) between any two hosts by passively monitoring TCP traffic flows, i.e., without actively launching ICMP echo requests (pings). The post is primarily an overview of this one aspect of TCP monitoring, it also outlines the mechanism we are using, and demonstrates its correctness.'
tcp  boundary  monitoring  network  ip  passive-monitoring  rtt  timestamping 
february 2013 by jm
High performance network programming on the JVM, OSCON 2012
by Erik Onnen of Urban Airship. very good presentation on the current state of the art in large-scale low-latency service operation using the JVM on Linux. Lots of good details on async vs sync, HTTPS/TLS/TCP tuning, etc.
http  https  scaling  jvm  async  sync  oscon  presentations  tcp 
july 2012 by jm
Why upgrading your Linux Kernel will make your customers much happier
enabling TCP Slow Start on the HTTP server-side decreased internet round-trip page load time by 21% in this case; comments suggest an "ip route" command can also work
tcp  performance  linux  network  web  http  rtt  slow-start  via:jacob 
march 2012 by jm
BufferBloat: What's Wrong with the Internet? - ACM Queue
'A discussion with Vint Cerf, Van Jacobson, Nick Weaver, and Jim Gettys' -- the big guns! Great discussion (via Tony Finch)
via:fanf  bufferbloat  networking  buffers  buffering  performance  load  tcp  ip 
december 2011 by jm
Determining response times with tcprstat
'Tcprstat is a free, open-source TCP analysis tool that watches network traffic and computes the delay between requests and responses. From this it derives response-time statistics and prints them out.' Computes percentiles, too
tcp  tcprstat  tcp-ip  networking  measurement  statistics  performance  instrumentation  linux  unix  tools  cli 
november 2011 by jm
apenwarr/sshuttle - GitHub
'Any TCP session you initiate to one of the proxied IP addresses [specified on the command line] will be captured by sshuttle and sent over an ssh session to the remote copy of sshuttle, which will then regenerate the connection on that end, and funnel the data back and forth through ssh. Fun, right? A poor man's instant VPN, and you don't even have to have admin access on the server.'
vpn  ssh  security  linux  opensource  tcp  networking  tunnelling  port-forwarding  from delicious
january 2011 by jm
Jim Gettys and a star-studded cast explain the 'bufferbloat' problem breaking TCP/IP on modern consumer broadband
'the [large] buffers are confusing TCP’s RTT estimator; the delay caused by the buffers is many times the actual RTT on the path.' [..] 'by inserting big buffers into the network, we have violated the design presumption of all Internet congestion avoiding protocols: that the network will drop packets in a timely fashion.'  QoS traffic shaping avoids this -- hooray for Tomato firmware
jim-gettys  via:glen-gray  buffering  tcp  ip  internet  broadband  routers  from delicious
december 2010 by jm
tcpcrypt
opportunistic encryption of TCP connections. not the simplest to set up, though
cryptography  encryption  tcp  security  internet  tcpcrypt  opportunistic  from delicious
august 2010 by jm
Overclocking SSL
techie details from Adam Langley on how Google's been improving TLS/SSL, with lots of good tips. they switched in January to HTTPS for all Gmail users by default, without any additional machines or hardware
certificates  encryption  google  https  latency  speed  ssl  tcp  tls  web  performance  from delicious
july 2010 by jm
pwnat - NAT to NAT client-server communication
'a proxy server that works behind a NAT, even when the client is behind a NAT, without any 3rd party'. nifty, by Samy "MySpace worm" Kamkar
samy-kamkar  apps  firewall  ip  nat  networking  pwnat  stun  traversal  tcp  sysadmin  tunneling  udp  from delicious
march 2010 by jm
