jm + kernel   14

Ubuntu on AWS gets serious performance boost with AWS-tuned kernel
interesting -- faster boots, CPU throttling resolved on t2.micros, other nice stuff
aws  ubuntu  ec2  kernel  linux  ops 
17 days ago by jm
Linux kernel bug delivers corrupt TCP/IP data to Mesos, Kubernetes, Docker containers — Vijay Pandurangan
Bug in the "veth" driver skips TCP checksums. Reminder: app-level checksums are important
checksums  tcp  veth  ethernet  drivers  linux  kernel  bugs  docker 
april 2016 by jm
The revenge of the listening sockets
More adventures in debugging the Linux kernel:
You can't have a very large number of bound TCP sockets and we learned that the hard way. We learned a bit about the Linux networking stack: the fact that LHTABLE is fixed size and is hashed by destination port only. Once again we showed a couple of powerful of System Tap scripts.
ops  linux  networking  tcp  network  lhtable  kernel 
april 2016 by jm
The Nyquist theorem and limitations of sampling profilers today, with glimpses of tracing tools from the future
Awesome post from Dan Luu with data from Google:
The cause [of some mystery widespread 250ms hangs] was kernel throttling of the CPU for processes that went beyond their usage quota. To enforce the quota, the kernel puts all of the relevant threads to sleep until the next multiple of a quarter second. When the quarter-second hand of the clock rolls around, it wakes up all the threads, and if those threads are still using too much CPU, the threads get put back to sleep for another quarter second. The phase change out of this mode happens when, by happenstance, there aren’t too many requests in a quarter second interval and the kernel stops throttling the threads. After finding the cause, an engineer found that this was happening on 25% of disk servers at Google, for an average of half an hour a day, with periods of high latency as long as 23 hours. This had been happening for three years. Dick Sites says that fixing this bug paid for his salary for a decade. This is another bug where traditional sampling profilers would have had a hard time. The key insight was that the slowdowns were correlated and machine wide, which isn’t something you can see in a profile.
debugging  performance  visualization  instrumentation  metrics  dan-luu  latency  google  dick-sites  linux  scheduler  throttling  kernel  hangs 
february 2016 by jm
Linux futex_wait() bug
major bug in kernel versions 3.14 - 3.18 on Haswell hardware
haswell  linux  futex_wait  futexes  kernel  bugs  hang 
may 2015 by jm
The Discovery of Apache ZooKeeper's Poison Packet - PagerDuty
Excellent deep dive into a production issue. Root causes: crappy error handling code in Zookeeper; lack of bounds checking in ZK; and a nasty kernel bug.
zookeeper  bugs  error-handling  bounds-checking  oom  poison-packets  pagerduty  packets  tcpdump  xen  aes  linux  kernel 
may 2015 by jm
BPF - the forgotten bytecode
'In essence Tcpdump asks the kernel to execute a BPF program within the kernel context. This might sound risky, but actually isn't. Before executing the BPF bytecode kernel ensures that it's safe:

* All the jumps are only forward, which guarantees that there aren't any loops in the BPF program. Therefore it must terminate.
* All instructions, especially memory reads are valid and within range.
* The single BPF program has less than 4096 instructions.

All this guarantees that the BPF programs executed within kernel context will run fast and will never infinitely loop. That means the BPF programs are not Turing complete, but in practice they are expressive enough for the job and deal with packet filtering very well.'

Good example of a carefully-designed DSL allowing safe "programs" to be written and executed in a privileged context without security risk, or risk of running out of control.
coding  dsl  security  via:oisin  linux  tcpdump  bpf  bsd  kernel  turing-complete  configuration  languages 
may 2014 by jm
Response to "Optimizing Linux Memory Management..."
A follow up to the LinkedIn VM-tuning blog post at http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases --
Do not read in to this article too much, especially for trying to understand how the Linux VM or the kernel works.  The authors misread the "global spinlock on the zone" source code and the interpretation in the article is dead wrong.
linux  tuning  vm  kernel  linkedin  memory  numa 
october 2013 by jm
Rusty's API Design Manifesto
This classic came up in discussions yesterday...

In the Linux Kernel community Rusty Russell came up with a API rating scheme to help us determine if our API is sensible, or not.  It's a rating from -10 to 10, where 10 is perfect is -10 is hell. Unfortunately there are too many examples at the wrong end of the scale.
rusty-russell  quality  coding  kernel  linux  apis  design  code-reviews  code 
may 2013 by jm
Microsoft's ill-chosen magic constants
'Paolo Bonzini noticed something a little awkward in the Linux kernel support code for Microsoft's HyperV virtualisation environment - specifically, that the magic constant passed through to the hypervisor was "0xB16B00B5", or, in English, "BIG BOOBS". It turns out that this isn't an exception - when the code was originally submitted it also contained "0x0B00B135".' me, I prefer my magic constants less offensive and more Subgenius-oriented: "0xB0BD0BB5"
constants  via:kevin-lyda  oh-dear  microsoft  fail  magic-numbers  boobs  linux  kernel 
july 2012 by jm
Comparing genomes to computer operating systems in terms of the topology and evolution of their regulatory control networks — PNAS
'we present a comparison between the transcriptional regulatory network of a well-studied bacterium (E. coli) and the call graph of a canonical OS (Linux) in terms of topology and evolution. ... both networks have a fundamentally hierarchical layout, but there is a key difference: The transcriptional regulatory network possesses a few global regulators at the top and many targets at the bottom; conversely, the call graph has many regulators controlling a small set of generic functions. This top-heavy organization leads to highly overlapping functional modules in the call graph, in contrast to the relatively independent modules in the regulatory network. ... These findings stem from the design principles of the two systems: robustness for biological systems and cost effectiveness (reuse) for software systems.' (via adulau)
via:adulau  papers  toread  genetics  genome  call-graph  linux  kernel  e-coli  operating-systems  transcriptional-regulatory-network  from delicious
may 2010 by jm
KS2009: How Google uses Linux [LWN.net]
Google resync to the latest kernel every 17 months or so -- not bad, actually
google  linux  kernel  open-source  gpl  free-software  from delicious
october 2009 by jm

Copy this bookmark:



description:


tags: