jm + performance   206

poor man's profiler
'Sampling tools like oprofile or dtrace's profile provider don't really provide methods to see what [multithreaded] programs are blocking on - only where they spend CPU time. Though there exist advanced techniques (such as systemtap and dtrace call level probes), it is overkill to build upon that. Poor man doesn't have time. Poor man needs food.'

Basically periodically grabbing stack traces from running processes using gdb.
gdb  profiling  linux  unix  mark-callaghan  stack-traces  performance 
6 weeks ago by jm
Locking, Little's Law, and the USL
Excellent explanatory mailing list post by Martin Thompson to the mechanical-sympathy group, discussing Little's Law vs the USL:
Little's law can be used to describe a system in steady state from a queuing perspective, i.e. arrival and leaving rates are balanced. In this case it is a crude way of modelling a system with a contention percentage of 100% under Amdahl's law, in that throughput is one over latency.

However this is an inaccurate way to model a system with locks. Amdahl's law does not account for coherence costs. For example, if you wrote a microbenchmark with a single thread to measure the lock cost then it is much lower than in a multi-threaded environment where cache coherence, other OS costs such as scheduling, and lock implementations need to be considered.

Universal Scalability Law (USL) accounts for both the contention and the coherence costs.
http://www.perfdynamics.com/Manifesto/USLscalability.html

When modelling locks it is necessary to consider how contention and coherence costs vary given how they can be implemented. Consider in Java how we have biased locking, thin locks, fat locks, inflation, and revoking biases which can cause safe points that bring all threads in the JVM to a stop with a significant coherence component.
usl  scaling  scalability  performance  locking  locks  java  jvm  amdahls-law  littles-law  system-dynamics  modelling  systems  caching  threads  schedulers  contention 
8 weeks ago by jm
How to Optimize Garbage Collection in Go
In this post, we’ll share a few powerful optimizations that mitigate many of the performance problems common to Go’s garbage collection (we will cover “fun with deadlocks” in a follow-up). In particular, we’ll share how embedding structs, using sync.Pool, and reusing backing arrays can minimize memory allocations and reduce garbage collection overhead.
garbage  performance  gc  golang  go  coding 
10 weeks ago by jm
AdoptOpenJDK/jitwatch
Log analyser and visualiser for the HotSpot JIT compiler. Inspect inlining decisions, hot methods, bytecode, and assembly. View results in the JavaFX user interface.
analysis  java  jvm  performance  tools  debugging  optimization  jit 
12 weeks ago by jm
Linux Load Averages: Solving the Mystery
Nice bit of OS archaeology by Brendan Gregg.
In 1993, a Linux engineer found a nonintuitive case with load averages, and with a three-line patch changed them forever from "CPU load averages" to what one might call "system load averages." His change included tasks in the uninterruptible state, so that load averages reflected demand for disk resources and not just CPUs. These system load averages count the number of threads working and waiting to work, and are summarized as a triplet of exponentially-damped moving sum averages that use 1, 5, and 15 minutes as constants in an equation. This triplet of numbers lets you see if load is increasing or decreasing, and their greatest value may be for relative comparisons with themselves.
load  monitoring  linux  unix  performance  ops  brendan-gregg  history  cpu 
august 2017 by jm
consistent hashing with bounded loads
'an algorithm that combined consistent hashing with an upper limit on any one server’s load, relative to the average load of the whole pool.'

Lovely blog post from Vimeo's eng blog on a new variation on consistent hashing -- incorporating a concept of overload-avoidance -- and adding it to HAProxy and using it in production in Vimeo. All sounds pretty nifty! (via Toby DiPasquale)
via:codeslinger  algorithms  networking  performance  haproxy  consistent-hashing  load-balancing  lbs  vimeo  overload  load 
august 2017 by jm
EBS gp2 I/O BurstBalance exhaustion
when EBS volumes in EC2 exhaust their "burst" allocation, things go awry very quickly
performance  aws  ebs  ec2  burst-balance  ops  debugging 
july 2017 by jm
Fastest syncing of S3 buckets
good tip for "aws s3 sync" performance
performance  aws  s3  copy  ops  tips 
july 2017 by jm
Top 5 ways to improve your AWS EC2 performance
A couple of bits of excellent advice from Datadog (although this may be a slightly old post, from Oct 2016):

1. Unpredictable EBS disk I/O performance. Note that gp2 volumes do not appear to need as much warmup or priming as before.

2. EC2 Instance ECU Mismatch and Stolen CPU. advice: use bigger instances

The other 3 ways are a little obvious by comparison, but worth bookmarking for those two anyway.
ops  ec2  performance  datadog  aws  ebs  stolen-cpu  virtualization  metrics  tips 
july 2017 by jm
usl4j And You | codahale.com
Coda Hale wrote a handy java library implementing a USL solver
usl  scalability  java  performance  optimization  benchmarking  measurement  ops  coda-hale 
june 2017 by jm
don't use String.intern() in Java
String.intern is the gateway to native JVM String table, and it comes with caveats: throughput, memory footprint, pause time problems will await the users. Hand-rolled deduplicators/interners to reduce memory footprint are working much more reliably, because they are working on Java side, and also can be thrown away when done. GC-assisted String deduplication does alleviate things even more. In almost every project we were taking care of, removing String.intern from the hotpaths was the very profitable performance optimization. Do not use it without thinking, okay?
strings  interning  java  performance  tips 
may 2017 by jm
Amazon DynamoDB Accelerator (DAX)
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB that delivers up to a 10x performance improvement – from milliseconds to microseconds – even at millions of requests per second. DAX does all the heavy lifting required to add in-memory acceleration to your DynamoDB tables, without requiring developers to manage cache invalidation, data population, or cluster management.


No latency percentile figures, unfortunately. Also still in preview.
amazon  dynamodb  aws  dax  performance  storage  databases  latency  low-latency 
april 2017 by jm
The Occasional Chaos of AWS Lambda Runtime Performance
If our code has modest resource requirements, and can tolerate large changes in performance, then it makes sense to start with the least amount of memory necessary. On the other hand, if consistency is important, the best way to achieve that is by cranking the memory setting all the way up to 1536MB.
It’s also worth noting here that CPU-bound Lambdas may be cheaper to run over time with a higher memory setting, as Jim Conning describes in his article, “AWS Lambda: Faster is Cheaper”. In our tests, we haven’t seen conclusive evidence of that behavior, but much more data is required to draw any strong conclusions.
The other lesson learned is that Lambda benchmarks should be gathered over the course of days, not hours or minutes, in order to provide actionable information. Otherwise, it’s possible to see very impressive performance from a Lambda that might later dramatically change for the worse, and any decisions made based on that information will be rendered useless.
aws  lambda  amazon  performance  architecture  ops  benchmarks 
march 2017 by jm
Testing Docker multi-host network performance - Percona Database Performance Blog
wow, Docker Swarm looks like a turkey right now if performance is important. Only "host" gives reasonably perf numbers
docker  networking  performance  ops  benchmarks  testing  swarm  overlay  calico  weave  bridge 
november 2016 by jm
MemC3: Compact and concurrent Memcache with dumber caching and smarter hashing
An improved hashing algorithm called optimistic cuckoo hashing, and a CLOCK-based eviction algorithm that works in tandem with it. They are evaluated in the context of Memcached, where combined they give up to a 30% memory usage reduction and up to a 3x improvement in queries per second as compared to the default Memcached implementation on read-heavy workloads with small objects (as is typified by Facebook workloads).
memcached  performance  key-value-stores  storage  databases  cuckoo-hashing  algorithms  concurrency  caching  cache-eviction  memory  throughput 
november 2016 by jm
Measuring Docker IO overhead - Percona Database Performance Blog
See also https://www.percona.com/blog/2016/02/05/measuring-docker-cpu-network-overhead/ for the CPU/Network equivalent. The good news is that nowadays it's virtually 0 when the correct settings are used
docker  percona  overhead  mysql  deployment  performance  ops  containers 
november 2016 by jm
How to Quantify Scalability
good page on the Universal Scalability Law and how to apply it
usl  performance  scalability  concurrency  capacity  measurement  excel  equations  metrics 
september 2016 by jm
USE Method: Linux Performance Checklist
Really late in bookmarking this, but has some up-to-date sample commandlines for sar, mpstat and iostat on linux
linux  sar  iostat  mpstat  cli  ops  sysadmin  performance  tuning  use  metrics 
june 2016 by jm
Koloboke Collections
Interesting new collections lib for Java 6+; generates Map-like and Set-like collections at runtime based on the contract annotations you desire. Fat (20MB) library-based implementation also available
collections  java  koloboke  performance  coding 
may 2016 by jm
raboof/nethogs: Linux 'net top' tool
NetHogs is a small 'net top' tool. Instead of breaking the traffic down per protocol or per subnet, like most tools do, it groups bandwidth by process.
nethogs  cli  networking  performance  measurement  ops  linux  top 
may 2016 by jm
Open Sourcing Dr. Elephant: Self-Serve Performance Tuning for Hadoop and Spark
[LinkedIn] are proud to announce today that we are open sourcing Dr. Elephant, a powerful tool that helps users of Hadoop and Spark understand, analyze, and improve the performance of their flows.


neat, although I've been bitten too many times by LinkedIn OSS release quality at this point to jump in....
linkedin  oss  hadoop  spark  performance  tuning  ops 
april 2016 by jm
Gil Tene on benchmarking
'I would strongly encourage you to avoid repeating the mistakes of testing methodologies that focus entirely on max achievable throughput and then report some (usually bogus) latency stats at those max throughout modes. The techempower numbers are a classic example of this in play, and while they do provide some basis for comparing a small aspect of behavior (what I call the "how fast can this thing drive off a cliff" comparison, or "pedal to the metal" testing), those results are not very useful for comparing load carrying capacities for anything that actually needs to maintain some form of responsiveness SLA or latency spectrum requirements.'

Some excellent advice here on how to measure and represent stack performance.

Also: 'DON'T use or report standard deviation for latency. Ever. Except if you mean it as a joke.'
performance  benchmarking  testing  speed  gil-tene  latency  measurement  hdrhistogram  load-testing  load 
april 2016 by jm
BLAKE2: simpler, smaller, fast as MD5
'We present the cryptographic hash function BLAKE2, an improved version
of the SHA-3 finalist BLAKE optimized for speed in software. Target applications include
cloud storage, intrusion detection, or version control systems. BLAKE2 comes
in two main flavors: BLAKE2b is optimized for 64-bit platforms, and BLAKE2s for
smaller architectures. On 64-bit platforms, BLAKE2 is often faster than MD5, yet provides
security similar to that of SHA-3. We specify parallel versions BLAKE2bp and
BLAKE2sp that are up to 4 and 8 times faster, by taking advantage of SIMD and/or
multiple cores. BLAKE2 has more benefits than just speed: BLAKE2 uses up to 32%
less RAM than BLAKE, and comes with a comprehensive tree-hashing mode as well
as an efficient MAC mode.'
crypto  hash  blake2  hashing  blake  algorithms  sha1  sha3  simd  performance  mac 
april 2016 by jm
Conversant ConcurrentQueue and Disruptor BlockingQueue
'Disruptor is the highest performing intra-thread transfer mechanism available in Java. Conversant Disruptor is the highest performing implementation of this type of ring buffer queue because it has almost no overhead and it exploits a particularly simple design.

Conversant has been using this in production since 2012 and the performance is excellent. The BlockingQueue implementation is very stable, although we continue to tune and improve it. The latest release, 1.2.4, is 100% production ready.

Although we have been working on it for a long time, we decided to open source our BlockingQueue this year to contribute something back to the community. ... its a drop in for BlockingQueue, so its a very easy test. Conversant Disruptor will crush ArrayBlockingQueue and LinkedTransferQueue for thread to thread transfers.

In our system, we noticed a 10-20% reduction in overall system load and latency when we introduced it.'
disruptor  blocking-queues  queues  queueing  data-structures  algorithms  java  conversant  concurrency  performance 
march 2016 by jm
The Nyquist theorem and limitations of sampling profilers today, with glimpses of tracing tools from the future
Awesome post from Dan Luu with data from Google:
The cause [of some mystery widespread 250ms hangs] was kernel throttling of the CPU for processes that went beyond their usage quota. To enforce the quota, the kernel puts all of the relevant threads to sleep until the next multiple of a quarter second. When the quarter-second hand of the clock rolls around, it wakes up all the threads, and if those threads are still using too much CPU, the threads get put back to sleep for another quarter second. The phase change out of this mode happens when, by happenstance, there aren’t too many requests in a quarter second interval and the kernel stops throttling the threads. After finding the cause, an engineer found that this was happening on 25% of disk servers at Google, for an average of half an hour a day, with periods of high latency as long as 23 hours. This had been happening for three years. Dick Sites says that fixing this bug paid for his salary for a decade. This is another bug where traditional sampling profilers would have had a hard time. The key insight was that the slowdowns were correlated and machine wide, which isn’t something you can see in a profile.
debugging  performance  visualization  instrumentation  metrics  dan-luu  latency  google  dick-sites  linux  scheduler  throttling  kernel  hangs 
february 2016 by jm
About Microservices, Containers and their Underestimated Impact on Network Performance
shock horror, Docker-SDN layers have terrible performance. Still pretty lousy perf impacts from basic Docker containerization, presumably without "--net=host" (which is apparently vital)
docker  performance  network  containers  sdn  ops  networking  microservices 
january 2016 by jm
BBC Digital Media Distribution: How we improved throughput by 4x
Replacing varnish with nginx. Nice deep-dive blog post covering kernel innards
nginx  performance  varnish  web  http  bbc  ops 
january 2016 by jm
Very Fast Reservoir Sampling
via Tony Finch. 'In this post I will demonstrate how to do reservoir sampling orders of magnitude faster than the traditional “naive” reservoir sampling algorithm, using a fast high-fidelity approximation to the reservoir sampling-gap distribution.'
statistics  reservoir-sampling  sampling  algorithms  poisson  bernoulli  performance 
december 2015 by jm
[LUCENE-6917] Deprecate and rename NumericField/RangeQuery to LegacyNumeric - ASF JIRA
Interesting performance-related tweak going into Lucene -- based on the Bkd-Tree I think: https://users.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf . Being used for all numeric index types, not just multidimensional ones?
lucene  performance  algorithms  patches  bkd-trees  geodata  numeric  indexing 
december 2015 by jm
Low-latency journalling file write latency on Linux
great research from LMAX: xfs/ext4 are the best choices, and they explain why in detail, referring to the code
linux  xfs  ext3  ext4  filesystems  lmax  performance  latency  journalling  ops 
december 2015 by jm
Why Percentiles Don’t Work the Way you Think
Baron Schwartz on metrics, percentiles, and aggregation. +1, although as a HN commenter noted, quantile digests are probably the better fix
performance  percentiles  quantiles  statistics  metrics  monitoring  baron-schwartz  vividcortex 
december 2015 by jm
Topics in High-Performance Messaging
'We have worked together in the field of high-performance messaging for many years, and in that time, have seen some messaging systems that worked well and some that didn't. Successful deployment of a messaging system requires background information that is not easily available; most of what we know, we had to learn in the school of hard knocks. To save others a knock or two, we have collected here the essential background information and commentary on some of the issues involved in successful deployments. This information is organized as a series of topics around which there seems to be confusion or uncertainty. Please contact us if you have questions or comments.'
messaging  scalability  scaling  performance  udp  tcp  protocols  multicast  latency 
december 2015 by jm
Cache-friendly binary search
by reordering items to optimize locality. Via aphyr's dad!
caches  cache-friendly  optimization  data-locality  performance  coding  algorithms 
november 2015 by jm
The impact of Docker containers on the performance of genomic pipelines [PeerJ]
In this paper, we have assessed the impact of Docker containers technology on the performance of genomic pipelines, showing that container “virtualization” has a negligible overhead on pipeline performance when it is composed of medium/long running tasks, which is the most common scenario in computational genomic pipelines.

Interestingly for these tasks the observed standard deviation is smaller when running with Docker. This suggests that the execution with containers is more “homogeneous,” presumably due to the isolation provided by the container environment.

The performance degradation is more significant for pipelines where most of the tasks have a fine or very fine granularity (a few seconds or milliseconds). In this case, the container instantiation time, though small, cannot be ignored and produces a perceptible loss of performance.
performance  docker  ops  genomics  papers 
november 2015 by jm
Seastar
C++ high-performance app framework; 'currently focused on high-throughput, low-latency I/O intensive applications.'

Scylla (Cassandra-compatible NoSQL store) is written in this.
c++  opensource  performance  framework  scylla  seastar  latency  linux  shared-nothing  multicore 
september 2015 by jm
Stormpot
an object pooling library for Java. Use it to recycle objects that are expensive to create. The library will take care of creating and destroying your objects in the background. Stormpot is very mature, is used in production, and has done over a trillion claim-release cycles in testing. It is faster and scales better than any competing pool.


Apache-licensed, and extremely fast: https://medium.com/@chrisvest/released-stormpot-2-4-eeab4aec86d0
java  stormpot  object-pooling  object-pools  pools  allocation  gc  open-source  apache  performance 
september 2015 by jm
Large Java HashMap performance overview
Large HashMap overview: JDK, FastUtil, Goldman Sachs, HPPC, Koloboke, Trove – January 2015 version
java  performance  hashmap  hashmaps  optimization  fastutil  hppc  jdk  koloboke  trove  data-structures 
september 2015 by jm
rwasa
our full-featured, high performance, scalable web server designed to compete with the likes of nginx. It has been built from the ground-up with no external library dependencies entirely in x86_64 assembly language, and is the result of many years' experience with high volume web environments. In addition to all of the common things you'd expect a modern web server to do, we also include assembly language function hooks ready-made to facilitate Rapid Web Application Server (in Assembler) development.
assembly  http  performance  https  ssl  x86_64  web  ops  rwasa  tls 
august 2015 by jm
Java lambdas and performance
Lambdas in Java 8 introduce some unpredictable performance implications, due to reliance on escape analysis to eliminate object allocation on every lambda invocation. Peter Lawrey has some details
lambdas  java-8  java  performance  low-latency  optimization  peter-lawrey  coding  escape-analysis 
july 2015 by jm
How to receive a million packets per second on Linux

To sum up, if you want a perfect performance you need to:
Ensure traffic is distributed evenly across many RX queues and SO_REUSEPORT processes. In practice, the load usually is well distributed as long as there are a large number of connections (or flows).
You need to have enough spare CPU capacity to actually pick up the packets from the kernel.
To make the things harder, both RX queues and receiver processes should be on a single NUMA node.
linux  networking  performance  cloudflare  packets  numa  so_reuseport  sockets  udp 
june 2015 by jm
Improving testing by using real traffic from production
Gor, a very nice-looking tool to log and replay HTTP traffic, specifically designed to "tee" live traffic from production to staging for pre-release testing
gor  performance  testing  http  tcp  packet-capture  tests  staging  tee 
june 2015 by jm
Performance Testing at LMAX
Good series of blog posts on the LMAX trading platform's performance testing strategy -- they capture live traffic off the wire, then build statistical models simulating its features. See also http://epickrram.blogspot.co.uk/2014/07/performance-testing-at-lmax-part-two.html and http://epickrram.blogspot.co.uk/2014/08/performance-testing-at-lmax-part-three.html .
performance  testing  tests  simulation  latency  lmax  trading  sniffing  packet-capture 
june 2015 by jm
HTTP/2 is here, let's optimize! - Velocity SC 2015 - Google Slides
Changes which server-side developers will need to start considering as HTTP/2 rolls out. Remove domain sharding; stop concatenating resources; stop inlining resources; use server push.
http2  http  protocols  streaming  internet  web  dns  performance 
june 2015 by jm
SolarCapture Packet Capture Software
Interesting product line -- I didn't know this existed, but it makes good sense as a "network flight recorder". Big in finance.
SolarCapture is powerful packet capture product family that can transform every server into a precision network monitoring device, increasing network visibility, network instrumentation, and performance analysis. SolarCapture products optimize network monitoring and security, while eliminating the need for specialized appliances, expensive adapters relying on exotic protocols, proprietary hardware, and dedicated networking equipment.


See also Corvil (based in Dublin!): 'I'm using a Corvil at the moment and it's awesome- nanosecond precision latency measurements on the wire.'

(via mechanical sympathy list)
corvil  timing  metrics  measurement  latency  network  solarcapture  packet-capture  financial  performance  security  network-monitoring 
may 2015 by jm
Intel speeds up etcd throughput using ADR Xeon-only hardware feature
To reduce the latency impact of storing to disk, Weaver’s team looked to buffering as a means to absorb the writes and sync them to disk periodically, rather than for each entry. Tradeoffs? They knew memory buffers would help, but there would be potential difficulties with smaller clusters if they violated the stable storage requirement.

Instead, they turned to Intel’s silicon architects about features available in the Xeon line. After describing the core problem, they found out this had been solved in other areas with ADR. After some work to prove out a Linux OS supported use for this, they were confident they had a best-of-both-worlds angle. And it worked. As Weaver detailed in his CoreOS Fest discussion, the response time proved stable. ADR can grab a section of memory, persist it to disk and power it back. It can return entries back to disk and restore back to the buffer. ADR provides the ability to make small (<100MB) segments of memory “stable” enough for Raft log entries. It means it does not need battery-backed memory. It can be orchestrated using Linux or Windows OS libraries. ADR allows the capability to define target memory and determine where to recover. It can also be exposed directly into libs for runtimes like Golang. And it uses silicon features that are accessible on current Intel servers.
kubernetes  coreos  adr  performance  intel  raft  etcd  hardware  linux  persistence  disk  storage  xeon 
may 2015 by jm
Memory Layouts for Binary Search
Key takeaway:
Nearly uni­ver­sally, B-trees win when the data gets big enough.
caches  cpu  performance  optimization  memory  binary-search  b-trees  algorithms  search  memory-layout 
may 2015 by jm
Migration to, Expectations, and Advanced Tuning of G1GC
Bookmarking for future reference. recommended by one of the GC experts, I can't recall exactly who ;)
gc  g1gc  jvm  java  tuning  performance  ops  migration 
may 2015 by jm
The Injector: A new Executor for Java
This honestly fits a narrow niche, but one that is gaining in popularity. If your messages take > 100μs to process, or your worker threads are consistently saturated, the standard ThreadPoolExecutor is likely perfectly adequate for your needs. If, on the other hand, you’re able to engineer your system to operate with one application thread per physical core you are probably better off looking at an approach like the LMAX Disruptor. However, if you fall in the crack in between these two scenarios, or are seeing a significant portion of time spent in futex calls and need a drop in ExecutorService to take the edge off, the injector may well be worth a look.
performance  java  executor  concurrency  disruptor  algorithms  coding  threads  threadpool  injector 
may 2015 by jm
Cassandra moving to using G1 as the default recommended GC implementation
This is a big indicator that G1 is ready for primetime. CMS has long been the go-to GC for production usage, but requires careful, complex hand-tuning -- if G1 is getting to a stage where it's just a case of giving it enough RAM, that'd be great.

Also, looks like it'll be the JDK9 default: https://twitter.com/shipilev/status/593175793255219200
cassandra  tuning  ops  g1gc  cms  gc  java  jvm  production  performance  memory 
april 2015 by jm
_Blade: a Data Center Garbage Collector_
Essentially, add a central GC scheduler to improve tail latencies in a cluster, by taking instances out of the pool to perform slow GC activity instead of letting them impact live operations. I've been toying with this idea for a while, nice to see a solid paper about it
gc  latency  tail-latencies  papers  blade  go  java  scheduling  clustering  load-balancing  low-latency  performance 
april 2015 by jm
Rob Pike's 5 rules of optimization
these are great. I've run into rule #3 ("fancy algorithms are slow when n is small, and n is usually small") several times...
twitter  rob-pike  via:igrigorik  coding  rules  laws  optimization  performance  algorithms  data-structures  aphorisms 
april 2015 by jm
Optimizing Java CMS garbage collections, its difficulties, and using JTune as a solution | LinkedIn Engineering
I like the sound of this -- automated Java CMS GC tuning, kind of like a free version of JClarity's Censum (via Miguel Ángel Pastor)
java  jvm  tuning  gc  cms  linkedin  performance  ops 
april 2015 by jm
Introducing Vector: Netflix's On-Host Performance Monitoring Tool
It gives pinpoint real-time performance metric visibility to engineers working on specific hosts -- basically sending back system-level performance data to their browser, where a client-side renderer turns it into a usable dashboard. Essentially the idea is to replace having to ssh onto instances, run "top", systat, iostat, and so on.
vector  netflix  performance  monitoring  sysstat  top  iostat  netstat  metrics  ops  dashboards  real-time  linux 
april 2015 by jm
Gil Tene's "usual suspects" to reduce system-level hiccups/latency jitters in a Linux system
Based on empirical evidence (across many tens of sites thus far) and note-comparing with others, I use a list of "usual suspects" that I blame whenever they are not set to my liking and system-level hiccups are detected. Getting these settings right from the start often saves a bunch of playing around (and no, there is no "priority" to this - you should set them all right before looking for more advice...).
performance  latency  hiccups  gil-tene  tuning  mechanical-sympathy  hyperthreading  linux  ops 
april 2015 by jm
How We Scale VividCortex's Backend Systems - High Scalability
Excellent post from Baron Schwartz about their large-scale, 1-second-granularity time series database storage system
time-series  tsd  storage  mysql  sql  baron-schwartz  ops  performance  scalability  scaling  go 
march 2015 by jm
tcpcopy
"tees" all TCP traffic from one server to another. "widely used by companies in China"!
testing  benchmarking  performance  tcp  ip  tcpcopy  tee  china  regression-testing  stress-testing  ops 
march 2015 by jm
Correcting YCSB's Coordinated Omission problem
excellent walkthrough of CO and how it affects Yahoo!'s Cloud Storage Benchmarking platform
coordinated-omission  co  yahoo  ycsb  benchmarks  performance  testing 
march 2015 by jm
Biased Locking in HotSpot (David Dice's Weblog)
This is pretty nuts. If biased locking in the HotSpot JVM is causing performance issues, it can be turned off:
You can avoid biased locking on a per-object basis by calling System.identityHashCode(o). If the object is already biased, assigning an identity hashCode will result in revocation, otherwise, the assignment of a hashCode() will make the object ineligible for subsequent biased locking.
hashcode  jvm  java  biased-locking  locking  mutex  synchronization  locks  performance 
march 2015 by jm
HP is trying to patent Continuous Delivery
This is appalling bollocks from HP:
On 1st March 2015 I discovered that in 2012 HP had filed a patent (WO2014027990) with the USPO for ‘Performance tests in a continuous deployment pipeline‘ (the patent was granted in 2014). [....] HP has filed several patents covering standard Continuous Delivery (CD) practices. You can help to have these patents revoked by providing ‘prior art’ examples on Stack Exchange.

In fairness, though, this kind of shit happens in most big tech companies. This is what happens when you have a broken software patenting system, with big rewards for companies who obtain shitty troll patents like these, and in turn have companies who reward the engineers who sell themselves out to write up concepts which they know have prior art. Software patents are broken by design!
cd  devops  hp  continuous-deployment  testing  deployment  performance  patents  swpats  prior-art 
march 2015 by jm
500 Mbps upload to S3
the following guidelines maximize bandwidth usage:
Optimizing the sizes of the file parts, whether they are part of a large file or an entire small file; Optimizing the number of parts transferred concurrently.
Tuning these two parameters achieves the best possible transfer speeds to [S3].
s3  uploads  dataman  aws  ec2  performance 
march 2015 by jm
JClarity's Illuminate
Performance-diagnosis-as-a-service. Cool.
Users download and install an Illuminate Daemon using a simple installer which starts up a small stand alone Java process. The Daemon sits quietly unless it is asked to start gathering SLA data and/or to trigger a diagnosis. Users can set SLA’s via the dashboard and can opt to collect latency measurements of their transactions manually (using our library) or by asking Illuminate to automatically instrument their code (Servlet and JDBC based transactions are currently supported).

SLA latency data for transactions is collected on a short cycle. When the moving average of latency measurements goes above the SLA value (e.g. 150ms), a diagnosis is triggered. The diagnosis is very quick, gathering key data from O/S, JVM(s), virtualisation and other areas of the system. The data is then run through the machine learned algorithm which will quickly narrow down the possible causes and gather a little extra data if needed.

Once Illuminate has determined the root cause of the performance problem, the diagnosis report is sent back to the dashboard and an alert is sent to the user. That alert contains a link to the result of the diagnosis which the user can share with colleagues. Illuminate has all sorts of backoff strategies to ensure that users don’t get too many alerts of the same type in rapid succession!
illuminate  jclarity  java  jvm  scala  latency  gc  tuning  performance 
february 2015 by jm
What Color Is Your Xen?
What a mess.
What's faster: PV, HVM, HVM with PV drivers, PVHVM, or PVH? Cloud computing providers using Xen can offer different virtualization "modes", based on paravirtualization (PV), hardware virtual machine (HVM), or a hybrid of them. As a customer, you may be required to choose one of these. So, which one?
ec2  linux  performance  aws  ops  pv  hvm  xen  virtualization 
february 2015 by jm
Performance Co-Pilot
System performance metrics framework, plugged by Netflix, open-source for ages
open-source  pcp  performance  system  metrics  ops  red-hat  netflix 
february 2015 by jm
Azul Zing on Ubuntu on AWS Marketplace
hmmm, very interesting -- the super-low-latency Zing JVM is available as a commercial EC2 instance type, at costs less than the EC2 instance price
zing  azul  latency  performance  ec2  aws 
february 2015 by jm
TL;DR: Cassandra Java Huge Pages
Al Tobey does some trial runs of -XX:+AlwaysPreTouch and -XX:+UseHugePages
jvm  performance  tuning  huge-pages  vm  ops  cassandra  java 
february 2015 by jm
Maintaining performance in distributed systems [slides]
Great slide deck from Elasticsearch on JVM/dist-sys performance optimization
performance  elasticsearch  java  jvm  ops  tuning 
january 2015 by jm
Why we don't use a CDN: A story about SPDY and SSL
All of our assets loaded via the CDN [to our client in Australia] in just under 5 seconds. It only took ~2.7s to get those same assets to our friends down under with SPDY. The performance with no CDN blew the CDN performance out of the water. It is just no comparison. In our case, it really seems that the advantages of SPDY greatly outweigh that of a CDN when it comes to speed.
cdn  spdy  nginx  performance  web  ssl  tls  optimization  multiplexing  tcp  ops 
january 2015 by jm
coz
A causal profiler for C++.
Causal profiling is a novel technique to measure optimization potential. This measurement matches developers' assumptions about profilers: that optimizing highly-ranked code will have the greatest impact on performance. Causal profiling measures optimization potential for serial, parallel, and asynchronous programs without instrumentation of special handling for library calls and concurrency primitives. Instead, a causal profiler uses performance experiments to predict the effect of optimizations. This allows the profiler to establish causality: "optimizing function X will have effect Y," exactly the measurement developers had assumed they were getting all along.


I can see this being a good technique to stochastically discover race conditions and concurrency bugs, too.
optimization  c++  performance  coding  profiling  speed  causal-profilers 
december 2014 by jm
Good advice on running large-scale database stress tests
I've been bitten by poor key distribution in tests in the past, so this is spot on: 'I'd run it with Zipfian, Pareto, and Dirac delta distributions, and I'd choose read-modify-write transactions.'

And of course, a dataset bigger than all combined RAM.

Also: http://smalldatum.blogspot.ie/2014/04/biebermarks.html -- the "Biebermark", where just a single row out of the entire db is contended on in a read/modify/write transaction: "the inspiration for this is maintaining counts for [highly contended] popular entities like Justin Bieber and One Direction."
biebermark  benchmarks  testing  performance  stress-tests  databases  storage  mongodb  innodb  foundationdb  aphyr  measurement  distributions  keys  zipfian 
december 2014 by jm
Speeding up Rails 4.2
Reading between the lines, it looks like Rails 4 is waaay slower than 3....
rails  ruby  performance  profiling  discourse 
december 2014 by jm
TCP incast
a catastrophic TCP throughput collapse that occurs as the number of storage servers sending data to a client increases past the ability of an Ethernet switch to buffer packets. In a clustered file system, for example, a client application requests a data block striped across several storage servers, issuing the next data block request only when all servers have responded with their portion (Figure 1). This synchronized request workload can result in packets overfilling the buffers on the client's port on the switch, resulting in many losses. Under severe packet loss, TCP can experience a timeout that lasts a minimum of 200ms, determined by the TCP minimum retransmission timeout (RTOmin).
incast  networking  performance  tcp  bandwidth  buffering  switch  ethernet  capacity 
november 2014 by jm
« earlier      
per page:    204080120160

related tags

3g  4g  360-reviews  802.11n  ack  adler  adr  aeron  aho-corasick  akka  algorithms  allocation  amazon  amdahls-law  ami  analysis  analytics  anycast  apache  aphorisms  aphyr  apis  append  apple  architecture  archives  arrays  ascii  asl  assembly  async  atomic  aws  az  azul  b-trees  backups  bandwidth  baron-schwartz  batch  bbc  benchmarking  benchmarks  bernoulli  bgp  biased-locking  biebermark  big-data  big-o  binary-heap  binary-search  bitmaps  bkd-trees  blade  blake  blake2  blink  blocking-queues  bloom-filters  bobtail  book  bottlenecks  brendan-gregg  bridge  browser  buffer-cache  bufferbloat  buffering  buffers  bugs  burst-balance  bytearrayoutputstream  bytebuffers  c++  cache  cache-eviction  cache-friendly  caches  caching  calico  capacity  capture  capturing  cas  cassandra  causal-profilers  cd  cdn  certificates  charlie-hunt  checklists  china  chrome  chronicle  ciphers  cli  cloudera  cloudflare  clustering  cms  co  coda-hale  coding  collections  columnar  compare-and-set  compilation  complexity  compression  concurrency  consistent-hashing  containers  contention  continuous-deployment  conversant  conversion  coordinated-omission  copy  copying  coreos  corvil  cost  counting  cpu  crypto  cuckoo-hashing  curl  dan-luu  dashboards  data  data-locality  data-oriented-programming  data-structures  database  databases  datacenters  datadog  dataman  dax  debugging  defrag  demerphq  deployment  devops  dfa  dick-sites  diffie-hellman  discourse  disk  disks  disruptor  distributions  dns  docker  docs  documents  doug-lea  druid.io  dstat  dynamodb  ebooks  ebs  ec2  elasticsearch  email  emr  encryption  epoll  equations  erlang  escape-analysis  estimation  etcd  ethernet  event-processing  event-streams  events  exabgp  excel  exceptions  executor  ext3  ext4  facebook  fail  fast-path  fastutil  files  filesystems  financial  fincore  flood.io  fork  foundationdb  fpga  frame-of-reference  framework  freebsd  g1  g1gc  garbage  garbage-collection  gatling  gc  gdb  genomics  geodata  gil  gil-tene  gizzard  go  golang  google  gor  graph  graphite  gsm  guava  guppy  gzip  hacks  hadoop  handshake  hangs  haproxy  har  hardware  hash  hash-tables  hashcode  hashing  hashmap  hashmaps  hashsets  hbase  hdrhistogram  heap  hex  hiccups  history  hive  hll  hosting  hotspot  hp  hppc  hr  hsdpa  http  http2  https  huge-pages  hvm  hyperloglog  hyperthreading  iblt  ibm  illuminate  ilya-grigorik  impala  incast  indexing  infoq  injector  innodb  insane  instrumentation  integers  intel  internet  interning  interoperability  interpreters  inviso  io  ioprofile  ios  iostat  ip  ipc  java  java-8  javascript  jclarity  jdk  jdk6  jeff-dean  jetty  jiq  jit  jitter  jmeter  jmm  john-rauser  join  join-idle-queue  journalling  jruby  json  jsr166  juniper  jvm  kafka  kernel  key-value-stores  keys  koloboke  kubernetes  l1  lambda  lambdas  languages  latencies  latency  laws  lazyset  lbs  leveldb  libraries  library  likwid  linkedin  linode  linux  littles-law  lmax  load  load-balancers  load-balancing  load-generation  load-testing  lock-free  locking  locks  log4j  logging  long-distance  long-tail  longadder  low-latency  lua  lucene  mac  mail  mailinator  management  map-reduce  marc-brooker  mark-callaghan  martin-thompson  measurement  measurements  mechanical-sympathy  memcached  memcpy  memory  memory-barriers  memory-layout  merge  messaging  messing  metamarkets  metrics  microbenchmarks  microservices  microsoft  migration  mincore  mirrors  mission-control  mmap  mobile  modelling  mongodb  mongrel2  monitoring  mozilla  mpi  mpsc  mpstat  multicast  multicore  multiplexing  multithreading  mutex  mysql  mythbusters  nagle  namedtuples  near-neighbour-search  netflix  nethogs  netstat  netty  network  network-monitoring  networking  networks  nfa  nginx  nio  nosql  numa  numbers  numeric  object-pooling  object-pools  off-heap-storage  one-liners  oop  open-source  openhft  opensource  ops  optimisation  optimization  ospf  oss  ouch  out-of-band  overflow  overhead  overlay  overload  p99  packet-capture  packets  page-cache  papers  parallel  passenger  patches  patents  patterns  paul-tyma  pcp  pcre  pdf  peer-driven-review  peers  percentiles  percona  perf  performance  perl  perlmonks  persistence  peter-lawrey  phasers  phones  photos  phusion  picloud  pigz  pinning  piops  poisson  poll  pools  presentation  presentations  primitives  prior-art  priority-queue  probabilistic  production  profiling  programming  protobufs  protocols  proxies  prun  pthreads  putordered  pv  pytables  python  qcon  quantiles  queueing  queues  quickselect  raft  rails  ram  randomaccessfile  rdbms  rdd  re2  real-time  recording  red-hat  redis  redshift  reference  regex  regexps  regression-testing  regular-expressions  reservoir-sampling  reviews  riak  riemann  ring-buffers  roaring-bitmaps  rob-pike  root-cause  round-trip  routing  rsync  rtt  ruby  rules  rwasa  s3  sampling  sampling-profiler  sar  scala  scalability  scaling  scheduler  schedulers  scheduling  sched_batch  scott-andreas  scp  scribe  scylla  sdn  search  seastar  security  seda  server  servers  service-metrics  servlets  sha1  sha3  shared-nothing  shell  shuffle  shuffling  shutterstock  simd  simulation  slides  slow-start  sniffing  soa  sockets  software  solarcapture  sort  sort-based-shuffle  so_reuseport  spark  spdy  speed  speed-of-light  sql  sr-iov  ssd  sse  ssl  stack-traces  staff  staging  state-machines  statistics  statsd  statsite  stock-markets  stolen-cpu  storage  stormpot  strace  streaming  streams  street-art  stress-testing  stress-tests  string-matching  strings  stud  succinct-encoding  swap  swap-insanity  swapping  swarm  switch  swpats  synchronization  sysadmin  syscalls  sysctl  sysctls  sysstat  system  system-dynamics  systems  systemtap  tail  tail-latencies  talks  tcp  tcp-ip  tcpcopy  tcprstat  tcp_cork  tcp_nodelay  teams  tech-talks  tee  telemetry  testing  tests  threading  threadpool  threads  throttling  throughput  time-series  time-wait  timing  timsort  tips  tls  tools  top  top-k  towatch  trace  tracing  trading  transfers  transparent-huge-pages  tries  troubleshooting  trove  tsd  tsunami-udp  tunables  tuning  tutorials  twitter  udp  undocumented  unicorn  unix  unladen-swallow  uploads  use  usl  v8  varnish  vector  vectors  via:akohli  via:codeslinger  via:declanmcgrath  via:dehora  via:eoin-brazil  via:fanf  via:filippo  via:igrigorik  via:jacob  via:jeffbarr  via:jgc  via:jzawodny  via:kellabyte  via:lcbo  via:martin-thompson  via:nmaurer  via:norman-maurer  via:preddit  video  vimeo  vips  virtualization  visualization  vividcortex  vm  volatile  wait-free  war-stories  weave  web  web-services  webrtc  websockets  whatsapp  wifi  windows  wireless  work  workloads  writev  wrk  x86_64  xen  xeon  xfs  xhr  yahoo  ycsb  yourkit  zedshaw  zing  zip  zipfian  zippy 

Copy this bookmark:



description:


tags: