jm + optimization   51

Log analyser and visualiser for the HotSpot JIT compiler. Inspect inlining decisions, hot methods, bytecode, and assembly. View results in the JavaFX user interface.
analysis  java  jvm  performance  tools  debugging  optimization  jit 
27 days ago by jm
usl4j And You |
Coda Hale wrote a handy java library implementing a USL solver
usl  scalability  java  performance  optimization  benchmarking  measurement  ops  coda-hale 
june 2017 by jm
Semisupervised Intern Descent
Also known as "Graduate Student Descent", a common approach to hyperparameter tuning in machine learning -- ie. get an intern to sit there tweaking parameters until they find something approximating optimal performance
optimization  funny  terms  terminology  graduate-student-descent  hyperparameters  semisupervised-intern-descent  tuning  interns 
february 2016 by jm
Cache-friendly binary search
by reordering items to optimize locality. Via aphyr's dad!
caches  cache-friendly  optimization  data-locality  performance  coding  algorithms 
november 2015 by jm
Large Java HashMap performance overview
Large HashMap overview: JDK, FastUtil, Goldman Sachs, HPPC, Koloboke, Trove – January 2015 version
java  performance  hashmap  hashmaps  optimization  fastutil  hppc  jdk  koloboke  trove  data-structures 
september 2015 by jm
Java lambdas and performance
Lambdas in Java 8 introduce some unpredictable performance implications, due to reliance on escape analysis to eliminate object allocation on every lambda invocation. Peter Lawrey has some details
lambdas  java-8  java  performance  low-latency  optimization  peter-lawrey  coding  escape-analysis 
july 2015 by jm
Memory Layouts for Binary Search
Key takeaway:
Nearly uni­ver­sally, B-trees win when the data gets big enough.
caches  cpu  performance  optimization  memory  binary-search  b-trees  algorithms  search  memory-layout 
may 2015 by jm
Rob Pike's 5 rules of optimization
these are great. I've run into rule #3 ("fancy algorithms are slow when n is small, and n is usually small") several times...
twitter  rob-pike  via:igrigorik  coding  rules  laws  optimization  performance  algorithms  data-structures  aphorisms 
april 2015 by jm
Why we don't use a CDN: A story about SPDY and SSL
All of our assets loaded via the CDN [to our client in Australia] in just under 5 seconds. It only took ~2.7s to get those same assets to our friends down under with SPDY. The performance with no CDN blew the CDN performance out of the water. It is just no comparison. In our case, it really seems that the advantages of SPDY greatly outweigh that of a CDN when it comes to speed.
cdn  spdy  nginx  performance  web  ssl  tls  optimization  multiplexing  tcp  ops 
january 2015 by jm
A causal profiler for C++.
Causal profiling is a novel technique to measure optimization potential. This measurement matches developers' assumptions about profilers: that optimizing highly-ranked code will have the greatest impact on performance. Causal profiling measures optimization potential for serial, parallel, and asynchronous programs without instrumentation of special handling for library calls and concurrency primitives. Instead, a causal profiler uses performance experiments to predict the effect of optimizations. This allows the profiler to establish causality: "optimizing function X will have effect Y," exactly the measurement developers had assumed they were getting all along.

I can see this being a good technique to stochastically discover race conditions and concurrency bugs, too.
optimization  c++  performance  coding  profiling  speed  causal-profilers 
december 2014 by jm
Facebook's drop-in replacement for std::vector
Fixes some low-hanging fruit, performance-wise.

'Simply replacing std::vector with folly::fbvector (after having included the folly/FBVector.h header file) will improve the performance of your C++ code using vectors with common coding patterns. The improvements are always non-negative, almost always measurable, frequently significant, sometimes dramatic, and occasionally spectacular.'

(via Tony Finch)
c++  facebook  performance  algorithms  vectors  via:fanf  optimization 
september 2014 by jm
A nice Lua/C++ implementation of Aho-Corasick for fast string matching against multiple patterns (via JGC). This uses an interesting technique to get better performance by compacting the data structure into a single buffer, to avoid following pointers all over RAM and busting the cache.
optimization  speed  performance  aho-corasick  tries  string-matching  strings  algorithms  lua  c++  via:jgc 
august 2014 by jm
'Addressing the rebalancing problem in bike-sharing systems' [paper]
Many of the bike-sharing systems introduced around the world in the past 15 years have the same problem: Riders tend to take some routes and not others. As a result, the bikes tend to collect in a few places, which is a drag for users and a costly problem for the operators, who "rebalance" the system using trucks that take bikes from full stations to empty ones. Now, scientists are coming up with special algorithms to improve this process. One of them, developed by scientists at the Vienna University of Technology and the Austrian Institute of Technology, is now being tested in Vienna's bike-sharing system; another, developed at Cornell University, is already in use in New York City.

Timely -- here's what Dublin Bikes looked like this morning:

(via Andrew Caines)
cycling  bike-sharing  borisbikes  dublinbikes  rebalancing  fleet  availability  optimization  maths  papers  toread  algorithms 
august 2014 by jm
'TCP And The Lower Bound of Web Performance' [pdf, slides]
John Rauser, Velocity, June 2010. Good data on real-world web perf based on the limitations which TCP and the speed of light impose
tcp  speed-of-light  performance  web  optimization  john-rauser 
august 2014 by jm
"Pitfalls of Object Oriented Programming", SCEE R&D
Good presentation discussing "data-oriented programming" -- the concept of optimizing memory access speed by laying out large data in a columnar format in RAM, rather than naively in the default layout that OOP design suggests
columnar  ram  memory  optimization  coding  c++  oop  data-oriented-programming  data  cache  performance 
july 2014 by jm
Exceptional Performance
Good benchmark data on the performance of JVM exceptions
java  jvm  exceptions  benchmarking  performance  optimization  coding 
may 2014 by jm
Druid | How We Scaled HyperLogLog: Three Real-World Optimizations
3 optimizations have made to the HLL algorithm to scale it up for production use in Metamarkets: compacting registers (fixes a bug with unions of multiple HLLs); a sparse storage format (to optimize space); faster lookups using a lookup table.  metamarkets  scaling  hyperloglog  hll  algorithms  performance  optimization  counting  estimation 
april 2014 by jm
Garbage Collection Optimization for High-Throughput and Low-Latency Java Applications
LinkedIn talk about the GC opts they used to optimize the Feed. good detail
performance  optimization  linkedin  java  jvm  gc  tuning 
april 2014 by jm
Netty Best Practices
'a.k.a. Faster == Better'. Slides from a talk given at Facebook on 19th March 2014 by Norman Maurer
netty  java  performance  optimization  facebook  slides  presentations 
april 2014 by jm
[#1259] Add optimized queue for SCMP pattern and use it in NIO and nativ... · 6efac61 · netty/netty
Interesting -- Netty has imported an optimized ASL2-licensed MPSC queue implementation from Akka (presumably for performance raisins)
performance  optimization  open-source  mpsc  queues  data-structures  netty  akka  java 
march 2014 by jm
Impact of large primitive arrays (BLOBS) on JVM Garbage Collection
some nice graphs and data on CMS performance, with/without -XX:ParGCCardsPerStrideChunk
cms  java  jvm  performance  optimization  tuning  off-heap-storage  memory 
march 2014 by jm
Branchless hex-to-decimal conversion hack
via @simonebordet, on the mechanical-sympathy list: ((c & 0x1F) + ((c >> 6) * 0x19) – 0x10)
hacks  one-liners  coding  performance  optimization  hex  conversion  numbers  ascii 
january 2014 by jm
Google Fonts recently switched to using Zopfli
Google Fonts recently switched to using new Zopfli compression algorithm:  the fonts are ~6% smaller on average, and in some cases up to 15% smaller! [...]
What's Zopfli? It's an algorithm that was developed by the compression team at Google that delivers ~3~8% bytesize improvement when compared to gzip with maximum compression. This byte savings comes at a cost of much higher encoding cost, but the good news is, fonts are static files and decompression speed is exactly the same. Google Fonts pays the compression cost once and every clients gets the benefit of smaller download. If you’re curious to learn more about Zopfli:
zopfli  compression  gzip  fonts  google  speed  optimization 
january 2014 by jm
Netflix: Your Linux AMI: optimization and performance [slides]
a fantastic bunch of low-level kernel tweaks and tunables which Netflix have found useful in production to maximise productivity of their fleet. Interesting use of SCHED_BATCH process scheduler class for batch processes, in particular. Also, great docs on their experience with perf and SystemTap. Perf really looks like a tool I need to get to grips with...
netflix  aws  tuning  ami  perf  systemtap  tunables  sched_batch  batch  hadoop  optimization  performance 
december 2013 by jm
stereopsis : graphics : radix tricks
some nice super-optimized Radix Sort code which handles floating point values. See also for more info on the histogramming/counter concept
sorting  programming  coding  algorithms  radix-sort  optimization  floating-point 
december 2013 by jm
Asynchronous logging versus Memory Mapped Files
Interesting article around using mmap'd files from Java using RandomAccessFile.getChannel().map(), which allows them to be accessed directly as a ByteBuffer. together with Atomic variable lazySet() operations, this provides pretty excellent performance results on low-latency writes to disk. See also:
atomic  lazyset  putordered  jmm  java  synchronization  randomaccessfile  bytebuffers  performance  optimization  memory  disk  queues 
november 2013 by jm
SPSC revisited part III - FastFlow + Sparse Data
holy moly. This is some heavily-optimized mechanical-sympathy Java code. By using a sparse data structure, cache-aligned fields, and wait-free low-level CAS concurrency primitives via sun.misc.Unsafe, a single-producer/single-consumer queue implementation goes pretty damn fast compared to the current state of the art
nitsanw  optimization  concurrency  java  jvm  cas  spsc  queues  data-structures  algorithms 
october 2013 by jm
Groundbreaking Results for High Performance Trading with FPGA and x86 Technologies
The enhancement in performance was achieved by providing a fast-path where trades are executed directly by the FPGA under the control of trigger rules processed by the x86 based functions. The latency is reduced further by two additional techniques in the FPGA – inline parsing and pre-emption. As market data enters the switch, the Ethernet frame is parsed serially as bits arrive, allowing partial information to be extracted and matched before the whole frame has been received. Then, instead of waiting until the end of a potential triggering input packet, pre-emption is used to start sending the overhead part of a response which contains the Ethernet, IP, TCP and FIX headers. This allows completion of an outgoing order almost immediately after the end of the triggering market feed packet.

Insane stuff. (Via Martin Thompson)
via:martin-thompson  insane  speed  low-latency  fpga  fast-path  trading  stock-markets  performance  optimization  ethernet 
october 2013 by jm
Probabalistic Scraping of Plain Text Tables
a nifty hack.
Recently I have been banging my head trying to import a ton of OCR acquired data expressed in tabular form. I think I have come up with a neat approach using probabilistic reasoning combined with mixed integer programming. The method is pretty robust to all sorts of real world issues. In particular, the method leverages topological understanding of tables, encodes it declaratively into a mixed integer/linear program, and integrates weak probabilistic signals to classify the whole table in one go (at sub second speeds). This method can be used for any kind of classification where you have strong logical constraints but noisy data.

(via proggit)
scraping  tables  ocr  probabilistic  linear-programming  optimization  machine-learning  via:proggit 
september 2013 by jm
Java Garbage Collection Distilled
Martin Thompson lays it out:
Serial, Parallel, Concurrent, CMS, G1, Young Gen, New Gen, Old Gen, Perm Gen, Eden, Tenured, Survivor Spaces, Safepoints, and the hundreds of JVM start-up flags. Does this all baffle you when trying to tune the garbage collector while trying to get the required throughput and latency from your Java application? If it does then don’t worry, you are not alone. Documentation describing garbage collection feels like man pages for an aircraft. Every knob and dial is detailed and explained but nowhere can you find a guide on how to fly. This article will attempt to explain the tradeoffs when choosing and tuning garbage collection algorithms for a particular workload.
gc  java  garbage-collection  coding  cms  g1  jvm  optimization 
june 2013 by jm
Big Memory, Part 4
good microbenchmarking of a bunch of Java collections; Trove, fastutil, PCJ, mahout-collections, hppc
java  collections  benchmarks  performance  speed  coding  data-structures  optimization 
june 2013 by jm
fastutil extends the Java™ Collections Framework by providing type-specific maps, sets, lists and queues with a small memory footprint and fast access and insertion; provides also big (64-bit) arrays, sets and lists, and fast, practical I/O classes for binary and text files. It is free software distributed under the Apache License 2.0. It requires Java 6 or newer.

used by Facebook (along with Apache Giraph, Netty, Unsafe) to speed up "weekend Hive jobs" to "coffee breaks".
via:highscalability  facebook  giraph  optimization  java  speed  fastutil  collections  data-structures 
june 2013 by jm
Measuring & Optimizing I/O Performance
Another good writeup on iostat and EBS, from Ilya Grigorik
io  optimization  sysadmin  performance  iostat  ebs  aws  ops 
may 2013 by jm
Jetty-9 goes fast with Mechanical Sympathy
This is very cool! Applying Mechanical Sympathy optimization techniques to Jetty, specifically: "False sharing" on the BlockingArrayQueue data structure resolved; a new ArrayTernaryTrie data structure to improve header field storage, making it faster to build. look up, efficient on RAM, cheap to GC, and more cache-friendly than a traditional trie; and a branchless hex-to-byte conversion statement. The results are a 30%-faster microbenchmark on amd64, with 50% less Young Gen garbage collections. Lovely to see low-level infrastructure libs like Jetty getting this kind of optimization.
jetty  java  mechanical-sympathy  optimization  coding  tries 
february 2013 by jm
Implementing strcmp, strlen, and strstr using SSE 4.2 instructions -
Using new Intel Core i7 instructions to speed up string manipulation. Fascinating stuff. SSE ftw
sse  optimization  simd  assembly  intel  i7  intel-core  strstr  strings  string-matching  strchr  strlen  coding 
january 2013 by jm
AnandTech - The Intel SSD DC S3700: Intel's 3rd Generation Controller Analyzed
Interesting trend; Intel moved from a btree to an array-based data structure for their logical-block address indirection map, in order to reduce worst-case latencies (via Martin Thompson)
latency  intel  via:martin-thompson  optimization  speed  p99  data-structures  arrays  btrees  ssd  hardware 
november 2012 by jm
PCRE Performance Project
Excellent stuff. Using "sljit", a stackless platform-independent JIT compiler, this compiles Perl-compatible regular expressions to machine code on ARM, x86, MIPS and PowerPC platforms, resulting in 'similar matching speed to DFA based engines (like re2) on common patterns' with Perl compatibility. 'This work has been released as part of PCRE 8.20 and above. Now (PCRE 8.31), nearly all PCRE features are supported including UTF-8/16 and partial matching.'
pcre  regexps  regex  performance  optimization  jit  compilation  dfa  re2  via:akohli 
september 2012 by jm
Avoiding Hash Lookups in a Ruby Implementation
'If I were to sum up the past 6 years I've spent optimizing JRuby it would be with the following phrase: Get Rid Of Hash Lookups.'

This has been a particular theme of some recent optimization hacks I've been working on. Hashes may be O(1) to read, on average, but that doesn't necessarily mean they're the right tool for performance...

(via Declan McGrath)
via:declanmcgrath  hash  optimization  ruby  performance  jruby  hashing  data-structures  big-o  optimisation 
september 2012 by jm
Expensive lessons in Python performance tuning
some good advice for large-scale Python performance: prun and guppy for profiling, namedtuples for memory efficiency, and picloud for trivial EC2-based scale-out. (via Nelson)
picloud  prun  guppy  namedtuples  python  optimization  performance  tuning  profiling 
july 2012 by jm
Linux Profiling tools and techniques
great tips for system-level and app-level profiling on Linux from Padraig
profiling  optimization  linux  cache  valgrind 
april 2011 by jm
A high-performance compressor optimized for binary data -- 'designed to transmit data to the processor cache faster than a traditional, non-compressed, direct memory fetch via memcpy()' (via Bill de hOra)
via:dehora  compression  memcpy  caching  l1  software  memory  optimization  performance  python  pytables  from delicious
october 2010 by jm
The MySQL “swap insanity” problem and the effects of the NUMA architecture
very interesting; modern multicore x86 architectures use a NUMA memory architecture, which can cause a dip into swap, even when there appears to be plenty of free RAM available
linux  memory  mysql  optimization  performance  swap  tuning  vm  numa  swap-insanity  swapping  from delicious
september 2010 by jm
'a Perl script that generates an spreadsheet which loads up SpamAssassin configuration and known spam and ham messages. Once loaded, you can tweak individual SpamAssassin scores in the spreadsheet itself and see their effect on spam/ham classification in real-time. The script also shows you the number of false positives and negatives for a set of scores in real-time.' by Raj Mathur <raju at>
spamtune  spamassassin  rules  scores  optimization  tweaking  openoffice  from delicious
april 2010 by jm
"Source Code Optimisation", Felix von Leitner, Linux Kongress 2009 [PDF]
Good presentation on C compiler optimization, via Cal Henderson. 'People often write less readable code because they think it will produce faster code. Unfortunately, in most cases, the code will not be faster.' I particularly like 'Fancy-Schmancy Algorithms': 'If you have 10-100 elements, use a list, not a red-black tree; Fancy data structures help on paper, but rarely in reality. (More space overhead in the data structure, less L2 cache left for actual data.)'
via:iamcal  compilers  c  c++  optimization  coding  assembly  speed  from delicious
november 2009 by jm

related tags

aho-corasick  akka  algorithms  ami  analysis  aphorisms  arrays  ascii  assembly  atomic  autoscaling  availability  aws  b-trees  batch  benchmarking  benchmarks  big-data  big-o  bike-sharing  binary-search  bitmaps  borisbikes  btrees  bytebuffers  c  c++  cache  cache-friendly  caches  caching  cas  causal-profilers  cdn  cloudera  cms  coda-hale  coding  collections  columnar  compilation  compilers  compression  concurrency  conversion  counting  cpu  cycling  data  data-locality  data-oriented-programming  data-structures  debugging  dfa  disk  dublinbikes  ebs  elasticsearch  emr  escape-analysis  estimation  ethernet  exceptions  facebook  fast-path  fastutil  fleet  floating-point  fonts  fpga  frame-of-reference  funny  g1  g1gc  garbage-collection  gc  giraph  google  graduate-student-descent  guppy  gzip  hacks  hadoop  hardware  hash  hashing  hashmap  hashmaps  hex  hll  hppc  hyperloglog  hyperparameters  i7  insane  integers  intel  intel-core  interns  io  iostat  java  java-8  jdk  jetty  jit  jmm  john-rauser  jruby  jvm  koloboke  l1  lambdas  latency  laws  lazyset  linear-programming  linkedin  linux  low-latency  lua  lucene  machine-learning  map-reduce  maths  measurement  mechanical-sympathy  memcpy  memory  memory-layout  metamarkets  mpsc  multiplexing  mysql  namedtuples  netflix  netty  nginx  nitsanw  numa  numbers  ocr  off-heap-storage  one-liners  oop  open-source  openoffice  ops  optimisation  optimization  p99  papers  pcre  perf  performance  peter-lawrey  picloud  power  presentations  probabilistic  profiling  programming  prun  putordered  pytables  python  queues  radix-sort  ram  randomaccessfile  rdd  re2  rebalancing  regex  regexps  roaring-bitmaps  rob-pike  ruby  rules  scalability  scaling  sched_batch  scores  scraping  search  semisupervised-intern-descent  shuffle  shuffling  simd  slides  software  sort  sorting  sorting-networks  spamassassin  spamtune  spark  spdy  speed  speed-of-light  spsc  ssd  sse  ssl  stack-overflow  stock-markets  strchr  string-matching  strings  strlen  strstr  swap  swap-insanity  swapping  synchronization  sysadmin  systemtap  tables  tcp  terminology  terms  tips  tls  tools  toread  trading  tries  trove  tunables  tuning  tweaking  twitter  usl  valgrind  vectors  via:akohli  via:declanmcgrath  via:dehora  via:eoinbrazil  via:fanf  via:highscalability  via:iamcal  via:igrigorik  via:jgc  via:martin-thompson  via:proggit  vm  web  zopfli 

Copy this bookmark: