jm + low-latency   17

Amazon DynamoDB Accelerator (DAX)
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB that delivers up to a 10x performance improvement – from milliseconds to microseconds – even at millions of requests per second. DAX does all the heavy lifting required to add in-memory acceleration to your DynamoDB tables, without requiring developers to manage cache invalidation, data population, or cluster management.

No latency percentile figures, unfortunately. Also still in preview.
amazon  dynamodb  aws  dax  performance  storage  databases  latency  low-latency 
6 days ago by jm
Java lambdas and performance
Lambdas in Java 8 introduce some unpredictable performance implications, because the JVM relies on escape analysis to eliminate an object allocation on every lambda invocation. Peter Lawrey has some details.
lambdas  java-8  java  performance  low-latency  optimization  peter-lawrey  coding  escape-analysis 
july 2015 by jm
MappedBus
a Java-based low-latency, high-throughput message bus, built on top of a memory-mapped file; inspired by Java Chronicle, with the main difference that it's designed to efficiently support multiple writers – enabling use cases where the order of messages produced by multiple processes is important. MappedBus can also be described as an efficient IPC mechanism which enables several Java programs to communicate by exchanging messages.
ipc  java  jvm  mappedbus  low-latency  mmap  message-bus  data-structures  queue  message-passing 
may 2015 by jm
_Blade: a Data Center Garbage Collector_
Essentially, add a central GC scheduler to improve tail latencies in a cluster, by taking instances out of the pool to perform slow GC activity instead of letting them impact live operations. I've been toying with this idea for a while; nice to see a solid paper about it.
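
A minimal sketch of the idea, not the paper's algorithm: a hypothetical `GcScheduler` caps how many instances may be out of the serving pool for GC at once. The class, its method names and the `max_out` parameter are all illustrative assumptions.

```python
class GcScheduler:
    """Hypothetical central scheduler: caps how many instances may be
    out of the serving pool for garbage collection at the same time."""

    def __init__(self, instances, max_out=1):
        self.in_pool = set(instances)   # instances currently serving traffic
        self.collecting = set()         # instances drained for GC
        self.max_out = max_out

    def request_gc(self, instance):
        """Return True if the instance may leave the pool and collect now."""
        if instance not in self.in_pool or len(self.collecting) >= self.max_out:
            return False                # keep serving; ask again later
        self.in_pool.remove(instance)
        self.collecting.add(instance)
        return True

    def gc_done(self, instance):
        """Put an instance back into the pool after its collection finishes."""
        if instance in self.collecting:
            self.collecting.remove(instance)
            self.in_pool.add(instance)


sched = GcScheduler(["a", "b", "c"], max_out=1)
first = sched.request_gc("a")    # granted: nobody else is collecting
second = sched.request_gc("b")   # denied: "a" is still out of the pool
sched.gc_done("a")
third = sched.request_gc("b")    # granted now that "a" is back
```

A real scheduler would presumably also need to drain in-flight requests before the collection starts; that part is left out here.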
gc  latency  tail-latencies  papers  blade  go  java  scheduling  clustering  load-balancing  low-latency  performance 
april 2015 by jm
Nanex: "The stock market is rigged" [by HFTs]
All this evidence points to one inescapable conclusion: the order cancellations and trade executions just before, and during the trader's order were not a coincidence. This is premeditated, programmed theft, plain and simple. Michael Lewis probably said it best when he told 60 Minutes that the stock market is rigged.

Nanex have had enough, basically. Mad stuff.
hft  stocks  finance  market  trading  nanex  60-minutes  michael-lewis  scams  sec  regulation  low-latency  exploits  hacks 
july 2014 by jm
Simple Binary Encoding
an OSI layer 6 presentation for encoding/decoding messages in binary format to support low-latency applications. [...] SBE follows a number of design principles to achieve this goal. Adhering to these design principles sometimes means features available in other codecs will not be offered. For example, many codecs allow strings to be encoded at any field position in a message; SBE only allows variable length fields, such as strings, as fields grouped at the end of a message.

The SBE reference implementation consists of a compiler that takes a message schema as input and then generates language specific stubs. The stubs are used to directly encode and decode messages from buffers. The SBE tool can also generate a binary representation of the schema that can be used for the on-the-fly decoding of messages in a dynamic environment, such as for a log viewer or network sniffer.

The design principles drive the implementation of a codec that ensures messages are streamed through memory without backtracking, copying, or unnecessary allocation. Memory access patterns should not be underestimated in the design of a high-performance application. Low-latency systems in any language especially need to consider all allocation to avoid the resulting issues in reclamation. This applies for both managed runtime and native languages. SBE is totally allocation free in all three language implementations.

The end result of applying these design principles is a codec that has ~25X greater throughput than Google Protocol Buffers (GPB) with very low and predictable latency. This has been observed in micro-benchmarks and real-world application use. A typical market data message can be encoded, or decoded, in ~25ns compared to ~1000ns for the same message with GPB on the same hardware. XML and FIX tag value messages are orders of magnitude slower again.

The sweet spot for SBE is as a codec for structured data that is mostly fixed size fields which are numbers, bitsets, enums, and arrays. While it does work for strings and blobs, many may find some of the restrictions a usability issue. These users would be better off with another codec more suited to string encoding.
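
To make the layout rule concrete, here is a rough sketch using Python's `struct` module rather than the SBE tool itself. The message layout (instrument id, price, quantity, side, then a trailing symbol string) is a made-up assumption, not a real SBE schema, and Python still allocates objects behind the scenes; the point is only the fixed offsets, the variable-length field at the end, and the single forward pass over a reused buffer.

```python
import struct

# Hypothetical message layout (not a real SBE schema): fixed-size fields
# first at known offsets, one variable-length field last.
FIXED = struct.Struct("<QqIB")   # instrument id, price, quantity, side
VAR_LEN = struct.Struct("<H")    # length prefix for the trailing string


def encode(buf, instrument_id, price, quantity, side, symbol):
    """Write one message into a preallocated buffer; return bytes used."""
    FIXED.pack_into(buf, 0, instrument_id, price, quantity, side)
    offset = FIXED.size
    VAR_LEN.pack_into(buf, offset, len(symbol))
    offset += VAR_LEN.size
    buf[offset:offset + len(symbol)] = symbol
    return offset + len(symbol)


def decode(buf):
    """Read the message back in a single forward pass over the buffer."""
    instrument_id, price, quantity, side = FIXED.unpack_from(buf, 0)
    offset = FIXED.size
    (n,) = VAR_LEN.unpack_from(buf, offset)
    offset += VAR_LEN.size
    symbol = bytes(buf[offset:offset + n])
    return instrument_id, price, quantity, side, symbol


buffer = bytearray(64)           # reused across messages
used = encode(buffer, 42, 1012500, 300, 1, b"SPY")
message = decode(buffer)
```

Because every fixed field sits at a known offset, a reader can pick out a single field without parsing anything that precedes it; only the trailing string needs its length prefix.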
sbe  encoding  protobuf  protocol-buffers  json  messages  messaging  binary  formats  low-latency  martin-thompson  xml 
may 2014 by jm
Simple Binary Encoding
'SBE is an OSI layer 6 representation for encoding and decoding application messages in binary format for low-latency applications.'

Licensed under ASL2, C++ and Java supported.
sbe  encoding  codecs  persistence  binary  low-latency  open-source  java  c++  serialization 
december 2013 by jm
Jeff Dean - Taming Latency Variability and Scaling Deep Learning [talk]
'what Jeff Dean and team have been up to at Google'. Reducing request latency in a network SOA architecture using backup requests, etc., via Ilya Grigorik
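
A minimal sketch of the backup-request idea, assuming an `asyncio` client: send to one replica, and if it has not answered within a short delay, send the same request to a second replica and take whichever reply arrives first. `replica`, `with_backup` and `hedge_after` are hypothetical names, and the latencies are simulated with `asyncio.sleep`.

```python
import asyncio


async def replica(name, delay):
    """Stand-in for a remote call; `delay` simulates its latency in seconds."""
    await asyncio.sleep(delay)
    return name


async def with_backup(primary, backup, hedge_after):
    """Send to `primary`; if no reply within `hedge_after` seconds, also send
    to `backup` and return whichever answers first."""
    first = asyncio.ensure_future(primary())
    done, _ = await asyncio.wait({first}, timeout=hedge_after)
    if done:
        return first.result()
    second = asyncio.ensure_future(backup())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()            # drop the slower request
    return done.pop().result()


# Primary is slow (0.5 s), so the backup sent after 0.05 s wins.
winner = asyncio.run(
    with_backup(lambda: replica("primary", 0.5),
                lambda: replica("backup", 0.01),
                hedge_after=0.05)
)
```

The delay before hedging is the knob: set it too low and every request is sent twice, set it too high and the slow tail is not trimmed.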
youtube  talks  google  low-latency  soa  architecture  distcomp  jeff-dean  networking 
november 2013 by jm
Storm at - London Storm Meetup 2013-06-18
Not just a Storm success story. Interesting slides indicating where a startup *stopped* using Storm, as realtime wasn't useful to their customers.
storm  realtime  hadoop  cascading  python  cep  anti-spam  events  architecture  distcomp  low-latency  slides  rabbitmq 
october 2013 by jm
Barbarians at the Gateways - ACM Queue

I am a former high-frequency trader. For a few wonderful years I led a group of brilliant engineers and mathematicians, and together we traded in the electronic marketplaces and pushed systems to the edge of their capability.

Insane stuff -- FPGAs embedded in the network switches to shave off nanoseconds of latency.
low-latency  hft  via:nelson  markets  stock-trading  latency  fpgas  networking 
october 2013 by jm
Groundbreaking Results for High Performance Trading with FPGA and x86 Technologies
The enhancement in performance was achieved by providing a fast-path where trades are executed directly by the FPGA under the control of trigger rules processed by the x86 based functions. The latency is reduced further by two additional techniques in the FPGA – inline parsing and pre-emption. As market data enters the switch, the Ethernet frame is parsed serially as bits arrive, allowing partial information to be extracted and matched before the whole frame has been received. Then, instead of waiting until the end of a potential triggering input packet, pre-emption is used to start sending the overhead part of a response which contains the Ethernet, IP, TCP and FIX headers. This allows completion of an outgoing order almost immediately after the end of the triggering market feed packet.

Insane stuff. (Via Martin Thompson)
via:martin-thompson  insane  speed  low-latency  fpga  fast-path  trading  stock-markets  performance  optimization  ethernet 
october 2013 by jm
_MillWheel: Fault-Tolerant Stream Processing at Internet Scale_ [paper, pdf]
from VLDB 2013:

MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework’s fault-tolerance guarantees.

This paper describes MillWheel’s programming model as well as its implementation. The case study of a continuous anomaly detector in use at Google serves to motivate how many of MillWheel’s features are used. MillWheel’s programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. In practice, we find that MillWheel’s unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
millwheel  google  data-processing  cep  low-latency  fault-tolerance  scalability  papers  event-processing  stream-processing 
august 2013 by jm
Log4j 2: Performance close to insane
Nice writeup on Log4j 2's new AsyncAppender implementation, based on the LMAX Disruptor. Sounds pretty excellent:
“One nice little detail I should mention is that both Async Loggers and Async Appenders fix something that has always bothered me in Log4j-1.x, which is that they will flush the buffer after logging the last event in the queue. With Log4j-1.x, if you used buffered I/O, you often could not see the last few log events, as they were still stuck in the memory buffer. Your only option was setting immediateFlush to true, which forces disk I/O on every single log event and has a performance impact.
With Async Loggers and Appenders in Log4j-2.0 your log statements are all flushed to disk, so they are always visible, but this happens in a very efficient manner.”
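
The flush-when-the-queue-is-empty behaviour can be sketched roughly as below. This is not Log4j's implementation, just a single-threaded simulation of the consumer side, and `drain` is a hypothetical name: events are written through a buffered sink, and the sink is flushed only after the last queued event has been written.

```python
import io
import queue


def drain(events, sink):
    """Write every queued event, flushing only once the queue is empty.
    Returns the number of flushes performed."""
    flushes = 0
    while True:
        try:
            line = events.get_nowait()
        except queue.Empty:
            break
        sink.write(line + "\n")
        if events.empty():       # last event of the batch: make it visible
            sink.flush()
            flushes += 1
    return flushes


events = queue.Queue()
for i in range(1000):
    events.put(f"event {i}")

sink = io.StringIO()             # stands in for a buffered file
flush_count = drain(events, sink)
```

Under load the cost of a flush is spread over the whole batch, while a lone event is still flushed straight away because the queue is empty after it.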
logging  java  performance  async  disruptor  low-latency 
july 2013 by jm
Facebook announce Wormhole
Over the last couple of years, we have built and deployed a reliable publish-subscribe system called Wormhole. Wormhole has become a critical part of Facebook's software infrastructure. At a high level, Wormhole propagates changes issued in one system to all systems that need to reflect those changes – within and across data centers.

Facebook's Kafka-alike, basically, although with some additional low-latency guarantees. FB appear to be using it for multi-region and multi-AZ replication. Proprietary.
pub-sub  scalability  facebook  realtime  low-latency  multi-region  replication  multi-az  wormhole 
june 2013 by jm
Low-latency stock trading "jumps the gun" due to default NTP configuration settings
On June 3, 2013, trading in SPY exploded at 09:59:59.985, which is 15 milliseconds before the ISM's Manufacturing number released at 10:00:00. Activity in the eMini (traded in Chicago), exploded at 09:59:59.992, which is 8 milliseconds before the news release, but 7 milliseconds after SPY. Note how SPY and the eMini traded within a millisecond for the Consumer Confidence release last week, but the eMini lagged SPY by about 7 milliseconds for the ISM Manufacturing release. The simultaneous trading on Consumer Confidence is because that number is released at the same time in both NYC and Chicago.

The ISM Manufacturing number is probably released on a low latency feed in NYC, and then takes 5-7 milliseconds, due to the speed of light, to reach Chicago. Either the clock used to release the ISM number was 15 milliseconds fast, or someone (correctly) jumped the gun.

Update: [...] The clock used to release the ISM was indeed, 15 milliseconds fast. This could be from using the default setting of many NTP clients, which allows the clock to drift up to about 16 milliseconds before adjusting time.
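
The arithmetic in the excerpt is easy to check from the quoted timestamps. The variable names are mine, and the 16 ms figure is the approximate drift threshold mentioned in the update, not something verified here.

```python
from datetime import datetime

fmt = "%H:%M:%S.%f"
release = datetime.strptime("10:00:00.000", fmt)   # scheduled ISM release
spy = datetime.strptime("09:59:59.985", fmt)       # SPY activity (NYC)
emini = datetime.strptime("09:59:59.992", fmt)     # eMini activity (Chicago)


def ms(delta):
    """Convert a timedelta to whole milliseconds."""
    return round(delta.total_seconds() * 1000)


spy_lead = ms(release - spy)        # how early SPY moved
emini_lead = ms(release - emini)    # how early the eMini moved
emini_lag = ms(emini - spy)         # eMini relative to SPY
within_default_drift = spy_lead <= 16   # "about 16 milliseconds" per the update
```

This gives a 15 ms lead for SPY, 8 ms for the eMini and a 7 ms gap between them, which matches the quoted figures and fits inside the roughly 16 ms of drift attributed to default NTP client settings.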
ntp  time  synchronization  spy  trading  stocks  low-latency  clocks  internet 
june 2013 by jm
Extreme Performance with Java - Charlie Hunt [slides, PDF]
presentation slides for Charlie Hunt's 2012 QCon presentation, where he discusses 'what you need to know about a modern JVM in order to be effective at writing a low latency Java application'. The talk video is at
low-latency  charlie-hunt  performance  java  jvm  presentations  qcon  slides  pdf 
january 2013 by jm
Trident: a high-level abstraction for realtime computation
built on Storm:

Trident is a new high-level abstraction for doing realtime computing on top of Twitter Storm, available in Storm 0.8.0. It allows you to seamlessly mix high throughput (millions of messages per second), stateful stream processing with low latency distributed querying. If you're familiar with high level batch processing tools like Pig or Cascading, the concepts of Trident will be very familiar - Trident has joins, aggregations, grouping, functions, and filters. In addition to these, Trident adds primitives for doing stateful, incremental processing on top of any database or persistence store. Trident has consistent, exactly-once semantics, so it is easy to reason about Trident topologies.
distributed  realtime  twitter  storm  trident  distcomp  stream-processing  low-latency  nathan-marz 
october 2012 by jm
