mpm + performance   164

Hash table tradeoffs: CPU, memory, and variability
In this post, I describe a tri-factor model which I find more useful in the analysis of hash table algorithms and discuss several state-of-the-art algorithms in the context of this model.
datastructure  memory  performance 
8 days ago by mpm
Partisan is the design of an alternative runtime system for improved scalability and reduced latency in actor applications
erlang  actors  performance  overlay 
5 weeks ago by mpm
Distributed Transactional Systems Cannot Be Fast
We prove that no fully transactional system can provide fast read transactions (including read-only ones that are considered the most frequent in practice). Specifically, to achieve fast read transactions, the system has to give up support of transactions that write more than one object. We prove this impossibility result for distributed storage systems that are causally consistent, i.e., they do not require to ensure any strong form of consistency. Therefore, our result holds also for any system that ensures a consistency level stronger than causal consistency, e.g., strict serializability. The impossibility result holds even for systems that store only two objects (and support at least two servers and at least four clients). It also holds for systems that are partially replicated. Our result justifies the design choices of state-of-the-art distributed transactional systems and insists that system designers should not put more effort to design fully-functional systems that support both fast read transactions and ensure causal or any stronger form of consistency.
performance  database  consistency 
8 weeks ago by mpm
Leveraging eBPF for programmable network functions with IPv6 Segment Routing
With the advent of Software Defined Networks (SDN), Network Function Virtualisation (NFV) or Service Function Chaining (SFC), operators expect networks to support flexible services beyond the mere forwarding of packets. The network programmability framework which is being developed within the IETF by leveraging IPv6 Segment Routing enables the realisation of in-network functions. In this paper, we demonstrate that this vision of in-network programmability can be realised. By leveraging the eBPF support in the Linux kernel, we implement a flexible framework that allows network operators to encode their own network functions as eBPF code that is automatically executed while processing specific packets. Our lab measurements indicate that the overhead of calling such eBPF functions remains acceptable. Thanks to eBPF, operators can implement a variety of network functions. We describe the architecture of our implementation in the Linux kernel. This extension has been released with Linux 4.18. We illustrate the flexibility of our approach with three different use cases: delay measurements, hybrid networks and network discovery. Our lab measurements also indicate that the performance penalty of running eBPF network functions on Linux routers does not incur a significant overhead.
ebpf  networking  performance 
9 weeks ago by mpm
Performance Tuning Handbook
This handbook is a guide to achieving consistent and reliable low-latency operation of applications. It explores how modern operating systems schedule user applications and how to observe and modify application operating parameters; and introduces techniques for making sure that system hardware is working cooperatively with your application
linux  performance 
9 weeks ago by mpm
Efficient, Consistent Distributed Computation with Predictive Treaties
To achieve good performance, modern applications often par-tition their state across multiple geographically distributednodes. While this approach reduces latency in the commoncase, it can be challenging for programmers to use correctly,especially in applications that require strong consistency. Weintroducepredictive treaties, a mechanism that can signifi-cantly reduce distributed coordination without losing strongconsistency. The central insight behind our approach is thatmany computations can be expressed in terms of predicatesover distributed state that can be partitioned and enforcedlocally. Predictive treaties improve on previous work by al-lowing the locally enforced predicates to depend on time.Intuitively, by predicting the evolution of system state, coor-dination can be significantly reduced compared to static ap-proaches. We implemented predictive treaties in a distributedsystem that exposes them in an intuitive programming model.We evaluate performance on several benchmarks, includingTPC-C, showing that predictive treaties can significantly in-crease performance by orders of magnitude and can evenoutperform customized algorithms.
consistency  performance  coordination 
9 weeks ago by mpm
I/O Is Faster Than the CPU – Let’s Partition Resources and Eliminate (Most) OS Abstractions
We therefore proposea structure for an OS called parakernel, which eliminates most OS abstractions and provides interfaces for applications to leverage the full potential of the underlying hardware. The parakernel facilitates application-level parallelism by securely partitioning the resources and multiplexing only those resources that are not partitioned.
memory  performance  virtualization 
may 2019 by mpm
This brief tutorial shows where some of the most used and quoted sysctl/network parameters are located into the Linux network flow
linux  networking  performance 
april 2019 by mpm
Writing User Space Network Drivers
We present ixy, a userspace network driver designed for simplicity and educationalpurposes to show that fast packet IO is not black magic butcareful engineering. Ixy focuses on the bare essentials of user space packet processing: a packet forwarder including the whole NIC driver uses less than 1000 lines of C code.
networking  performance 
march 2019 by mpm
The MSG_ZEROCOPY flag enables copy avoidance for socket send calls. The feature is currently implemented for TCP sockets.
linux  networking  performance  memory 
january 2019 by mpm
Optimizing UDP for content delivery: GSO, pacing and zerocopy
UDP is a popular basis for the experimentation with and rapid deployment of new protocols. Experience shows a large gap in cycle efficiency compared to in-kernel TCP. This talk presents recent optimizations to the UDP stack that narrow this gap. At the core are optimizations long available to TCP: segmentation offload, pacing and zerocopy. We present the recently merged UDP GSO and SO_TXTIME interfaces and discuss select implementation details. We also review partial GSO as it fits in this context and discuss optimizations that are in submission: UDP zerocopy transmit with MSG_ZEROCOPY and UDP GRO.
linux  udp  networking  performance 
january 2019 by mpm
The eXpress Data Path: Fast Programmable Packet Processing in the Operating System Kernel
In XDP, the operating system kernel itself provides a safe execution environment for custom packet processing applications, executed in device driver context.
ebpf  linux  networking  performance 
january 2019 by mpm
Learning-based Dynamic Pinning of Parallelized Applications in Many-Core Systems
This paper introduces a reinforcement-learning based resource allocation framework for dynamic placement of threads of parallel applications to Non-Uniform Memory Access (NUMA) many-core systems. We propose a two-level learning-based decision making process, where at the first level each thread independently decides on which group of cores (NUMA node) it will execute, and on the second level it decides to which particular core from the group it will be pinned. Additionally, a novel performance-based learning dynamics is introduced to handle measurement noise and rapid variations in the performance of the threads. Experiments on a 24-core system show the improvement of up to 16% in the execution time of parallel applications under our framework, compared to the Linux operating system scheduler.
scheduling  performance 
august 2018 by mpm
OpenFastPath is an open source implementation of a high performance TCP/IP stack that provides features that network application developers need to cope with today’s fast-paced network.
networking  tcp  performance 
august 2018 by mpm
Mage: Online Interference-Aware Scheduling in Multi-Scale Heterogeneous Systems
In this episode, we discuss how to define and manage SLOs for services with dependencies, each of which may (or may not!) have their own SLOs.
scheduling  performance 
july 2018 by mpm
Open-sourcing Katran, a scalable network load balancer
Katran offers a software-based solution to load balancing with a completely reengineered forwarding plane that takes advantage of two recent innovations in kernel engineering: eXpress Data Path (XDP) and the eBPF virtual machine.
networking  performance  ebpf  linux 
june 2018 by mpm
LuaJIT + DPDK = fast and flexible packet processing at speeds above 100 Gbit/s.
lua  networking  performance 
april 2018 by mpm
Less hashing, same performance: Building a better Bloom filter
A standard technique from the hashing literature is to use two hash functions h1(x) and h2(x) to simulate additional hash functions of the form gi(x) = h1(x) + ih2(x). We demonstrate that this technique can be usefully applied to Bloom filters and related data structures. Specifically, only two hash functions are necessary to effectively implement a Bloom filter without any loss in the asymptotic false positive probability. This leads to less computation and potentially less need for randomness in practice
bloomfilter  performance 
march 2018 by mpm
This project is a KDAB R&D effort to create a standalone GUI for performance data. As the first goal, we want to provide a UI like KCachegrind around Linux perf. Looking ahead, we intend to support various other performance data formats under this umbrella.
february 2018 by mpm
Benchmarking and Analysis of Software Data Planes
This technical paper introduces the main concepts underpinning a systematic methodology for benchmarking and analysis of compute-native network SW data plane performance running on modern compute server HW
networking  performance  memory 
february 2018 by mpm
Simple userspace packet processing for educational purposes
networking  performance 
january 2018 by mpm
About Microservices, Containers and their Underestimated Impact on Network Performance
Microservices are used to build complex applications composed of small, independent and highly decoupled processes. Recently, microservices are often mentioned in one breath with container technologies like Docker. That is why operating system virtualization experiences a renaissance in cloud computing. These approaches shall provide horizontally scalable, easily deployable systems and a high-performance alternative to hypervisors. Nevertheless, performance impacts of containers on top of hypervisors are hardly investigated. Furthermore, microservice frameworks often come along with software defined networks. This contribution presents benchmark results to quantify the impacts of container, software defined networking and encryption on network performance. Even containers, although postulated to be lightweight, show a noteworthy impact to network performance. These impacts can be minimized on several system layers. Some design recommendations for cloud deployed systems following the microservice architecture pattern are derived.
networking  performance 
december 2017 by mpm
I Wrote The Fastest Hashtable
The trick is to use Robin Hood hashing with an upper limit on the number of probes. If an element has to be more than X positions away from its ideal position, you grow the table and hope that with a bigger table every element can be close to where it wants to be. Turns out that this works really well. X can be relatively small which allows some nice optimizations for the inner loop of a hashtable lookup.
datastructure  performance 
october 2017 by mpm
Dynamic tracing in Linux user and kernel space
Did you forget to insert probe points in your code? No problem. Learn how to insert them dynamically with uprobe and kprobe
observability  linux  debugging  performance 
july 2017 by mpm
usl4j And You
In this post, I’ll introduce you to Little’s Law, the Universal Scalability Law, and usl4j, a Java library for modeling system performance given a small set of real-world measurements.
scalability  performance 
june 2017 by mpm
BBR, the new kid on the TCP block
BBR is very similar to TCP Vegas, in that it is attempting to operate the TCP session at the point of onset of queuing at the path bottleneck. The specific issues being addressed by BBR is that the determination of both the underlying bottleneck available bandwidth and path RTT is influenced by a number of factors in addition to the data being passed through the network for this particular flow, and once BBR has determined its sustainable capacity for the flow, it attempts to actively defend it in order to prevent it from being crowded out by the concurrent operation of conventional AIMD protocols.
networking  tcp  protocol  performance 
may 2017 by mpm
Spanner vs. Calvin
I found it very difficult to find cases where an ideal implementation of Spanner theoretically outperforms an ideal implementation of Calvin.
consistency  consensus  database  performance  scalability 
april 2017 by mpm
Attack of the Killer Microseconds
Efficient handling of microsecond-scale events is becoming paramount for a new breed of low-latency I/O devices ranging from datacenter networking to emerging memories
april 2017 by mpm
Storage Performance Development Kit
The Storage Performance Development Kit (SPDK) provides a set of tools and libraries for writing high performance, scalable, user-mode storage applications. It achieves high performance by moving all of the necessary drivers into userspace and operating in a polled mode instead of relying on interrupts, which avoids kernel context switches and eliminates interrupt handling overhead
storage  performance 
march 2017 by mpm
SR-IOV for NFV Solutions
There are a number of published articles and papers from various Ethernet vendors touting their
SR-IOV solutions as being ideal for NFV; some focus on “Smart NIC” type of capabilities, others have
vSwitch offloading, and others speak of raw packet performance. Some of the data is compelling
enough to warrant closer examination. This paper outlines the results of that examination and
networking  performance 
march 2017 by mpm
Boost UDP Transaction Performance
Kernel configuration to fix udp latency & throughput issues
udp  performance  networking  actors 
february 2017 by mpm
At a high level, “perf c2c” will show you:
* The cachelines where false sharing was detected.
* The readers and writers to those cachelines, and the offsets where those accesses occurred.
linux  performance  memory 
february 2017 by mpm
Large Pages May Be Harmful on NUMA Systems
On NUMA systems the memory is spread across several physical nodes; using large pages may contribute to the imbalance in the distribution of memory controller requests and reduced locality of accesses, both of which can drive up memory latencies
linux  performance  memory 
february 2017 by mpm
Monitoring and Tuning the Linux Networking Stack: Sending Data
This blog post explains how computers running the Linux kernel send packets, as well as how to monitor and tune each component of the networking stack as packets flow from user programs to network hardware.
linux  networking  performance  monitoring 
february 2017 by mpm
BBR: Congestion-Based Congestion Control
Today TCP's loss-based congestion control is the primary cause of these problems. When bottleneck buffers are large, loss-based congestion control keeps them full, causing bufferbloat. When bottleneck buffers are small, loss-based congestion control misinterprets loss as a signal of congestion, leading to low throughput. Fixing these problems requires an alternative to loss-based congestion control
networking  performance  protocol 
december 2016 by mpm
uftrace is a program that simply runs the specified command until it exits. It intercepts and records the user define function calls which are called by the executed process.
linux  performance 
october 2016 by mpm
Packet mmap
This file documents the mmap() facility available with the PACKET socket interface on 2.4/2.6/3.x kernels
linux  networking  performance 
october 2016 by mpm
Capturing Packets in Linux at a Speed of Millions of Packets per Second without Using Third Party Libraries
My article will tell you how to accept 10 million packets per second without using such libraries as Netmap, PF_RING, DPDK and other. We are going to do this with Linux kernel version 3.16 and some code in C and C++.
networking  performance  linux 
october 2016 by mpm
Pitfalls of TSC usage
The TSC (time stamp counter) provided by x86 processors is a high-resolution counter that can be read with a single instruction (RDTSC)
time  performance 
september 2016 by mpm
Turtles on the wire: understanding how the OS uses the modern NIC
We're going to be focusing on how the Operating System sees NICs, what abstractions they provide together, how things have changed to deal with the increased tenancy and performance demands
networking  performance 
september 2016 by mpm
UdpPinger is our high performance UDP packet generation, reflection and collection library
networking  performance  testing 
august 2016 by mpm
Monitoring and Tuning the Linux Networking Stack: Receiving Data
This blog post explains how computers running the Linux kernel receive packets, as well as how to monitor and tune each component of the networking stack as packets flow from the network toward userland programs
linux  networking  monitoring  performance 
june 2016 by mpm
Much Faster Networking
David Riddoch talks about technologies that make very high performance networking possible on commodity servers and networks, with a special focus on kernel bypass technologies including sockets acceleration and NFV
networking  performance 
may 2016 by mpm
LightweighT Almost Lock-Less Oriented for C++ programs memory allocator
c++  memory  performance 
may 2016 by mpm
Fast Non-intrusive Memory Reclamation for Highly-Concurrent Data Structures
This paper proposes three novel ways to alleviate the costs of the memory barriers associated with hazard pointers and related techniques. These new proposals are backward-compatible with existing code that uses hazard pointers. They move the cost of memory management from the principal code path to the infrequent memory reclamation procedure, significantly reducing or eliminating memory barriers executed on the principal code path.
memory  performance  non-blocking 
may 2016 by mpm
XRay: A Function Call Tracing System
Debugging high throughput, low-latency C/C++ systems in production is hard. At Google we developed XRay, a function call tracing system that allows Google engineers to get accurate function call traces with negligible overhead when off and moderate overhead when on, suitable for services deployed in production. XRay enables efficient function call entry/exit logging with high accuracy timestamps, and can be dynamically enabled and disabled. This white paper describes the XRay tracing system and its implementation. It also describes future plans with open sourcing XRay and engaging open source communities
performance  c++  compiler 
may 2016 by mpm
Diskspd Utility
A feature-rich and versatile storage testing tool, Diskspd (version 2.0.15) combines robust and granular IO workload definition with flexible runtime and output options, creating an ideal tool for synthetic storage subsystem testing and validation.
performance  testing  io 
april 2016 by mpm
Observe the execution of any software. These tools use static and dynamic tracing of both user- and kernel-level code (via kprobes, uprobes, tracepoints, and USDT).
linux  performance  monitoring 
april 2016 by mpm
OpenFastPath is an open source implementation of a high performance TCP/IP stack that provides features that network application developers need to cope with today’s fast-paced network
networking  performance  tcp 
april 2016 by mpm
high frequency performance measurements for Linux
linux  performance 
march 2016 by mpm
Controlling Queue Delay
The solution for persistently full buffers, AQM (active queue management), has been known for two decades but has not been widely deployed because of implementation difficulties and general misunderstanding about Internet packet loss and queue dynamics. Unmanaged buffers are more critical today since buffer sizes are larger, delay-sensitive applications are more prevalent, and large (streaming) downloads common. The continued existence of extreme delays at the Internet’s edge can impact its usef...
networking  performance  memory 
january 2016 by mpm
Array Layouts for Comparison-Based Searching
We attempt to determine the best order and search algorithm to store n comparable data items in an array, A, of length n so that we can, for any query value, x, quickly find the smallest value in A that is greater than or equal to x. In particular, we consider the important case where there are many such queries to the same array, A, which resides entirely in RAM. In addition to the obvious sorted order/binary search combination we consider the Eytzinger (BFS) layout normally used for heaps, an implicit B-tree layout that generalizes the Eytzinger layout, and the van Emde Boas layout commonly used in the cache-oblivious algorithms literature.
After extensive testing and tuning on a wide variety of modern hardware, we arrive at the conclusion that, for small values of n, sorted order, combined with a good implementation of binary search is best. For larger values of n, we arrive at the surprising conclusion that the Eytzinger layout is usually the fastest. The latter conclusion is unexpected and goes counter to earlier experimental work by Brodal, Fagerberg, and Jacob (SODA~2003), who concluded that both the B-tree and van Emde Boas layouts were faster than the Eytzinger layout for large values of n.
memory  performance 
january 2016 by mpm
Each object stores a hashed string value and a pointer to the database in which the original string value is stored. This allows retrieving the string value when needed while also getting the performance benefits from hashed strings
c++  performance 
december 2015 by mpm
ktap is a new script-based dynamic tracing tool for Linux, it uses a scripting language and lets users trace the Linux kernel dynamically. ktap is designed to give operational insights with interoperability that allows users to tune, troubleshoot and extend kernel and application. It's similar with Linux Systemtap and Solaris Dtrace
linux  performance  observability  lua 
november 2015 by mpm
Causal profiling measures optimization potential for serial, parallel, and asynchronous programs without instrumentation of special handling for library calls and concurrency primitives. Instead, a causal profiler uses performance experiments to predict the effect of optimizations. This allows the profiler to establish causality: "optimizing function X will have effect Y," exactly the measurement developers had assumed they were getting all along
november 2015 by mpm
High-performance multicore-scalable data structures and benchmarks
c++  concurrency  performance  testing 
october 2015 by mpm
Snabb Switch: kernel-bypass networking illustrated
Snabb Switch is a networking application that runs on Linux. However, it does not typically use Linux's networking functionality. Instead it negotiates with the kernel to take control of whole PCI network devices and perform I/O directly without using the kernel as a middle-man. This is kernel-bypass networking
linux  networking  performance 
october 2015 by mpm
The problem of concurrent memory allocation is to find the right balance between temporal and spatial performance and scalability across a large range of workloads. Our contributions to address this problem are: uniform treatment of small and big objects through the idea of virtual spans, efficiently and effectively reclaiming unused memory through fast and scalable global data structures, and constant-time (modulo synchronization) allocation and deallocation operations that trade off memory reuse and spatial locality without being subject to false sharing
memory  concurrency  performance 
october 2015 by mpm
NUMA-Aware Reader-Writer Locks
Non-Uniform Memory Access (NUMA) architectures are gaining importance in mainstream computing systems due to the rapid growth of multi-core multi-chip machines. Extracting the best possible performance from these new machines will require us to revisit the design of the concurrent algorithms and synchronization primitives which form the building blocks of many of today’s applications. This paper revisits one such critical synchronization primitive – the reader-writer lock. We present what is, to the best of our knowledge, the first family of reader-writer lock algorithms tailored to NUMA architectures. We present several variations which trade fairness between readers and writers for higher concurrency among readers and better back-to-back batching of writers from the same NUMA node. Our algorithms leverage the lock cohorting technique to manage synchronization between writers in a NUMA-friendly fashion, binary flags to coordinate readers and writers, and simple distributed reader counter implementations to enable NUMA-friendly concurrency among readers. The end result is a collection of surprisingly simple NUMA-aware algorithms that outperform the state-of-theart reader-writer locks by up to a factor of 10 in our microbenchmark experiments. To evaluate our algorithms in a realistic setting we also present performance results of the kccachetest benchmark of the Kyoto-Cabinet distribution, an open-source database which makes heavy use of pthread reader-writer locks. Our locks boost the performance of kccachetest by up to 40 % over the best prior alternatives
concurrency  performance 
october 2015 by mpm
Speeding up Packet I/O in Virtual Machines
Most of the work on VM network performance has focused so far on bulk TCP traffic, which covers classical applications of virtualization. Completely new "paravirtualized devices" (Xenfront, VIRTIO, vmxnet) have been designed and implemented to improve network throughput
networking  performance 
october 2015 by mpm
Common Trace Format
The Common Trace Format (CTF) is a binary trace format designed to be very fast to write without compromising great flexibility. It allows traces to be natively generated by any C/C++ application or system, as well as by bare-metal (hardware) components
performance  monitoring  logging 
october 2015 by mpm
Log2: A Cost-Aware Logging Mechanism for Performance Diagnosis
Logging has been a common practice for monitoring and diagnosing performance issues. However, logging comes at a cost, especially for large-scale online service systems. First, the overhead incurred by intensive logging is non-negligible. Second, it is costly to diagnose a performance
issue if there is a tremendous amount of redundant logs. Therefore, we believe that it is important to limit the overhead incurred by logging, without sacrificing the logging effectiveness. In this paper we propose Log2, a cost-aware logging mechanism. Given a “budget” (defined as the maximum volume of logs allowed to be outputted in a time interval), Log2 makes the “whether to log” decision through a two-phase filtering mechanism. In the first phase, a large number of irrelevant logs are discarded efficiently. In the second phase, useful logs are cached and outputted while complying with logging budget. In this way, Log2 keeps the useful logs and discards the less useful ones. We have implemented Log2 and evaluated it on an open source system as well as a real-world online service system from Microsoft. The experimental results show that Log2 can control logging overhead while preserving logging effectiveness
logging  performance 
july 2015 by mpm
How to achieve low latency with 10Gbps Ethernet
[How to] optimize our UDP application for latency
linux  networking  performance  udp 
july 2015 by mpm
PerfJ is a wrapper of linux perf for java programs
java  performance  linux 
june 2015 by mpm
How to receive a million packets per second
On Linux, how hard is it to write a program that receives 1 million UDP packets per second? Hopefully, answering this question will be a good lesson about the design of a modern networking stack
linux  networking  performance 
june 2015 by mpm
MoldUDP is a networking protocol that allows efficient and scaleable transmission of data messages in a "one transmitter to many listeners" scenario. MoldUDP is a lightweight protocol layer built on top of UDP that provides a mechanism for listeners to detect and re-request missed packets.
networking  performance  udp 
may 2015 by mpm
Scaling Concurrent Log-Structured Data Stores
Log-structured data stores (LSM-DSs) are widely accepted as the state-of-the-art implementation of key-value stores. They replace random disk writes with sequential I/O, by accumulating large batches of updates in an in-memory data structure and merging it with the on-disk store in the background. While LSM-DS implementations proved to be highly successful at masking the I/O bottleneck, scaling them up on multicore CPUs remains a challenge. This is nontrivial due to their often rich APIs, as well as the need to coordinate the RAM access with the background I/O.

We present cLSM, an algorithm for scalable concurrency in LSM-DS, which exploits multiprocessor-friendly data structures and non-blocking synchronization. cLSM supports a rich API, including consistent snapshot scans and general non-blocking read-modify-write operations.

We implement cLSM based on the popular LevelDB key-value store, and evaluate it using intensive synthetic workloads as well as ones from production web-serving applications. Our algorithm outperforms state of the art LSM-DS implementations, improving throughput by 1.5x to 2.5x. Moreover, cLSM demonstrates superior scalability with the number of cores (successfully exploiting twice as many cores as the competition)
database  storage  performance 
may 2015 by mpm
Systematic Process to Reduce Linux OS Jitter
Finding the cause of hiccups/jitters in a a Linux system is black magic
linux  performance 
april 2015 by mpm
A complete ftrace- and uprobes-based tracer (user, libraries, kernel) for GNU/Linux
linux  performance  testing  managability 
march 2015 by mpm
Exponential Backoff And Jitter
The solution isn’t to remove backoff. It’s to add jitter. Initially, jitter may appear to be a counter-intuitive idea: trying to improve the performance of a system by adding randomness. The time series above makes a great case for jitter – we want to spread out the spikes to an approximately constant rate.
performance  concurrency 
march 2015 by mpm
a toolkit for building high-performance servers
performance  networking  concurrency 
february 2015 by mpm
DPDK is a set of libraries and drivers for fast packet processing
linux  networking  performance 
february 2015 by mpm
Fast Non-Standard Data Structures for Python
Python provides great built-in types like dict, list, tuple and set; there are also array, collections, heapq modules in the standard library; this article is an overview of external lesser known packages with fast C/C++ based data structures usable from Python
python  performance 
february 2015 by mpm
From ARIES to MARS: Reengineering Transaction Management for Next-Generation, Solid-State Drives
Systems that provide powerful transaction mechanisms often rely on write-ahead logging (WAL) implementations that were designed with slow, disk-based systems in mind. The emerging class of fast, byte-addressable, non-volatile memory (NVM) technologies (e.g., phase change memories, spin-torque MRAMs, and the memristor), however, present performance characteristics very different from both disks and flash-based SSDs. This paper addresses the problem of designing a WAL scheme optimized for these fast NVM-based storage systems. We examine the features that a system like ARIES, a WAL algorithm popular for databases, must provide and separate them from the implementation decisions ARIES makes to optimize for disk-based systems. We design a new NVMoptimized WAL scheme (called MARS) in tandem with a novel SSD multi-part atomic write primitive that combine to provide the same features as ARIES does without any of the disk-centric limitations. The new atomic write primitive makes the log’s contents visible to the application, allowing for a simpler and faster implementation. MARS provides atomicity, durability, and high performance by leveraging the enormous internal bandwidth and high degree of parallelism that advanced SSDs will provide. We have implemented MARS and the novel visible atomic write primitive in a next-generation SSD. This paper demonstrates the overhead of the primitive is minimal compared to normal writes, and our hardware provides large speedups for transactional updates to hash tables, b-trees, and large graphs. MARS outperforms ARIES by up to 3.7 ⇥ while reducing software complexity
database  io  performance 
january 2015 by mpm
Nah Lock: A Lock-Free Memory Allocator
The first allocator had poor scaling on par with libc, but I learned enough from it to write a second lockfree allocator that scales approximately linearly up to 30 cores. It scales sublinearly but slightly better than tcmalloc up to 64 cores
memory  performance  non-blocking 
november 2014 by mpm
Efficient reliable unicast and multicast transport protocol.
messaging  performance 
november 2014 by mpm
The SprayList: A Scalable Relaxed Priority Queue
High-performance concurrent priority queues are essential for applications such as task scheduling and discrete event simulation. Unfortunately even the best performing implementations do not scale past a number of threads in the single digits. This is because of the sequential bottleneck in accessing the elements at the head of the queue in order to perform a DeleteMin operation. In this paper, we present the SprayList, a scalable priority queue with relaxed ordering semantics. Starting from a nonblocking SkipList, the main innovation behind our design is that the DeleteMin operations avoid a sequential bottleneck by “spraying” themselves onto the head of the SkipList list in a coordinated fashion. The spraying is implemented using a carefully designed random walk, so that DeleteMin always returns an element among the first O(p polylog p) in the list, where p is the number of threads. We prove that the expected running time of a DeleteMin operation is poly-logarithmic in p, independent of the size of the list, and also provide analytic upper bounds on the number of possible priority inversions for an element. Our experiments show that the relaxed semantics allow the data structure to scale for very high thread counts, comparable to a classic unordered SkipList. Furthermore, we observe that, for reasonably parallel workloads, the scalability benefits of relaxation considerably outweigh the additional work due to out-of-order execution.
concurrency  performance  queuing 
november 2014 by mpm
MultiQueues: Simpler, Faster, and Better Relaxed Concurrent Priority Queues
Priority queues with parallel access are an attractive data structure for applications like prioritized online scheduling, discrete event simulation, or branch-and-bound. However, a classical priority queue constitutes a severe bottleneck in this context, leading to very small throughput. Hence, there has been significant interest in concurrent priority queues with a somewhat relaxed semantics where deleted elements only need to be close to the minimum. In this paper we present a very simple approach based on multiple sequential priority queues. It turns out to outperform previous more complicated data structures while at the same time improving the quality of the returned elements.
concurrency  performance 
november 2014 by mpm
Fastsocket is a scalable kernel TCP socket implementation and achieves a straight linear performance growth when scaling up to 24 CPU cores
linux  networking  performance  tcp 
october 2014 by mpm
« earlier      
per page:    204080120160

Copy this bookmark: