A Branchless UTF-8 Decoder
This week I took a crack at writing a branchless UTF-8 decoder: a function that decodes a single UTF-8 code point from a byte stream without any if statements, loops, short-circuit operators, or other sorts of conditional jumps. [...] Why branchless? Because high performance CPUs are pipelined. That is, a single instruction is executed over a series of stages, and many instructions are executed in overlapping time intervals, each at a different stage.

Neat hack (via Tony Finch)
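A rough sketch of the table-driven idea, in Python for readability rather than speed. The table layout and the name `decode_one` are my own assumptions, not the post's code, and this version does no validation of malformed input.

```python
# Sequence length, indexed by the top five bits of the first byte.
# 0 marks bytes that cannot start a sequence (continuation bytes, 0xF8+).
LENGTHS = [1] * 16 + [0] * 8 + [2] * 4 + [3] * 2 + [4, 0]
# Per-length mask selecting the payload bits of the first byte.
MASKS = [0x00, 0x7F, 0x1F, 0x0F, 0x07]
# Per-length right shift that discards the bytes we over-read.
SHIFTS = [0, 18, 12, 6, 0]


def decode_one(buf, pos=0):
    """Decode one code point from buf at pos; returns (code_point, length).

    No if statements or loops: every input runs the same operations.
    Assumes well-formed input and at least 4 readable bytes at pos
    (pad the buffer with zeros), because the over-read is unconditional.
    """
    length = LENGTHS[buf[pos] >> 3]
    cp = (buf[pos] & MASKS[length]) << 18
    cp |= (buf[pos + 1] & 0x3F) << 12
    cp |= (buf[pos + 2] & 0x3F) << 6
    cp |= buf[pos + 3] & 0x3F
    return cp >> SHIFTS[length], length
```

The trick is to always assemble four bytes' worth of bits and then shift away whatever was read past the end of the sequence, so the sequence length selects table entries instead of a code path.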
Intel pcj library for persistent memory-oriented data structures
This is a "pilot" project to develop a library for Java objects stored in persistent memory. Persistent collections are being emphasized because many applications for persistent memory seem to map well to the use of collections. One of this project's goals is to make programming with persistent objects feel natural to a Java developer, for example, by using familiar Java constructs when incorporating persistence elements such as data consistency and object lifetime.

The breadth of persistent types is currently limited and the code is not performance-optimized. We are making the code available because we believe it can be useful in experiments to retrofit existing Java code to use persistent memory and to explore persistent Java programming in general.

(via Mario Fusco)
a new common C++ library from Google, Apache-licensed.
LambCI — a serverless build system
Run CI builds on Lambda:
LambCI is a tool I began building over a year ago to run tests on our pull requests and branches at Uniqlo Mobile. Inspired at the inaugural ServerlessConf a few weeks ago, I recently put some work into hammering it into shape for public consumption.
It was born of a dissatisfaction with the two current choices for automated testing on private projects. You can either pay for it as a service (Travis, CircleCI, etc) — where 3 developers needing their own build containers might set you back a few hundred dollars a month. Or you can set up a system like Jenkins, Strider, etc and configure and manage a database, a web server and a cluster of build servers.
In both cases you’ll be under- or overutilized, waiting for servers to free up or paying for server power you’re not using. And this, for me, is where the advantage of a serverless architecture really comes to light: 100% utilization, coupled with instant invocations.
Native Memory Tracking
Java 8 HotSpot feature to monitor and diagnose native memory leaks
How to Optimize Garbage Collection in Go
In this post, we’ll share a few powerful optimizations that mitigate many of the performance problems common to Go’s garbage collection (we will cover “fun with deadlocks” in a follow-up). In particular, we’ll share how embedding structs, using sync.Pool, and reusing backing arrays can minimize memory allocations and reduce garbage collection overhead.
Teaching Students to Code - What Works
Lynn Langit describing her work as part of Microsoft Digigirlz and TKP to teach thousands of kids worldwide to code. Describes a curriculum from "K" (4-6-year olds) learning computational thinking with a block-based programming environment like Scratch, up to University level, solving problems with public clouds like AWS' free tier.
GTK+ switches build from Autotools to Meson
'The main change is that now GTK+ takes about ⅓ of the time to build compared to the Autotools build, with likely bigger wins on older/less powerful hardware; the Visual Studio support on Windows should be at least a couple of orders of magnitude easier (shout out to Fan Chun-wei for having spent so, so many hours ensuring that we could even build on Windows with Visual Studio and MSVC); and maintaining the build system should be equally easier for everyone on any platform we currently support.'

Meson itself appears to be Python-based and AL2-licensed open source.

On the downside, though, the Meson file is basically a Python script, which is something I'm really not fond of :(
Foursquare's open source repo, where they extract reusable components for open sourcing -- I like the approach of using a separate top level module path for OSS bits
A general purpose counting filter
This paper introduces a new AMQ data structure, a Counting Quotient Filter, which addresses all of these shortcomings and performs extremely well in both time and space: CQF performs in-memory inserts and queries up to an order of magnitude faster than the original quotient filter structure from which it takes its inspiration, several times faster than a Bloom filter, and similarly to a cuckoo filter. The CQF structure is comparable or more space efficient than all of them too. Moreover, CQF does all of this while supporting counting, outperforming all of the other forms in both dimensions even though they do not. In short, CQF is a big deal!
Pipeline Development Tools
"" -- command line Jenkins pipeline linting
terrible review for Solidity as a programming environment in HN
"Solidity/EVM is by far the worst programming environment I have ever encountered. It would be impossible to write even toy programs correctly in this language, yet it is literally called "Solidity" and used to program a financial system that manages hundreds of millions of dollars."

Via Tony Finch
Undefined Behavior in 2017
This is an extremely detailed post on the state of dynamic checkers in C/C++ (via the inimitable Marc Brooker):
Recently we’ve heard a few people imply that problems stemming from undefined behaviors (UB) in C and C++ are largely solved due to ubiquitous availability of dynamic checking tools such as ASan, UBSan, MSan, and TSan. We are here to state the obvious — that, despite the many excellent advances in tooling over the last few years, UB-related problems are far from solved — and to look at the current situation in detail.
Exactly-once Support in Apache Kafka – Jay Kreps
If you’re one of the people who think [exactly-once support is impossible], I’d ask you to take an actual look at what we actually know to be possible and impossible, and what has been built in Kafka, and hopefully come to a more informed opinion. So let’s address this in two parts. First, is exactly-once a theoretical impossibility? Second, how does Kafka support it.
A Brief History of the UUID · Segment Blog
This is great, by Rick Branson. I didn't realise UUIDs came from Apollo
An empirical study on the correctness of formally verified distributed systems
We must recognise that even formal verification can leave gaps and hidden assumptions that need to be teased out and tested, using the full battery of testing techniques at our disposal. Building distributed systems is hard. But knowing that shouldn’t make us shy away from trying to do the right thing, instead it should make us redouble our efforts in our quest for correctness.
Enough with the microservices
Good post!
Much has been written on the pros and cons of microservices, but unfortunately I’m still seeing them as something being pursued in a cargo cult fashion in the growth-stage startup world. At the risk of rewriting Martin Fowler’s Microservice Premium article, I thought it would be good to write up some thoughts so that I can send them to clients when the topic arises, and hopefully help people avoid some of the mistakes I’ve seen. The mistake of choosing a path towards a given architecture or technology on the basis of so-called best practices articles found online is a costly one, and if I can help a single company avoid it then writing this will have been worth it.
Towards true continuous integration – Netflix TechBlog – Medium
Netflix discuss how they handle the eternal dependency-management problem which arises with lots of microservices:
Using the monorepo as our requirements specification, we began exploring alternative approaches to achieving the same benefits. What are the core problems that a monorepo approach strives to solve? Can we develop a solution that works within the confines of a traditional binary integration world, where code is shared? Our approach, while still experimental, can be distilled into three key features:

Publisher feedback — provide the owner of shared code fast feedback as to which of their consumers they just broke, both direct and transitive. Also, allow teams to block releases based on downstream breakages. Currently, our engineering culture puts sole responsibility on consumers to resolve these issues. By giving library owners feedback on the impact they have to the rest of Netflix, we expect them to take on additional responsibility.

Managed source — provide consumers with a means to safely increment library versions automatically as new versions are released. Since we are already testing each new library release against all downstreams, why not bump consumer versions and accelerate version adoption, safely.

Distributed refactoring — provide owners of shared code a means to quickly find and globally refactor consumers of their API. We have started by issuing pull requests en masse to all Git repositories containing a consumer of a particular Java API. We’ve run some early experiments and expect to invest more in this area going forward.

What I find interesting is that Amazon dealt effectively with the first two many years ago, in the form of their "Brazil" build system, and Google do the latter (with Refaster?). It would be amazing to see such a system released into an open source form, but maybe it's just too heavyweight for anyone other than a giant software company on the scale of a Google, Netflix or Amazon.
on Martin Fowler
mcfunley: 'I think at least 50% of my career has been either contributing to or unwinding one [Martin] Fowler-inspired disaster or another.'

See also: continuous deployment, polyglot programming, microservices

Relevant meme:
A Programmer’s Introduction to Unicode – Nathan Reed’s coding blog
Fascinating Unicode details -- a lot of which were new to me. Love the heat map of usage in Wikipedia:
One more interesting way to visualize the codespace is to look at the distribution of usage—in other words, how often each code point is actually used in real-world texts. Below is a heat map of planes 0–2 based on a large sample of text from Wikipedia and Twitter (all languages). Frequency increases from black (never seen) through red and yellow to white.

You can see that the vast majority of this text sample lies in the BMP, with only scattered usage of code points from planes 1–2. The biggest exception is emoji, which show up here as the several bright squares in the bottom row of plane 1.
Hadoop Internals
This is the best documentation on the topic I've seen in a while
'Software Engineering at Google'
20 pages of Google's software dev practices, with emphasis on the build system (since it was written by the guy behind Blaze). Naturally, some don't make a whole lot of sense outside of Google, but still some good stuff here
The Rise of the Data Engineer
Interesting article proposing a new discipline, focused on the data warehouse, from Maxime Beauchemin (creator and main committer on Apache Airflow and Airbnb’s Superset)
'Rules of Machine Learning: Best Practices for ML Engineering' from Martin Zinkevich
'This document is intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning from around Google. It presents a style for machine learning, similar to the Google C++ Style Guide and other popular guides to practical programming. If you have taken a class in machine learning, or built or worked on a machine­-learned model, then you have the necessary background to read this document.'

Full of good tips, if you wind up using ML in a production service.
Falsehoods Programmers Believe About CSVs
Much of my professional work for the last 10+ years has revolved around handling, importing and exporting CSV files. CSV files are frustratingly misunderstood, abused, and most of all underspecified. While RFC4180 exists, it is far from definitive and goes largely ignored.

Partially as a companion piece to my recent post about how CSV is an encoding nightmare, and partially an expression of frustration, I've decided to make a list of falsehoods programmers believe about CSVs. I recommend my previous post for a more in-depth coverage on the pains of CSVs encodings and how the default tooling (Excel) will ruin your day.

(via Tony Finch)
Reproducible research: Stripe’s approach to data science
This is intriguing -- using Jupyter notebooks to embody data analysis work, and ensure it's reproducible, which brings better rigour similarly to how unit tests improve coding. I must try this.
Reproducibility makes data science at Stripe feel like working on GitHub, where anyone can obtain and extend others’ work. Instead of islands of analysis, we share our research in a central repository of knowledge. This makes it dramatically easier for anyone on our team to work with our data science research, encouraging independent exploration.

We approach our analyses with the same rigor we apply to production code: our reports feel more like finished products, research is fleshed out and easy to understand, and there are clear programmatic steps from start to finish for every analysis.
Subreddit devoted to becoming a software developer in Ireland, with a decent wiki
november 2016 by jm - Parsing JSON is a Minefield 💣
Crockford chose not to version [the] JSON definition: 'Probably the boldest design decision I made was to not put a version number on JSON so there is no mechanism for revising it. We are stuck with JSON: whatever it is in its current form, that’s it.' Yet JSON is defined in at least six different documents.

"Boldest". ffs. :facepalm:
Simple testing can prevent most critical failures
Specifically, the following 3 classes of errors were implicated in 92% of the major production outages in this study and could have been caught with simple code review:
Error handlers that ignore errors (or just contain a log statement); error handlers with “TODO” or “FIXME” in the comment; and error handlers that catch an abstract exception type (e.g. Exception or Throwable in Java) and then take drastic action such as aborting the system.

(Interestingly, the latter was a particular favourite approach of some misplaced "fail fast"/"crash-only software design" dogma in Amazon. I wasn't a fan)
Unchecked exceptions for IO considered harmful - Google Groups
Insightful thread from the mechanical sympathy group, regarding the checked-vs-unchecked style question:
Peter Lawrey: Our view is that Checked Exception makes more sense for library writers as they can explicitly pass off errors to the caller. As a caller, especially if you are new to a product, you don't understand the exceptions or what you can do about them.  They add confusion.

For this reason we use checked exceptions internally in the lower layers and try to avoid passing them in our higher level interfaces. Note: A high percentage of our fall backs are handling IOExceptions and recovering from them. [....]

My experience is that the more complex and layered your libraries the more essential checked exceptions become. I see them as essential for scalability of your software.
Regexp Disaster
Course notes from Gerald Jay Sussman's "Adventures in Advanced Symbolic Programming" class at MIT. Hard to argue with this:
The syntax of the regular-expression language is awful. There are various incompatable forms of the language and the quotation conventions are baroquen [sic]. Nevertheless, there is a great deal of useful software, for example grep, that uses regular expressions to specify the desired behavior.

Although regular-expression systems are derived from a perfectly good mathematical formalism, the particular choices made by implementers to expand the formalism into useful software systems are often disastrous: the quotation conventions adopted are highly irregular; the egregious misuse of parentheses, both for grouping and for backward reference, is a miracle to behold. In addition, attempts to increase the expressive power and address shortcomings of earlier designs have led to a proliferation of incompatible derivative languages.

(via Rob Pike's twitter)
'a Ruby regular expression editor and tester'. Great for prototyping regexps with a little set of test data, providing a neat permalink for the results
IMDB on automation
quotable: "I spend a lot of time on this task. I should write a program automating it!"
Koloboke Collections
Interesting new collections lib for Java 6+; generates Map-like and Set-like collections at compile time based on the contract annotations you desire. Fat (20MB) library-based implementation also available
Lamport timestamps
'The algorithm of Lamport timestamps is a simple algorithm used to determine the order of events in a distributed computer system. As different nodes or processes will typically not be perfectly synchronized, this algorithm is used to provide a partial ordering of events with minimal overhead, and conceptually provide a starting point for the more advanced vector clock method. They are named after their creator, Leslie Lamport.'

See also vector clocks (which I think would be generally preferable nowadays).
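A minimal sketch of the clock rules described in the quote. The class and method names here are my own, not from any particular library.

```python
class LamportClock:
    """A single process's logical clock."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: advance the counter."""
        self.time += 1
        return self.time

    def send(self):
        """Sending counts as an event; the result travels with the message."""
        return self.tick()

    def receive(self, msg_time):
        """On receipt, jump past both our own clock and the sender's."""
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()            # a is at 1
stamp = a.send()    # a is at 2, and the message carries 2
b.receive(stamp)    # b jumps from 0 to 3
```

If one event causally precedes another, its timestamp is smaller. The reverse does not hold, which is the gap vector clocks close.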
GitLab Container Registry
GitLab continue to out-innovate GitHub, which is just wanking around with breaking the UI these days
Gradle plugin that allows easy integration with the infer static analyzer
Go best practices, six years in
from Peter Bourgon. Looks like a good list of what to do and what to avoid
Darts, Dice, and Coins
Earlier this year, I asked a question on Stack Overflow about a data structure for loaded dice. Specifically, I was interested in answering this question: "You are given an n-sided die where side i has probability p_i of being rolled. What is the most efficient data structure for simulating rolls of the die?"

This data structure could be used for many purposes. For starters, you could use it to simulate rolls of a fair, six-sided die by assigning probability 1/6 to each of the sides of the die, or to simulate a fair coin by simulating a two-sided die where each side has probability 1/2 of coming up. You could also use this data structure to directly simulate the total of two fair six-sided dice being thrown by having an 11-sided die (whose faces were 2, 3, 4, ..., 12), where each side was appropriately weighted with the probability that this total would show if you used two fair dice. However, you could also use this data structure to simulate loaded dice. For example, if you were playing craps with dice that you knew weren't perfectly fair, you might use the data structure to simulate many rolls of the dice to see what the optimal strategy would be. You could also consider simulating an imperfect roulette wheel in the same way.

Outside the domain of game-playing, you could also use this data structure in robotics simulations where sensors have known failure rates. For example, if a range sensor has a 95% chance of giving the right value back, a 4% chance of giving back a value that's too small, and a 1% chance of handing back a value that's too large, you could use this data structure to simulate readings from the sensor by generating a random outcome and simulating the sensor reading in that case.

The answer I received on Stack Overflow impressed me for two reasons. First, the solution pointed me at a powerful technique called the alias method that, under certain reasonable assumptions about the machine model, is capable of simulating rolls of the die in O(1) time after a simple preprocessing step. Second, and perhaps more surprisingly, this algorithm has been known for decades, but I had not once encountered it! Considering how much processing time is dedicated to simulation, I would have expected this technique to be better-known. A few quick Google searches turned up a wealth of information on the technique, but I couldn't find a single site that compiled together the intuition and explanation behind the technique.

(via Marc Brooker)
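For the curious, a sketch of the alias method along the lines of Vose's O(n) table construction. The function names are my own, and the input is assumed to be a list of probabilities summing to 1.

```python
import random


def build_alias(probs):
    """Preprocess probabilities into prob/alias tables in O(n)."""
    n = len(probs)
    scaled = [p * n for p in probs]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        under, over = small.pop(), large.pop()
        prob[under] = scaled[under]
        alias[under] = over
        # Column `under` is topped up from `over`, which loses that much mass.
        scaled[over] = (scaled[over] + scaled[under]) - 1.0
        (small if scaled[over] < 1.0 else large).append(over)
    # Whatever is left holds exactly one unit of mass, up to rounding.
    for i in small + large:
        prob[i] = 1.0
    return prob, alias


def roll(prob, alias, rng=random):
    """Simulate one roll in O(1): pick a column, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


# The range sensor from the quote: right, too small, too large.
prob, alias = build_alias([0.95, 0.04, 0.01])
reading = roll(prob, alias)
```

Each of the n columns holds at most two outcomes, which is why a roll needs only one uniform index and one coin flip regardless of n.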
A Guide to Naming Variables
good rules of thumb for variable naming, from ex-coworker Jacob Gabrielson
Elias gamma coding
'used most commonly when coding integers whose upper-bound cannot be determined beforehand.'
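A small sketch of the code itself, using strings of '0' and '1' characters instead of packed bits to keep it readable. The function names are my own.

```python
def gamma_encode(n):
    """Encode a positive integer as a string of '0'/'1' characters."""
    if n < 1:
        raise ValueError("Elias gamma codes only positive integers")
    binary = bin(n)[2:]                      # e.g. 9 -> '1001'
    return "0" * (len(binary) - 1) + binary  # unary length prefix, then the value


def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into a list of integers."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                # count the unary prefix
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out


stream = "".join(gamma_encode(n) for n in [1, 2, 9])   # '1' + '010' + '0001001'
values = gamma_decode(stream)
```

The run of leading zeros tells the decoder how many bits follow, so no upper bound on the values has to be agreed in advance.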
US government commits to publish publicly financed software under Free Software licenses
Wow, this is significant:
At the end of last week, the White House published a draft for a Source Code Policy. The policy requires every public agency to publish their custom-build software as Free Software for other public agencies as well as the general public to use, study, share and improve the software. At the Free Software Foundation Europe (FSFE) we believe that the European Union, and European member states should implement similar policies. Therefore we are interested in your feedback to the US draft.
GitHub now supports "squash on merge"

On the other hand -- is a good explanation of why not to adopt it. Pity GitHub haven't made it a per-review option...
A programming language for E. coli
Mind = blown.
MIT biological engineers have created a programming language that allows them to rapidly design complex, DNA-encoded circuits that give new functions to living cells. Using this language, anyone can write a program for the function they want, such as detecting and responding to certain environmental conditions. They can then generate a DNA sequence that will achieve it.
"It is literally a programming language for bacteria," says Christopher Voigt, an MIT professor of biological engineering. "You use a text-based language, just like you're programming a computer. Then you take that text and you compile it and it turns it into a DNA sequence that you put into the cell, and the circuit runs inside the cell."
These unlucky people have names that break computers
Pat McKenzie's name is too long to fit in Japanese database schemas; Janice Keihanaikukauakahihulihe'ekahaunaele's name was too long for US schemas; and Jennifer Null suffers from the obvious problem
Javascript libraries and tools should bundle their code
If you have a million npm dependencies, distribute them in the dist package; aka. omnibus packages for JS
Uncle Bob on "giving up TDD"
This is a great point, and one I'll be quoting:
Any design that is hard to test is crap. Pure crap. Why? Because if it's hard to test, you aren't going to test it well enough. And if you don't test it well enough, it's not going to work when you need it to work. And if it doesn't work when you need it to work the design is crap.

The Three Go Landmines
'There are three easy to make mistakes in go. I present them here in the way they are often found in the wild, not in the way that is easiest to understand. All three of these mistakes have been made in Kubernetes code, getting past code review at least once each that I know of.'
a static type checker for Javascript, from Facebook
Cat-Herd's Crook
Nice approach from MongoDB:
we’ve recently gained momentum on standardizing our [cross-platform test] drivers. Human-readable, machine-testable specs, coded in YAML, prove which code conforms and which does not. These YAML tests are the Cat-Herd’s Crook: a tool to guide us all in the same direction.
How to do distributed locking
A critique of the "Redlock" locking algorithm from Redis by Martin Kleppmann. antirez responds here: ; summary of followups:
git integrity - Google Groups
It seems git's default behavior in many situations is -- despite communicating objectID by content-addressable hashes which should be sufficient to assure some integrity -- it may not actually bother to *check* them.  Yes, even when receiving objects from other repos.  So, enabling these configuration parameters may "slow down" your git operations.  The return is actually noticing if someone ships you a bogus object.  Everyone should enable these.
The general birthday problem
Good explanation and scipy code for the birthday paradox and hash collisions
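A plain-Python sketch of the calculation (not the linked post's scipy code). The function names are my own; `n` is the number of items and `d` the number of possible values, with `n <= d` assumed.

```python
import math


def collision_probability(n, d):
    """Exact chance that n uniform draws from d values contain a repeat."""
    # log P(no repeat) = sum of log(1 - i/d); working in logs keeps large d stable.
    log_unique = sum(math.log1p(-i / d) for i in range(n))
    return -math.expm1(log_unique)


def approx_collision_probability(n, d):
    """Usual approximation 1 - exp(-n(n-1) / 2d), reasonable when n << d."""
    return -math.expm1(-n * (n - 1) / (2.0 * d))


birthdays = collision_probability(23, 365)           # a little over one half
hashes = approx_collision_probability(77163, 2**32)  # 32-bit hash, ~77k keys
```

The same formula covers hash collisions: treat each key as a person and each possible hash value as a birthday.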
Schema evolution in Avro, Protocol Buffers and Thrift
Good description of this key feature of decent serialization formats
"So you have a mess on your hands" [png]
Excellent flowchart of how to fix common git screwups (via ITC slack)
How open-source software developers helped end the Ebola epidemic in Sierra Leone
Little known to the rest of the world, a team of open source software developers played a small but integral part in helping to stop the spread of Ebola in Sierra Leone, solving a payroll crisis that was hindering the fight against the disease.

Emerson Tan from NetHope, a consortium of NGOs working in IT and development, told the tale at the Chaos Communications Congress in Hamburg, Germany. “These guys basically saved their country from complete collapse. I can’t overestimate how many lives they saved,” he said about his co-presenters, Salton Arthur Massally, Harold Valentine Mac-Saidu and Francis Banguara, who appeared over video link.
Introducing Netty-HTTP from Cask
netty-http library solves [Netty usability issues] by using JAX-RS annotations to build a HTTP path routing layer on top of netty. In addition, the library implements a guava service to manage the HTTP service. netty-http allows users of the library to just focus on writing the business logic in HTTP handlers without having to worry about the complexities of path routing or learning netty pipeline internals to build the HTTP service.

We've written something very similar, although I didn't even bother supporting JAX-RS annotations -- just a simple code-level DSL.
The End of Dynamic Languages
This is my bet: the age of dynamic languages is over. There will be no new successful ones. Indeed we have learned a lot from them. We’ve learned that library code should be extendable by the programmer (mixins and meta-programming), that we want to control the structure (macros), that we disdain verbosity. And above all, we’ve learned that we want our languages to be enjoyable.

But it’s time to move on. We will see a flourishing of languages that feel like you’re writing in a Clojure, but typed. Included will be a suite of powerful tools that we’ve never seen before, tools so convincing that only ascetics will ignore.
CiteSeerX — The Confounding Effect of Class Size on the Validity of Object-oriented Metrics
A lovely cite from @conor. Turns out the sheer size of an OO class is itself a solid fault-proneness metric
Cache-friendly binary search
by reordering items to optimize locality. Via aphyr's dad!
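One well-known reordering of this kind is the breadth-first (Eytzinger) layout, sketched below. I'm assuming that is the sort of layout meant here, and the function names are my own. Python won't show the cache benefit, so treat this as an illustration of the ordering only.

```python
def eytzinger(sorted_items):
    """Rearrange a sorted list into breadth-first (heap) order.

    Slot 0 is unused; the children of slot k live at 2k and 2k + 1.
    """
    n = len(sorted_items)
    out = [None] * (n + 1)
    it = iter(sorted_items)

    def fill(k):
        # An in-order walk of the implicit tree hands out items in sorted order.
        if k <= n:
            fill(2 * k)
            out[k] = next(it)
            fill(2 * k + 1)

    fill(1)
    return out


def contains(layout, key):
    """Search the layout; the first few probes all land in the first few slots."""
    k = 1
    while k < len(layout):
        if layout[k] == key:
            return True
        k = 2 * k + (key > layout[k])   # left child if smaller, right if larger
    return False


layout = eytzinger([1, 2, 3, 4, 5, 6, 7])   # [None, 4, 2, 6, 1, 3, 5, 7]
```

In a plain sorted array the early probes of a binary search are far apart. Here the top levels of the search tree sit next to each other at the front of the array.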
PICO-8 is a fantasy console for making, sharing and playing tiny games and other computer programs. When you turn it on, the machine greets you with a shell for typing in Lua commands and provides simple built-in tools for creating your own cartridges.

So cute! See also Voxatron, something similar for voxel-oriented 3D gaming
GTA V - Graphics Study
how GTAV renders a single frame. this is amazingly detailed
It's an Emulator, Not a Petting Zoo: Emu and Lambda
a Lambda emulator in Python, suitable for unit testing lambdas
Twins denied driver’s permit because DMV can’t tell them apart
"The computer can recognize faces, a feature that comes in handy if somebody’s is trying to get an illegal ID. It apparently is not programmed to detect twins."

As Hilary Mason put it: "You do not want to be an edge case in this future we are building."
How Netty is used at Layer
pretty conventional HTTP/1.1, WebSockets and HTTP/2 front-end services with modern Netty practices
Hologram exposes an imitation of the EC2 instance metadata service on developer workstations that supports the [IAM Roles] temporary credentials workflow. It is accessible via the same HTTP endpoint to calling SDKs, so your code can use the same process in both development and production. The keys that Hologram provisions are temporary, so EC2 access can be centrally controlled without direct administrative access to developer workstations.
Defending Your Time
great post from Ross Duggan on avoiding developer burnout
_What We Know About Spreadsheet Errors_ [paper]
As we will see below, there has long been ample evidence that errors in spreadsheets are pandemic. Spreadsheets, even after careful development, contain errors in one percent or more of all formula cells. In large spreadsheets with thousands of formulas, there will be dozens of undetected errors. Even significant errors may go undetected because formal testing in spreadsheet development is rare and because even serious errors may not be apparent.
How IFTTT develop with Docker
ugh, quite a bit of complexity here
a regex-based, Turing-complete programming language. Its main feature is taking some text via standard input and repeatedly applying regex operations to it (e.g. matching, splitting, and most of all replacing). Under the hood, it uses .NET's regex engine, which means that both the .NET flavour and the ECMAScript flavour are available.

Reminiscent of sed(1); see for an example Retina program
httpbin(1): HTTP Client Testing Service
Testing an HTTP Library can become difficult sometimes. RequestBin is fantastic for testing POST requests, but doesn't let you control the response. This exists to cover all kinds of HTTP scenarios. Additional endpoints are being considered.
The Pixel Factory
amazing slideshow/WebGL demo talking about graphics programming, its maths, and GPUs
You're probably wrong about caching
Excellent cut-out-and-keep guide to why you should avoid adding a caching layer. I've been following this practice for the past few years, after I realised that #6 (recovering from a failed cache is hard) is a killer -- I've seen a few large-scale outages where a production system had gained enough scale that it required a cache to operate, and once that cache was damaged, bringing the system back online required a painful rewarming protocol. Better to design for the non-cached case if possible.
