LinkedIn called me a white supremacist
may 2016 by jm
Wow. Massive, massive algorithm fail.
ethics
fail
algorithm
linkedin
big-data
racism
libel
On the morning of May 12, LinkedIn, the networking site devoted to making professionals “more productive and successful,” emailed scores of my contacts and told them I’m a professional racist. It was one of those updates that LinkedIn regularly sends its users, algorithmically assembled missives about their connections’ appearances in the media. This one had the innocent-sounding subject, “News About William Johnson,” but once my connections clicked in, they saw a small photo of my grinning face, right above the headline “Trump put white nationalist on list of delegates.” [.....] It turns out that when LinkedIn sends these update emails, people actually read them. So I was getting upset. Not only am I not a Nazi, I’m a Jewish socialist with family members who were imprisoned in concentration camps during World War II. Why was LinkedIn trolling me?
Open Sourcing Dr. Elephant: Self-Serve Performance Tuning for Hadoop and Spark
neat, although I've been bitten too many times by LinkedIn OSS release quality at this point to jump in....
linkedin
oss
hadoop
spark
performance
tuning
ops
april 2016 by jm
[LinkedIn] are proud to announce today that we are open sourcing Dr. Elephant, a powerful tool that helps users of Hadoop and Spark understand, analyze, and improve the performance of their flows.
Open-sourcing PalDB, a lightweight companion for storing side data
october 2015 by jm
a new LinkedIn open-source data store for write-once/read-mainly side data; Java, Apache licensed.
RocksDB discussion: https://www.facebook.com/groups/rocksdb.dev/permalink/834956096602906/
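For flavour, here's roughly the write-once/read-many usage the PalDB README describes -- treat it as a hedged sketch (class names from com.linkedin.paldb.api, as best I recall) rather than gospel:

```java
import java.io.File;
import com.linkedin.paldb.api.PalDB;
import com.linkedin.paldb.api.StoreReader;
import com.linkedin.paldb.api.StoreWriter;

public class PalDbSketch {
    public static void main(String[] args) {
        // Build the side-data store once, offline...
        StoreWriter writer = PalDB.createWriter(new File("store.paldb"));
        writer.put("foo", "bar");
        writer.close();

        // ...then serve cheap reads from the immutable file at runtime.
        StoreReader reader = PalDB.createReader(new File("store.paldb"));
        String value = reader.get("foo");
        System.out.println(value);
        reader.close();
    }
}
```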
linkedin
open-source
storage
side-data
data
config
paldb
java
apache
databases
Introducing Nurse: Auto-Remediation at LinkedIn
august 2015 by jm
Interesting to hear about auto-remediation in prod -- we built a (very targeted) auto-remediation system at Amazon on the Network Monitoring team, but this is much broader in scope.
nurse
auto-remediation
outages
linkedin
ops
monitoring
Optimizing Java CMS garbage collections, its difficulties, and using JTune as a solution | LinkedIn Engineering
april 2015 by jm
I like the sound of this -- automated Java CMS GC tuning, kind of like a free version of JClarity's Censum (via Miguel Ángel Pastor)
java
jvm
tuning
gc
cms
linkedin
performance
ops
Amazing comment from a random sysadmin who's been targeted by the NSA
nsa
via:ioerror
privacy
spying
surveillance
linkedin
sysadmins
gchq
security
january 2015 by jm
'Here's a story for you.
I'm not a party to any of this. I've done nothing wrong, I've never been suspected of doing anything wrong, and I don't know anyone who has done anything wrong. I don't even mean that in the sense of "I pissed off the wrong people but technically haven't been charged." I mean that I am a vanilla, average, 9-5 working man of no interest to anybody. My geographical location is an accident of my birth. Even still, I wasn't accidentally born in a high-conflict area, and my government is not at war. I'm a sysadmin at a legitimate ISP and my job is to keep the internet up and running smoothly.
This agency has stalked me in my personal life, undermined my ability to trust my friends attempting to connect with me on LinkedIn, and infected my family's computer. They did this because they wanted to bypass legal channels and spy on a customer who pays for services from my employer. Wait, no, they wanted the ability to potentially spy on future customers. Actually, that is still not accurate - they wanted to spy on everybody in case there was a potentially bad person interacting with a customer.
After seeing their complete disregard for anybody else, their immense resources, and their extremely sophisticated exploits and backdoors - knowing they will stop at nothing, and knowing that I was personally targeted - I'll be damned if I can ever trust any electronic device I own ever again.
You all rationalize this by telling me that it "isn't surprising", and that I don't live in the [USA,UK] and therefore I have no rights.
I just have one question.
Are you people even human?'
FelixGV/tehuti
october 2014 by jm
Felix says:
'Like I said, I'd like to move it to a more general / non-personal repo in the future, but haven't had the time yet. Anyway, you can still browse the code there for now. It is not a big code base so not that hard to wrap one's mind around it.
It is Apache licensed and both Kafka and Voldemort are using it so I would say it is pretty self-contained (although Kafka has not moved to Tehuti proper, it is essentially the same code they're using, minus a few small fixes missing that we added).
Tehuti is a bit lower level than CodaHale (i.e.: you need to choose exactly which stats you want to measure and the boundaries of your histograms), but this is the type of stuff you would build a wrapper for and then re-use within your code base. For example: the Voldemort RequestCounter class.'
asl2
apache
open-source
tehuti
metrics
percentiles
quantiles
statistics
measurement
latency
kafka
voldemort
linkedin
Tehuti
october 2014 by jm
An embryonic metrics library for Java/Scala from Felix GV at LinkedIn, extracted from Kafka's metrics implementation and used in the new Voldemort release. It fixes the major known problems with the Meter/Timer implementations in Coda Hale/Dropwizard/Yammer Metrics.
'Regarding Tehuti: it has been extracted from Kafka's metric implementation. The code was originally written by Jay Kreps, and then maintained and improved by some Kafka and Voldemort devs, so it definitely is not the work of just one person. It is in my repo at the moment but I'd like to put it in a more generally available (git and maven) repo in the future. I just haven't had the time yet...
As for comparing with CodaHale/Yammer, there were a few concerns with it, but the main one was that we didn't like the exponentially decaying histogram implementation. While that implementation is very appealing in terms of (low) memory usage, it has several misleading characteristics (a lack of incoming data points makes old measurements linger longer than they should, and there's also a fairly high possibility of losing interesting outlier data points). This makes the exp decaying implementation robust in high-throughput, fairly constant workloads, but unreliable in sparse or spiky workloads. The Tehuti implementation provides semantics that we find easier to reason with and with a small code footprint (which we consider a plus in terms of maintainability). Of course, it is still a fairly young project, so it could be improved further.'
More background at the kafka-dev thread: http://mail-archives.apache.org/mod_mbox/kafka-dev/201402.mbox/%3C131A7649-ED57-45CB-B4D6-F34063267664@linkedin.com%3E
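I don't know Tehuti's real API well enough to quote it here, so the toy sketch below (every name invented) just illustrates the style the quote describes: explicit bucket boundaries plus a bounded measurement window, so old samples age out instead of lingering the way an exponentially decaying reservoir's can during a quiet spell.

```java
/**
 * Toy windowed histogram -- NOT Tehuti's API. The caller picks the bucket
 * boundaries explicitly, and all counts are thrown away once the window elapses.
 */
public final class WindowedHistogram {
    private final double[] upperBounds; // bucket upper bounds, ascending
    private final long windowMs;
    private long[] counts;              // one extra bucket for "above the last bound"
    private long windowStart;

    WindowedHistogram(double[] upperBounds, long windowMs) {
        this.upperBounds = upperBounds.clone();
        this.windowMs = windowMs;
        this.counts = new long[upperBounds.length + 1];
        this.windowStart = System.currentTimeMillis();
    }

    synchronized void record(double value) {
        long now = System.currentTimeMillis();
        if (now - windowStart > windowMs) {   // window over: forget stale samples
            counts = new long[counts.length];
            windowStart = now;
        }
        int i = 0;
        while (i < upperBounds.length && value > upperBounds[i]) i++;
        counts[i]++;
    }

    /** Upper bound of the bucket containing the q-th quantile of the current window. */
    synchronized double quantileUpperBound(double q) {
        long total = 0;
        for (long c : counts) total += c;
        if (total == 0) return Double.NaN;    // nothing measured in this window
        long target = (long) Math.ceil(q * total);
        long seen = 0;
        for (int i = 0; i < counts.length; i++) {
            seen += counts[i];
            if (seen >= target) {
                return i < upperBounds.length ? upperBounds[i] : Double.POSITIVE_INFINITY;
            }
        }
        return Double.POSITIVE_INFINITY;
    }
}
```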
kafka
metrics
dropwizard
java
scala
jvm
timers
ewma
statistics
measurement
latency
sampling
tehuti
voldemort
linkedin
jay-kreps
Garbage Collection Optimization for High-Throughput and Low-Latency Java Applications
april 2014 by jm
LinkedIn talk about the GC opts they used to optimize the Feed. good detail
performance
optimization
linkedin
java
jvm
gc
tuning
Home · linkedin/rest.li Wiki
The new underlying comms layer for Voldemort, it seems.
voldemort
d2
rest.li
linkedin
json
rest
http
api
frameworks
java
february 2014 by jm
Rest.li is a REST+JSON framework for building robust, scalable service architectures using dynamic discovery and simple asynchronous APIs. Rest.li fills a niche for building RESTful service architectures at scale, offering a developer workflow for defining data and REST APIs that promotes uniform interfaces, consistent data modeling, type-safety, and compatibility checked API evolution.
The Log: What every software engineer should know about real-time data's unifying abstraction | LinkedIn Engineering
december 2013 by jm
Fantastic long-form blog post by Jay Kreps on this key concept. great stuff
coding
databases
log
network
kafka
jay-kreps
linkedin
architecture
storage
Response to "Optimizing Linux Memory Management..."
october 2013 by jm
A follow-up to the LinkedIn VM-tuning blog post at http://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases --
linux
tuning
vm
kernel
linkedin
memory
numa
Do not read in to this article too much, especially for trying to understand how the Linux VM or the kernel works. The authors misread the "global spinlock on the zone" source code and the interpretation in the article is dead wrong.
Voldemort on Solid State Drives [paper]
september 2013 by jm
'This paper and talk was given by the LinkedIn Voldemort Team at the Workshop on Big Data Benchmarking (WBDB May 2012).'
voldemort
storage
ssd
disk
linkedin
big-data
jvm
tuning
ops
gc
With SSD, we find that garbage collection will become a very significant bottleneck, especially for systems which have little control over the storage layer and rely on Java memory management. Big heapsizes make the cost of garbage collection expensive, especially the single threaded CMS Initial mark. We believe that data systems must revisit their caching strategies with SSDs. In this regard, SSD has provided an efficient solution for handling fragmentation and moving towards predictable multitenancy.
Using set cover algorithm to optimize query latency for a large scale distributed graph | LinkedIn Engineering
august 2013 by jm
how LI solved a tricky graph-database-query latency problem with a set-cover algorithm
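The post's details are LinkedIn's, but the greedy approximation normally used for set cover is simple enough to sketch. Hypothetical example: pick the fewest replica hosts that together hold every partition a query needs (all names invented):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public final class ReplicaChooser {
    /** Greedy set cover: repeatedly take the replica covering the most still-needed partitions. */
    static List<String> choose(Set<Integer> neededPartitions,
                               Map<String, Set<Integer>> partitionsByReplica) {
        Set<Integer> uncovered = new HashSet<>(neededPartitions);
        List<String> chosen = new ArrayList<>();
        while (!uncovered.isEmpty()) {
            String best = null;
            int bestGain = 0;
            for (Map.Entry<String, Set<Integer>> e : partitionsByReplica.entrySet()) {
                Set<Integer> gain = new HashSet<>(e.getValue());
                gain.retainAll(uncovered);
                if (gain.size() > bestGain) {
                    bestGain = gain.size();
                    best = e.getKey();
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("some partitions are not held by any replica");
            }
            uncovered.removeAll(partitionsByReplica.get(best));
            chosen.add(best);  // each chosen host is one fan-out call
        }
        return chosen;
    }
}
```

Fewer hosts chosen means fewer fan-out calls, which is where the latency win comes from.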
linkedin
algorithms
coding
distributed-systems
graph
databases
querying
set-cover
set
replication
Building a Modern Website for Scale (QCon NY 2013) [slides]
june 2013 by jm
some great scalability ideas from LinkedIn. Particularly interesting are the best practices suggested for scaling web services:
1. store client-call timeouts and SLAs in Zookeeper for each REST endpoint;
2. isolate backend calls using async/threadpools;
3. cancel work on failures;
4. avoid sending requests to GC'ing hosts;
5. rate limits on the server.
#4 is particularly cool. They do this using a "GC scout" request before every "real" request; a cheap TCP request to a dedicated "scout" Netty port, which replies near-instantly. If it comes back with a 1-packet response within 1 millisecond, send the real request, else fail over immediately to the next host in the failover set.
There's still a potential race condition: the GC-scout probe can come back quickly, and then a GC can start just before the "real" request is issued. But the incidence of requests blocked by a GC pause should be massively reduced.
It also helps against packet loss on the rack or server host, since losing either TCP packet means the retransmit timeout -- certainly higher than 1 ms -- will blow the deadline. (UDP would probably work just as well, for this reason.) However, when the packet loss is in the client's own network vicinity, it is vital to still attempt the request against the final host in the failover set regardless of a GC-scout failure, otherwise the request might never be sent at all.
The GC-scout system also helps balance request load off heavily-loaded hosts, or hosts with poor performance for other reasons; they'll fail to achieve their 1 msec deadline and the request will be shunted off elsewhere.
For service APIs with real low-latency requirements, this is a great idea.
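A rough, hypothetical sketch of that scout-then-send logic as I read it (not LinkedIn's code; a real client would presumably hold persistent connections and enforce a single end-to-end deadline rather than opening a socket per probe):

```java
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

public final class GcScoutClient {
    private static final int SCOUT_TIMEOUT_MS = 1; // the 1 ms deadline from the talk

    /**
     * Returns the first host in the failover set whose scout port answers in time.
     * The final host is always used regardless, so client-side packet loss can't
     * cause the request to be skipped entirely (as noted above).
     */
    static String pickHealthyHost(List<String> failoverSet, int scoutPort) {
        for (int i = 0; i < failoverSet.size(); i++) {
            boolean lastHost = (i == failoverSet.size() - 1);
            String host = failoverSet.get(i);
            if (lastHost || scoutResponds(host, scoutPort)) {
                return host;
            }
        }
        throw new IllegalStateException("empty failover set");
    }

    private static boolean scoutResponds(String host, int scoutPort) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, scoutPort), SCOUT_TIMEOUT_MS);
            s.setSoTimeout(SCOUT_TIMEOUT_MS);
            s.getOutputStream().write(1);            // tiny probe payload
            return s.getInputStream().read() != -1;  // any reply in time counts as healthy
        } catch (Exception e) {
            return false; // timeout, refused connection, GC pause, packet loss: skip this host
        }
    }
}
```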
gc-scout
gc
java
scaling
scalability
linkedin
qcon
async
threadpools
rest
slas
timeouts
networking
distcomp
netty
tcp
udp
failover
fault-tolerance
packet-loss
Paper: "Root Cause Detection in a Service-Oriented Architecture" [pdf]
june 2013 by jm
LinkedIn have implemented an automated root-cause detection system:
This is a topic close to my heart after working on something similar for 3 years in Amazon!
Looks interesting, although (a) I would have liked to see more case studies and examples of "real world" outages it helped with; and (b) it's very much a machine-learning paper rather than a systems one, and there is no discussion of fault tolerance in the design of the detection system, which would leave me worried that in the case of a large-scale outage event, the system itself will disappear when its help is most vital. (This was a major design influence on our team's work.)
Overall, particularly given those 2 issues, I suspect it's not in production yet. Ours certainly was ;)
linkedin
soa
root-cause
alarming
correlation
service-metrics
machine-learning
graphs
monitoring
This paper introduces MonitorRank, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures. In the event of an anomaly, MonitorRank provides a ranked order list of possible root causes for monitoring teams to investigate. MonitorRank uses the historical and current time-series metrics of each sensor as its input, along with the call graph generated between sensors to build an unsupervised model for ranking. Experiments on real production outage data from LinkedIn, one of the largest online social networks, shows a 26% to 51% improvement in mean average precision in finding root causes compared to baseline and current state-of-the-art methods.
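MonitorRank itself builds an unsupervised model over the call graph, which is well beyond a snippet; purely as a toy illustration of the basic ingredient, here's a ranker that orders a frontend's call-graph neighbours by how strongly their metric time series track the anomalous frontend metric (all names invented, and this is not the paper's algorithm):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public final class RootCauseRanker {
    /** Rank candidate upstream sensors, most correlated with the anomaly first. */
    static List<String> rank(double[] frontendMetric, Map<String, double[]> upstreamMetrics) {
        List<String> candidates = new ArrayList<>(upstreamMetrics.keySet());
        candidates.sort(Comparator.comparingDouble(
                (String s) -> pearson(frontendMetric, upstreamMetrics.get(s))).reversed());
        return candidates;
    }

    /** Plain Pearson correlation over the shared prefix of the two series. */
    static double pearson(double[] a, double[] b) {
        int n = Math.min(a.length, b.length);
        if (n == 0) return 0;
        double meanA = 0, meanB = 0;
        for (int i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
        meanA /= n;
        meanB /= n;
        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < n; i++) {
            cov += (a[i] - meanA) * (b[i] - meanB);
            varA += (a[i] - meanA) * (a[i] - meanA);
            varB += (b[i] - meanB) * (b[i] - meanB);
        }
        return (varA == 0 || varB == 0) ? 0 : cov / Math.sqrt(varA * varB);
    }
}
```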
Hadoop Operations at LinkedIn [slides]
march 2013 by jm
another good Hadoop-at-scale presentation, from LI this time
hadoop
scaling
linkedin
ops
Announcing the Voldemort 1.3 Open Source Release
march 2013 by jm
new release from LinkedIn -- better p90/p99 PUT performance, improvements to the BDB-JE storage layer, massively-improved rebalance performance
voldemort
linkedin
open-source
bdb
nosql
Autometrics: Self-service metrics collection
february 2012 by jm
how LinkedIn built a service-metrics collection and graphing infrastructure using Kafka and Zookeeper, writing to RRD files, handling 8.8k metrics per datacenter per second
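As a hypothetical sketch of that pipeline (invented topic name and message format, and it leans on today's Kafka Java client plus an rrdtool binary on the path, so very much illustrative rather than what LinkedIn ran):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MetricsToRrd {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
        props.put("group.id", "autometrics-sketch");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("service-metrics"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // assumed message format: "<metric-name> <value>", e.g. "requests.p99 42.5"
                    String[] parts = r.value().split(" ");
                    String rrdFile = "/var/rrd/" + parts[0] + ".rrd";  // RRD must already exist
                    new ProcessBuilder("rrdtool", "update", rrdFile, "N:" + parts[1])
                            .inheritIO().start().waitFor();
                }
            }
        }
    }
}
```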
kafka
zookeeper
linkedin
sysadmin
service-metrics
Apache Kafka
february 2012 by jm
'Kafka provides a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches, and other user actions) is a key ingredient in many of the social features on the modern web. This data is typically handled by "logging" and ad hoc log aggregation solutions due to the throughput requirements. This kind of ad hoc solution is a viable solution for providing logging data to an offline analysis system like Hadoop, but is very limiting for building real-time processing. Kafka aims to unify offline and online processing by providing a mechanism for parallel load into Hadoop as well as the ability to partition real-time consumption over a cluster of machines.' neat
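To make the activity-stream idea concrete, here's a minimal publisher for one page-view event. Note it uses the current org.apache.kafka.clients Java API, which post-dates this 2012-era bookmark, and the topic/broker names are made up:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public final class PageViewPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumption: local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both the offline Hadoop load and any real-time consumers read the same
            // partitioned "page-views" topic, each at their own pace.
            producer.send(new ProducerRecord<>("page-views", "member-123", "viewed /jobs"));
        }
    }
}
```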
kafka
linkedin
apache
distributed
messaging
pubsub
queue
incubator
scaling
Dutch grepping Facebook for welfare fraud
september 2011 by jm
'The [Dutch] councils are working with a specialist Amsterdam research firm, using the type of computer software previously deployed only in counterterrorism, monitoring [LinkedIn, Facebook and Twitter] traffic for keywords and cross-referencing any suspicious information with digital lists of social welfare recipients.
Among the giveaway terms, apparently, are “holiday” and “new car”. If the automated software finds a match between one of these terms and a person claiming social welfare payments, the information is passed on to investigators to gather real-life evidence.' With a 30% false positive rate, apparently -- let's hope those investigations aren't too intrusive!
grep
dutch
holland
via:tjmcintyre
privacy
facebook
twitter
linkedin
welfare
dole
fraud
false-positives
searching