jm + linkedin   21

LinkedIn called me a white supremacist
Wow. Massive, massive algorithm fail.
On the morning of May 12, LinkedIn, the networking site devoted to making professionals “more productive and successful,” emailed scores of my contacts and told them I’m a professional racist. It was one of those updates that LinkedIn regularly sends its users, algorithmically assembled missives about their connections’ appearances in the media. This one had the innocent-sounding subject, “News About William Johnson,” but once my connections clicked in, they saw a small photo of my grinning face, right above the headline “Trump put white nationalist on list of delegates.” [.....] It turns out that when LinkedIn sends these update emails, people actually read them. So I was getting upset. Not only am I not a Nazi, I’m a Jewish socialist with family members who were imprisoned in concentration camps during World War II. Why was LinkedIn trolling me?
ethics  fail  algorithm  linkedin  big-data  racism  libel 
may 2016 by jm
Open Sourcing Dr. Elephant: Self-Serve Performance Tuning for Hadoop and Spark
[LinkedIn] are proud to announce today that we are open sourcing Dr. Elephant, a powerful tool that helps users of Hadoop and Spark understand, analyze, and improve the performance of their flows.

neat, although I've been bitten too many times by LinkedIn OSS release quality at this point to jump in....
linkedin  oss  hadoop  spark  performance  tuning  ops 
april 2016 by jm
Open-sourcing PalDB, a lightweight companion for storing side data
a new LinkedIn open source data store, for write-once/read-mainly side data, java, Apache licensed.

RocksDB discussion:
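The write-once/read-mainly pattern it targets looks roughly like this (a toy illustration using plain Java serialization, not PalDB's actual API or file format):

```java
import java.io.*;
import java.util.*;

/** Toy illustration of the write-once/read-mainly side-data pattern:
 *  build the store in one pass, then serve reads from an immutable file.
 *  This is NOT PalDB's actual API, just the shape of the workflow. */
public class WriteOnceStore {
    /** One sealed write pass; the file is never modified afterwards. */
    public static void write(File f, Map<String, String> data) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(f))) {
            out.writeObject(new HashMap<>(data));
        }
    }

    /** Read side: an immutable view over the stored data. */
    @SuppressWarnings("unchecked")
    public static Map<String, String> open(File f) throws IOException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(f))) {
            return Collections.unmodifiableMap((Map<String, String>) in.readObject());
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        }
    }
}
```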
linkedin  open-source  storage  side-data  data  config  paldb  java  apache  databases 
october 2015 by jm
Introducing Nurse: Auto-Remediation at LinkedIn
Interesting to hear about auto-remediation in prod -- we built a (very targeted) auto-remediation system in Amazon on the Network Monitoring team, but this is much bigger in focus
nurse  auto-remediation  outages  linkedin  ops  monitoring 
august 2015 by jm
Optimizing Java CMS garbage collections, its difficulties, and using JTune as a solution | LinkedIn Engineering
I like the sound of this -- automated Java CMS GC tuning, kind of like a free version of JClarity's Censum (via Miguel Ángel Pastor)
java  jvm  tuning  gc  cms  linkedin  performance  ops 
april 2015 by jm
Amazing comment from a random sysadmin who's been targeted by the NSA
'Here's a story for you.
I'm not a party to any of this. I've done nothing wrong, I've never been suspected of doing anything wrong, and I don't know anyone who has done anything wrong. I don't even mean that in the sense of "I pissed off the wrong people but technically haven't been charged." I mean that I am a vanilla, average, 9-5 working man of no interest to anybody. My geographical location is an accident of my birth. Even still, I wasn't accidentally born in a high-conflict area, and my government is not at war. I'm a sysadmin at a legitimate ISP and my job is to keep the internet up and running smoothly.
This agency has stalked me in my personal life, undermined my ability to trust my friends attempting to connect with me on LinkedIn, and infected my family's computer. They did this because they wanted to bypass legal channels and spy on a customer who pays for services from my employer. Wait, no, they wanted the ability to potentially spy on future customers. Actually, that is still not accurate - they wanted to spy on everybody in case there was a potentially bad person interacting with a customer.
After seeing their complete disregard for anybody else, their immense resources, and their extremely sophisticated exploits and backdoors - knowing they will stop at nothing, and knowing that I was personally targeted - I'll be damned if I can ever trust any electronic device I own ever again.
You all rationalize this by telling me that it "isn't surprising", and that I don't live in the [USA,UK] and therefore I have no rights.
I just have one question.
Are you people even human?'
nsa  via:ioerror  privacy  spying  surveillance  linkedin  sysadmins  gchq  security 
january 2015 by jm
Felix says:

'Like I said, I'd like to move it to a more general / non-personal repo in the future, but haven't had the time yet. Anyway, you can still browse the code there for now. It is not a big code base so not that hard to wrap one's mind around it.

It is Apache licensed and both Kafka and Voldemort are using it, so I would say it is pretty self-contained (although Kafka has not moved to Tehuti proper, it is essentially the same code they're using, minus a few small fixes that we added).

Tehuti is a bit lower level than CodaHale (i.e.: you need to choose exactly which stats you want to measure and the boundaries of your histograms), but this is the type of stuff you would build a wrapper for and then re-use within your code base. For example: the Voldemort RequestCounter class.'
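To make the "choose your own histogram boundaries" point concrete, here is a toy fixed-boundary histogram in the spirit of what's described (the class and method names are invented, not Tehuti's actual API):

```java
import java.util.concurrent.atomic.AtomicLongArray;

/** Toy fixed-boundary histogram: the caller picks bucket upper bounds up
 *  front, and quantiles are answered to bucket resolution. Invented names,
 *  not Tehuti's actual API. */
public class BoundedHistogram {
    private final double[] bounds;          // upper bound of each bucket
    private final AtomicLongArray counts;   // +1 bucket for overflow values

    public BoundedHistogram(double... upperBounds) {
        this.bounds = upperBounds;
        this.counts = new AtomicLongArray(upperBounds.length + 1);
    }

    /** Record a value in the first bucket whose upper bound covers it. */
    public void record(double value) {
        int i = 0;
        while (i < bounds.length && value > bounds[i]) i++;
        counts.incrementAndGet(i);
    }

    /** Approximate quantile: the upper bound of the bucket holding rank q. */
    public double quantile(double q) {
        long total = 0;
        for (int i = 0; i < counts.length(); i++) total += counts.get(i);
        long rank = (long) Math.ceil(q * total), seen = 0;
        for (int i = 0; i < bounds.length; i++) {
            seen += counts.get(i);
            if (seen >= rank) return bounds[i];
        }
        return Double.POSITIVE_INFINITY;    // landed in the overflow bucket
    }
}
```

The trade-off is exactly the one Felix describes: boundaries chosen badly give coarse answers, but there is no decay logic to mislead you in sparse or spiky workloads.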
asl2  apache  open-source  tehuti  metrics  percentiles  quantiles  statistics  measurement  latency  kafka  voldemort  linkedin 
october 2014 by jm
An embryonic metrics library for Java/Scala from Felix GV at LinkedIn, extracted from Kafka's metric implementation and in the new Voldemort release. It fixes the major known problems with the Meter/Timer implementations in Coda-Hale/Dropwizard/Yammer Metrics.

'Regarding Tehuti: it has been extracted from Kafka's metric implementation. The code was originally written by Jay Kreps, and then maintained and improved by some Kafka and Voldemort devs, so it definitely is not the work of just one person. It is in my repo at the moment but I'd like to put it in a more generally available (git and maven) repo in the future. I just haven't had the time yet...

As for comparing with CodaHale/Yammer, there were a few concerns with it, but the main one was that we didn't like the exponentially decaying histogram implementation. While that implementation is very appealing in terms of (low) memory usage, it has several misleading characteristics (a lack of incoming data points makes old measurements linger longer than they should, and there's also a fairly high possibility of losing interesting outlier data points). This makes the exp decaying implementation robust in high-throughput, fairly constant workloads, but unreliable in sparse or spiky workloads. The Tehuti implementation provides semantics that we find easier to reason about, with a small code footprint (which we consider a plus in terms of maintainability). Of course, it is still a fairly young project, so it could be improved further.'

More background at the kafka-dev thread:
kafka  metrics  dropwizard  java  scala  jvm  timers  ewma  statistics  measurement  latency  sampling  tehuti  voldemort  linkedin  jay-kreps 
october 2014 by jm
Garbage Collection Optimization for High-Throughput and Low-Latency Java Applications
LinkedIn talk about the GC opts they used to optimize the Feed. good detail
performance  optimization  linkedin  java  jvm  gc  tuning 
april 2014 by jm
Home · linkedin/ Wiki
A REST+JSON framework for building robust, scalable service architectures using dynamic discovery and simple asynchronous APIs. It fills a niche for building RESTful service architectures at scale, offering a developer workflow for defining data and REST APIs that promotes uniform interfaces, consistent data modeling, type safety, and compatibility-checked API evolution.

The new underlying comms layer for Voldemort, it seems.
voldemort  d2  linkedin  json  rest  http  api  frameworks  java 
february 2014 by jm
Response to "Optimizing Linux Memory Management..."
A follow up to the LinkedIn VM-tuning blog post at --
Do not read into this article too much, especially for trying to understand how the Linux VM or the kernel works.  The authors misread the "global spinlock on the zone" source code and the interpretation in the article is dead wrong.
linux  tuning  vm  kernel  linkedin  memory  numa 
october 2013 by jm
Voldemort on Solid State Drives [paper]
'This paper and talk was given by the LinkedIn Voldemort Team at the Workshop on Big Data Benchmarking (WBDB May 2012).'

With SSD, we find that garbage collection will become a very significant bottleneck, especially for systems which have little control over the storage layer and rely on Java memory management. Big heapsizes make the cost of garbage collection expensive, especially the single threaded CMS Initial mark. We believe that data systems must revisit their caching strategies with SSDs. In this regard, SSD has provided an efficient solution for handling fragmentation and moving towards predictable multitenancy.
voldemort  storage  ssd  disk  linkedin  big-data  jvm  tuning  ops  gc 
september 2013 by jm
Building a Modern Website for Scale (QCon NY 2013) [slides]
some great scalability ideas from LinkedIn. Particularly interesting are the best practices suggested for scaling web services:

1. store client-call timeouts and SLAs in Zookeeper for each REST endpoint;
2. isolate backend calls using async/threadpools;
3. cancel work on failures;
4. avoid sending requests to GC'ing hosts;
5. rate limits on the server.

#4 is particularly cool. They do this using a "GC scout" request before every "real" request; a cheap TCP request to a dedicated "scout" Netty port, which replies near-instantly. If it comes back with a 1-packet response within 1 millisecond, send the real request, else fail over immediately to the next host in the failover set.

There's still a potential race condition: the "GC scout" probe can succeed quickly, and then a GC can start just before the "real" request is issued. But the incidence of a GC blocking a request is probably massively reduced.

It also helps against packet loss on the rack or server host, since packet loss will cause the drop of one of the TCP packets, and the TCP retransmit timeout will certainly be higher than 1ms, causing the deadline to be missed. (UDP would probably work just as well, for this reason.) However, in the case of packet loss in the client's network vicinity, it will be vital to still attempt to send the request to the final host in the failover set regardless of a GC-scout failure, otherwise all requests may be skipped.

The GC-scout system also helps balance request load off heavily-loaded hosts, or hosts with poor performance for other reasons; they'll fail to achieve their 1 msec deadline and the request will be shunted off elsewhere.
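A minimal version of the scout probe might look like this (invented names; the ~1 ms production deadline is loosened here, and a trivial responder thread stands in for the dedicated Netty scout port):

```java
import java.io.*;
import java.net.*;

/** Sketch of the "GC scout" check: probe a dedicated port under a tight
 *  deadline before sending the real request; on a miss, the caller fails
 *  over to the next host in the failover set. Invented names and ports. */
public class GcScout {
    /** True if the scout port answered one byte within deadlineMs. */
    public static boolean hostLooksResponsive(String host, int port, int deadlineMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), deadlineMs);
            s.setSoTimeout(deadlineMs);
            s.getOutputStream().write(1);             // one-byte ping
            return s.getInputStream().read() != -1;   // one-byte pong in time
        } catch (IOException e) {
            return false;                             // deadline missed: fail over
        }
    }

    /** A trivial scout responder: replies to a ping immediately. */
    public static ServerSocket startScout(int port) throws IOException {
        ServerSocket server = new ServerSocket(port);
        Thread t = new Thread(() -> {
            try (Socket c = server.accept()) {
                c.getInputStream().read();
                c.getOutputStream().write(1);
            } catch (IOException ignored) { }
        });
        t.setDaemon(true);
        t.start();
        return server;
    }
}
```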

For service APIs with real low-latency requirements, this is a great idea.
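Practices 2 and 3 from the list above (isolating backend calls on a dedicated pool, and cancelling work on failure) can be sketched as follows; the names are illustrative, not LinkedIn's actual code:

```java
import java.util.concurrent.*;

/** Isolate a backend call on its own thread pool, bound it by a timeout,
 *  and cancel the in-flight work on any failure. Invented names. */
public class IsolatedCall {
    private static final ExecutorService BACKEND_POOL =
            Executors.newFixedThreadPool(4);

    public static String callWithDeadline(Callable<String> backend,
                                          long timeoutMs,
                                          String fallback) {
        Future<String> f = BACKEND_POOL.submit(backend);
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | InterruptedException | ExecutionException e) {
            f.cancel(true);   // practice 3: don't let doomed work run on
            return fallback;
        }
    }
}
```

The per-endpoint timeout passed in here is exactly the value practice 1 suggests storing centrally (e.g. in Zookeeper) rather than hard-coding.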
gc-scout  gc  java  scaling  scalability  linkedin  qcon  async  threadpools  rest  slas  timeouts  networking  distcomp  netty  tcp  udp  failover  fault-tolerance  packet-loss 
june 2013 by jm
Paper: "Root Cause Detection in a Service-Oriented Architecture" [pdf]
LinkedIn have implemented an automated root-cause detection system:

This paper introduces MonitorRank, an algorithm that can reduce the time, domain knowledge, and human effort required to find the root causes of anomalies in such service-oriented architectures. In the event of an anomaly, MonitorRank provides a ranked order list of possible root causes for monitoring teams to investigate. MonitorRank uses the historical and current time-series metrics of each sensor as its input, along with the call graph generated between sensors to build an unsupervised model for ranking. Experiments on real production outage data from LinkedIn, one of the largest online social networks, shows a 26% to 51% improvement in mean average precision in finding root causes compared to baseline and current state-of-the-art methods.

This is a topic close to my heart after working on something similar for 3 years in Amazon!

Looks interesting, although (a) I would have liked to see more case studies and examples of "real world" outages it helped with; and (b) it's very much a machine-learning paper rather than a systems one, and there is no discussion of fault tolerance in the design of the detection system, which would leave me worried that in the case of a large-scale outage event, the system itself will disappear when its help is most vital. (This was a major design influence on our team's work.)

Overall, particularly given those 2 issues, I suspect it's not in production yet. Ours certainly was ;)
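The correlation ingredient of the approach can be sketched in a few lines (toy code with invented names; MonitorRank itself also uses the call graph and a random-walk model on top of this):

```java
import java.util.*;

/** Toy root-cause ranking: order candidate services by how strongly their
 *  metric correlates with the anomalous frontend metric. */
public class RootCauseRank {
    /** Pearson correlation coefficient of two equal-length series. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0;
        for (int i = 0; i < n; i++) { sx += x[i]; sy += y[i]; }
        double mx = sx / n, my = sy / n, num = 0, dx = 0, dy = 0;
        for (int i = 0; i < n; i++) {
            num += (x[i] - mx) * (y[i] - my);
            dx  += (x[i] - mx) * (x[i] - mx);
            dy  += (y[i] - my) * (y[i] - my);
        }
        return num / Math.sqrt(dx * dy);
    }

    /** Candidates sorted by |correlation| with the frontend metric, descending. */
    public static List<String> rank(double[] frontend, Map<String, double[]> candidates) {
        List<String> names = new ArrayList<>(candidates.keySet());
        names.sort(Comparator.comparingDouble(
                n -> -Math.abs(pearson(frontend, candidates.get(n)))));
        return names;
    }
}
```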
linkedin  soa  root-cause  alarming  correlation  service-metrics  machine-learning  graphs  monitoring 
june 2013 by jm
Hadoop Operations at LinkedIn [slides]
another good Hadoop-at-scale presentation, from LI this time
hadoop  scaling  linkedin  ops 
march 2013 by jm
Announcing the Voldemort 1.3 Open Source Release
new release from LinkedIn -- better p90/p99 PUT performance, improvements to the BDB-JE storage layer, massively-improved rebalance performance
voldemort  linkedin  open-source  bdb  nosql 
march 2013 by jm
Autometrics: Self-service metrics collection
how LinkedIn built a service-metrics collection and graphing infrastructure using Kafka and Zookeeper, writing to RRD files, handling 8.8k metrics per datacenter per second
kafka  zookeeper  linkedin  sysadmin  service-metrics 
february 2012 by jm
Apache Kafka
'Kafka provides a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches, and other user actions) is a key ingredient in many of the social features on the modern web. This data is typically handled by "logging" and ad hoc log aggregation solutions due to the throughput requirements. This kind of ad hoc solution is viable for providing logging data to an offline analysis system like Hadoop, but is very limiting for building real-time processing. Kafka aims to unify offline and online processing by providing a mechanism for parallel load into Hadoop as well as the ability to partition real-time consumption over a cluster of machines.' neat
kafka  linkedin  apache  distributed  messaging  pubsub  queue  incubator  scaling 
february 2012 by jm
Dutch grepping Facebook for welfare fraud
'The [Dutch] councils are working with a specialist Amsterdam research firm, using the type of computer software previously deployed only in counterterrorism, monitoring [LinkedIn, Facebook and Twitter] traffic for keywords and cross-referencing any suspicious information with digital lists of social welfare recipients.

Among the giveaway terms, apparently, are “holiday” and “new car”. If the automated software finds a match between one of these terms and a person claiming social welfare payments, the information is passed on to investigators to gather real-life evidence.' With a 30% false positive rate, apparently -- let's hope those investigations aren't too intrusive!
grep  dutch  holland  via:tjmcintyre  privacy  facebook  twitter  linkedin  welfare  dole  fraud  false-positives  searching 
september 2011 by jm
