jm + netflix   47

Developer Experience Lessons Operating a Serverless-like Platform at Netflix
Very interesting writeup on Netflix's experience operating a serverless-like scripting system; they offer scriptability in their backend and devs use it heavily to provide features. Lots of reinventing the wheel on packaging, deployment, versioning, and test/staging infrastructure
serverless  dependencies  packaging  deployment  versioning  devex  netflix  developer-experience  dev  testing  staging  scripting 
9 weeks ago by jm
Towards true continuous integration – Netflix TechBlog – Medium
Netflix discuss how they handle the eternal dependency-management problem which arises with lots of microservices:
Using the monorepo as our requirements specification, we began exploring alternative approaches to achieving the same benefits. What are the core problems that a monorepo approach strives to solve? Can we develop a solution that works within the confines of a traditional binary integration world, where code is shared? Our approach, while still experimental, can be distilled into three key features:

Publisher feedback — provide the owner of shared code fast feedback as to which of their consumers they just broke, both direct and transitive. Also, allow teams to block releases based on downstream breakages. Currently, our engineering culture puts sole responsibility on consumers to resolve these issues. By giving library owners feedback on the impact they have to the rest of Netflix, we expect them to take on additional responsibility.

Managed source — provide consumers with a means to safely increment library versions automatically as new versions are released. Since we are already testing each new library release against all downstreams, why not bump consumer versions and accelerate version adoption, safely.

Distributed refactoring — provide owners of shared code a means to quickly find and globally refactor consumers of their API. We have started by issuing pull requests en masse to all Git repositories containing a consumer of a particular Java API. We’ve run some early experiments and expect to invest more in this area going forward.


What I find interesting is that Amazon dealt effectively with the first two many years ago, in the form of their "Brazil" build system, and Google do the latter (with Refaster?). It would be amazing to see such a system released as open source, but maybe it's just too heavyweight for anyone other than a giant software company on the scale of a Google, Netflix or Amazon.
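The "publisher feedback" item quoted above is, at its core, a walk over a reverse dependency graph. A minimal sketch of that idea, assuming a made-up dependency map (the project names and the `affected_consumers` helper are hypothetical, not anything from the Netflix post):

```python
from collections import deque


def affected_consumers(depends_on, changed):
    """Return {consumer: distance} for every direct and transitive
    consumer of `changed`. Distance 1 means a direct consumer."""
    # Invert "project -> its dependencies" into "library -> its consumers".
    consumers = {}
    for project, deps in depends_on.items():
        for dep in deps:
            consumers.setdefault(dep, set()).add(project)

    # Breadth-first walk outward from the changed library.
    distance = {}
    queue = deque([(changed, 0)])
    while queue:
        lib, depth = queue.popleft()
        for consumer in sorted(consumers.get(lib, ())):
            if consumer not in distance:
                distance[consumer] = depth + 1
                queue.append((consumer, depth + 1))
    return distance


# Hypothetical graph: project -> libraries it depends on.
DEPENDS_ON = {
    "playback-api": ["metrics-lib", "auth-client"],
    "auth-client": ["http-core"],
    "metrics-lib": ["http-core"],
    "billing-service": ["auth-client"],
}

impact = affected_consumers(DEPENDS_ON, "http-core")
```

A real system would then build and test each project in `impact` against the new library version; that part is out of scope here.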
brazil  amazon  build  microservices  dependencies  coding  monorepo  netflix  google  refaster 
may 2017 by jm
Introducing Winston
'Event driven Diagnostic and Remediation Platform' -- aka 'runbooks as code'
runbooks  winston  netflix  remediation  outages  mttr  ops  devops 
august 2016 by jm
Netflix Global Search
handy -- search Netflix in all regions, then show where the show/movie is available. Probably less handy now that Netflix is blocking region-spoofing
movies  video  netflix  films  tv  world 
january 2016 by jm
Global Continuous Delivery with Spinnaker
Netflix' CD platform, the successor to Asgard. Looks interesting
continuous-delivery  aws  netflix  cd  devops  ops  atlas  spinnaker 
november 2015 by jm
Chaos Engineering Upgraded
some details on Netflix's Chaos Monkey, Chaos Kong and other aspects of their availability/failover testing
architecture  aws  netflix  ops  chaos-monkey  chaos-kong  testing  availability  failover  ha 
september 2015 by jm
Streaming will soon pass traditional TV - Tech Insider
the percentage of people who say they stream video from services like Netflix, YouTube, and Hulu each day has increased dramatically over the last five years, from about 30% in 2010 to more than 50% this year. During the same period, the percentage of people who say they watch traditional TV [...] has dropped by about 10%. When the beige line surpasses the purple line [looks like 2016], it will mean that more people are streaming each day than are watching traditional TV. 
streaming  hulu  netflix  tv  television  video  youtube 
september 2015 by jm
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Extremely authoritative slide deck on building a recommendation system, from Xavier Amatriain, Research/Engineering Manager at Netflix
netflix  recommendations  recommenders  ml  machine-learning  cmu  clustering  algorithms 
august 2015 by jm
The Netflix Test Video
Netflix' official test video -- contains various scenarios which exercise frequent tricky edge cases in video compression and playback; A/V sync, shades of black, running water, etc.
networking  netflix  streaming  video  compression  tests 
august 2015 by jm
Outlier Detection at Netflix | Hacker News
Excellent HN thread re automated anomaly detection in production, Q&A with the dev team
machine-learning  ml  remediation  anomaly-detection  netflix  ops  time-series  clustering 
july 2015 by jm
Introducing Vector: Netflix's On-Host Performance Monitoring Tool
It gives pinpoint real-time performance metric visibility to engineers working on specific hosts -- basically sending back system-level performance data to their browser, where a client-side renderer turns it into a usable dashboard. Essentially the idea is to replace having to ssh onto instances and run "top", sysstat, iostat, and so on.
vector  netflix  performance  monitoring  sysstat  top  iostat  netstat  metrics  ops  dashboards  real-time  linux 
april 2015 by jm
Can Spark Streaming survive Chaos Monkey?
good empirical results on Spark's resilience to network/host outages in EC2
ec2  aws  emr  spark  resilience  ha  fault-tolerance  chaos-monkey  netflix 
march 2015 by jm
how Curator fixed issues with the Hive ZooKeeper Lock Manager Implementation
Ugh, ZK is a bear to work with.
Apache Curator is open source software which is able to handle all of the above scenarios transparently. Curator is a Netflix ZooKeeper Library and it provides a high-level API, CuratorFramework, that simplifies using ZooKeeper. By using a singleton CuratorFramework instance in the new ZooKeeperHiveLockManager implementation, we not only fixed the ZooKeeper connection issues, but also made the code easy to understand and maintain.  
zookeeper  apis  curator  netflix  distributed-locks  coding  hive 
february 2015 by jm
Performance Co-Pilot
System performance metrics framework, plugged by Netflix, open-source for ages
open-source  pcp  performance  system  metrics  ops  red-hat  netflix 
february 2015 by jm
Mantis: Netflix's Event Stream Processing System
Rx/reactive in style, autoscaling, support for queue/broker-based strong consistency as well as TCP-based lossy delivery
netflix  rx  reactive  autoscaling  mantis  stream-processing 
january 2015 by jm
Introducing Atlas: Netflix's Primary Telemetry Platform
This sounds really excellent -- the dimensionality problem it deals with is a familiar one, particularly with red/black deployments, autoscaling, and so on creating trees of metrics when new transient servers appear and disappear. Looking forward to Netflix open sourcing enough to make it usable for outsiders
netflix  metrics  service-metrics  atlas  telemetry  ops 
december 2014 by jm
Netflix release new code to production before completing tests
Interesting -- I hadn't heard of this being an official practice anywhere before (although we actually did it ourselves this week)...
If a build has made it [past the 'integration test' phase], it is ready to be deployed to one or more internal environments for user-acceptance testing. Users could be UI developers implementing a new feature using the API, UI Testers performing end-to-end testing or automated UI regression tests. As far as possible, we strive to not have user-acceptance tests be a gating factor for our deployments. We do this by wrapping functionality in Feature Flags so that it is turned off in Production while testing is happening in other environments. 
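For illustration, the feature-flag wrapping described in the quote might look roughly like this. It's only a sketch: the `FeatureFlags` class, the flag name and the environment names are all invented here, and a real setup would read flags from a config service rather than a dict.

```python
class FeatureFlags:
    """Per-environment on/off switches (hypothetical, in-memory)."""

    def __init__(self, flags_by_env):
        self._flags = flags_by_env

    def is_enabled(self, name, env):
        # Unknown flags and unknown environments default to "off",
        # so untested code stays dark unless explicitly enabled.
        return self._flags.get(env, {}).get(name, False)


FLAGS = FeatureFlags({
    "test": {"new-search-ui": True},
    "production": {"new-search-ui": False},
})


def search(query, env, flags=FLAGS):
    # The new code path ships to production, but only runs where the flag is on.
    if flags.is_enabled("new-search-ui", env):
        return f"new:{query}"
    return f"old:{query}"
```

The point is that the same build goes everywhere; only the flag state differs between environments, so UAT doesn't have to gate the deploy.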
devops  deployment  feature-flags  release  testing  integration-tests  uat  qa  production  ops  gating  netflix 
october 2014 by jm
Inviso: Visualizing Hadoop Performance
With the increasing size and complexity of Hadoop deployments, being able to locate and understand performance is key to running an efficient platform.  Inviso provides a convenient view of the inner workings of jobs and platform.  By simply overlaying a new view on existing infrastructure, Inviso can operate inside any Hadoop environment with a small footprint and provide easy access and insight.  


This sounds pretty useful.
inviso  netflix  hadoop  emr  performance  ops  tools 
september 2014 by jm
The "sidecar" pattern
Ha, great name. We use this (in the form of Smartstack).
For what it is worth, we faced a similar challenge in earlier services (mostly due to existing C/C++ applications) and we created what was called a "sidecar".  By sidecar, what I mean is a second process on each node/instance that did Cloud Service Fabric operations on behalf of the main process (the side-managed process).  Unfortunately those sidecars all went off and created one-offs for their particular service.  In this post, I'll describe a more general sidecar that doesn't force users to have these one-offs.

Sidenote:  For those not familiar with sidecars, think of the motorcycle sidecar below.  Snoopy would be the main process with Woodstock being the sidecar process.  The main work on the instance would be the motorcycle (say serving your users' REST requests).  The operational control is the sidecar (say serving health checks and management plane requests of the operational platform).
netflix  sidecars  architecture  patterns  smartstack  netflixoss  microservices  soa 
august 2014 by jm
Netflix/ribbon
a client side IPC library that is battle-tested in cloud. It provides the following features:

Load balancing;
Fault tolerance;
Multiple protocol (HTTP, TCP, UDP) support in an asynchronous and reactive model;
Caching and batching.

I like the integration of Eureka and Hystrix in particular, although I would really like to read more about Eureka's approach to availability during network partitions and CAP.

https://groups.google.com/d/msg/eureka_netflix/LXKWoD14RFY/-5nElGl1OQ0J has some interesting discussion on the topic. It actually sounds like the Eureka approach is more correct than using ZK: 'Eureka is available. ZooKeeper, while tolerant against single node failures, doesn't react well to long partitioning events. For us, it's vastly more important that we maintain an available registry than a necessary consistent registry. If us-east-1d sees 23 nodes, and us-east-1c sees 22 nodes for a little bit, that's OK with us.'

See also http://ispyker.blogspot.ie/2013/12/zookeeper-as-cloud-native-service.html which corroborates this:

I went into one of the instances and quickly did an iptables DROP on all packets coming from the other two instances. This would simulate an availability zone continuing to function, but that zone losing network connectivity to the other availability zones. What I saw was that the two other instances noticed that the first server “going away”, but they continued to function as they still saw a majority (66%). More interestingly the first instance noticed the other two servers “going away” dropping the ensemble availability to 33%. This caused the first server to stop serving requests to clients (not only writes, but also reads). [...]

To me this seems like a concern, as network partitions should be considered an event that should be survived. In this case (with this specific configuration of zookeeper) no new clients in that availability zone would be able to register themselves with consumers within the same availability zone. Adding more zookeeper instances to the ensemble wouldn’t help considering a balanced deployment as in this case the availability would always be majority (66%) and non-majority (33%).
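The majority arithmetic behind that last quote is easy to check. A quick sketch, assuming the usual strict-majority quorum rule and an ensemble spread evenly across three zones (the helper names are mine):

```python
def has_quorum(visible, ensemble_size):
    """A node keeps serving only if it can see a strict majority
    of the ensemble (counting itself)."""
    return visible > ensemble_size // 2


def one_zone_isolated(ensemble_size, zones=3):
    """Balanced deployment, one zone cut off from the other zones.
    Returns (isolated side has quorum, remaining side has quorum)."""
    per_zone = ensemble_size // zones
    isolated = has_quorum(per_zone, ensemble_size)
    remaining = has_quorum(ensemble_size - per_zone, ensemble_size)
    return isolated, remaining


# Growing the ensemble doesn't help the isolated zone: 1/3 is never a majority.
results = {size: one_zone_isolated(size) for size in (3, 6, 9)}
```

In every case the isolated zone loses quorum while the other two zones keep it, which matches the 33%/66% split described above.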
netflix  ribbon  availability  libraries  java  hystrix  eureka  aws  ec2  load-balancing  networking  http  tcp  architecture  clients  ipc 
july 2014 by jm
Netflix comes out strongly against Comcast
In sum, Comcast is not charging Netflix for transit service. It is charging Netflix for access to its subscribers. Comcast also charges its subscribers for access to Internet content providers like Netflix. In this way, Comcast is double dipping by getting both its subscribers and Internet content providers to pay for access to each other.


FIGHT!
netflix  comcast  network-neutrality  cartels  competition  us-politics  business  isps 
april 2014 by jm
Internet Tolls And The Case For Strong Net Neutrality
Netflix CEO Reed Hastings blogs about the need for Net Neutrality:
Interestingly, there is one special case where no-fee interconnection is embraced by the big ISPs -- when they are connecting among themselves. They argue this is because roughly the same amount of data comes and goes between their networks. But when we ask them if we too would qualify for no-fee interconnect if we changed our service to upload as much data as we download** -- thus filling their upstream networks and nearly doubling our total traffic -- there is an uncomfortable silence. That's because the ISP argument isn't sensible. Big ISPs aren't paying money to services like online backup that generate more upstream than downstream traffic. Data direction, in other words, has nothing to do with costs. ISPs around the world are investing in high-speed Internet and most already practice strong net neutrality. With strong net neutrality, new services requiring high-speed Internet can emerge and become popular, spurring even more demand for the lucrative high-speed packages ISPs offer. With strong net neutrality, everyone avoids the kind of brinkmanship over blackouts that plague the cable industry and harms consumers. As the Wall Street Journal chart shows, we're already getting to the brownout stage. Consumers deserve better.
consumer  net-neutrality  comcast  netflix  protectionism  cartels  isps  us  congestion  capacity 
march 2014 by jm
The Netflix Dynamic Scripting Platform
At the core of the redesign is a Dynamic Scripting Platform which provides us the ability to inject code into a running Java application at any time. This means we can alter the behavior of the application without a full scale deployment. As you can imagine, this powerful capability is useful in many scenarios. The API Server is one use case, and in this post, we describe how we use this platform to support a distributed development model at Netflix.


Holy crap.
scripting  dynamic-languages  groovy  java  server-side  architecture  netflix 
march 2014 by jm
Netflix packets being dropped every day because Verizon wants more money | Ars Technica
With Cogent and Verizon fighting, [peering capacity] upgrades are happening at a glacial pace, according to Schaeffer.

"Once a port hits about 85 percent throughput, you're going to begin to start to drop packets," he said. "Clearly when a port is at 120 or 130 percent [as the Cogent/Verizon ones are] the packet loss is material."

The congestion isn't only happening at peak times, he said. "These ports are so over-congested that they're running in this packet dropping state 22, 24 hours a day. Maybe at four in the morning on Tuesday or something there might be a little bit of headroom," he said.
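As back-of-envelope arithmetic on those percentages: if demand on a port really is 120% of its capacity, then at least 1 - 1/1.2 of the offered traffic can't get through. A tiny sketch of that lower bound, under the simplifying assumption of steady load with no buffering or TCP backoff (so it says nothing about why drops start around 85%, which is down to burstiness):

```python
def min_drop_fraction(offered_load):
    """Lower bound on the fraction of traffic dropped, where
    `offered_load` is demand as a fraction of port capacity (1.2 = 120%)."""
    if offered_load <= 1.0:
        return 0.0  # steady-state model: everything fits
    return 1.0 - 1.0 / offered_load


at_120 = min_drop_fraction(1.2)
at_130 = min_drop_fraction(1.3)
```

That works out to roughly a sixth of the traffic at 120% and a bit under a quarter at 130%.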
packet-loss  networking  internet  cogent  netflix  verizon  peering 
february 2014 by jm
Comcast’s deal with Netflix makes network neutrality obsolete
in a world where Netflix and Yahoo connect directly to residential ISPs, every Internet company will have its own separate pipe. And policing whether different pipes are equally good is a much harder problem than requiring that all of the traffic in a single pipe be treated the same. If it wanted to ensure a level playing field, the FCC would be forced to become intimately involved in interconnection disputes, overseeing who Verizon interconnects with, how fast the connections are and how much they can charge to do it.
verizon  comcast  internet  peering  networking  netflix  network-neutrality 
february 2014 by jm
Apache Curator
Netflix open-source library to make using ZooKeeper from Java less of a PITA. I really wish I'd used this now, having reimplemented some key parts of it after failures in prod ;)
zookeeper  netflix  apache  curator  java  libraries  open-source 
january 2014 by jm
Yammer Engineering - Resiliency at Yammer
Not content with adding Hystrix (circuit breakers, threadpooling, request time limiting, metrics, etc.) to their entire SOA stack, they've made it incredibly configurable by hooking in a web-based configuration UI, allowing dynamic on-the-fly reconfiguration by their ops guys of the circuit breakers and threadpools in production. Mad stuff
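For anyone who hasn't met the pattern, here's a bare-bones circuit breaker. To be clear, this is a generic sketch of the technique and not Hystrix's API; the class, its parameters and the injectable `clock` are all my own invention.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        # Plain attributes, so they can be changed on a live instance.
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Because `failure_threshold` and `reset_timeout` are ordinary attributes, something like a config UI could adjust them at runtime, which is the kind of on-the-fly tuning the post describes.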
hystrix  circuit-breakers  resiliency  yammer  ops  threadpools  soa  dynamic-configuration  archaius  netflix 
january 2014 by jm
Netflix: Your Linux AMI: optimization and performance [slides]
a fantastic bunch of low-level kernel tweaks and tunables which Netflix have found useful in production to maximise productivity of their fleet. Interesting use of SCHED_BATCH process scheduler class for batch processes, in particular. Also, great docs on their experience with perf and SystemTap. Perf really looks like a tool I need to get to grips with...
netflix  aws  tuning  ami  perf  systemtap  tunables  sched_batch  batch  hadoop  optimization  performance 
december 2013 by jm
Scryer: Netflix’s Predictive Auto Scaling Engine
Scryer is a new system that allows us to provision the right number of AWS instances needed to handle the traffic of our customers. But Scryer is different from Amazon Auto Scaling (AAS), which reacts to real-time metrics and adjusts instance counts accordingly. Rather, Scryer predicts what the needs will be prior to the time of need and provisions the instances based on those predictions.
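The reactive/predictive distinction can be shown with a toy model. The quote doesn't say how Scryer actually forecasts, so the predictor below (average of the same hour on previous days) is purely a stand-in, and every name and number is hypothetical.

```python
import math


def instances_needed(requests_per_sec, capacity_per_instance, headroom=1.2):
    """Instances required to serve the load with some spare headroom."""
    return max(1, math.ceil(requests_per_sec * headroom / capacity_per_instance))


def predict_load(history, hour):
    """Stand-in forecast: mean load seen at `hour` on previous days.
    `history` is a list of days, each a list of 24 hourly loads."""
    samples = [day[hour] for day in history]
    return sum(samples) / len(samples)


def reactive_plan(current_rps, capacity_per_instance):
    # Reactive: size the fleet for what is happening right now.
    return instances_needed(current_rps, capacity_per_instance)


def predictive_plan(history, current_hour, lead_hours, capacity_per_instance):
    # Predictive: size the fleet for the load expected `lead_hours` from now.
    target_hour = (current_hour + lead_hours) % 24
    return instances_needed(predict_load(history, target_hour), capacity_per_instance)
```

With an evening peak in the history, the predictive plan asks for the bigger fleet before the traffic arrives, while the reactive plan only catches up once the metrics move.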
scaling  infrastructure  aws  ec2  netflix  scryer  auto-scaling  aas  metrics  prediction  spikes 
november 2013 by jm
Introducing Chaos to C*
Autoremediation, ie. auto-replacement, of Cassandra nodes in production at Netflix
ops  autoremediation  outages  remediation  cassandra  storage  netflix  chaos-monkey 
october 2013 by jm
Why YouTube buffers: The secret deals that make -- and break -- online video
Should ISPs be required to ensure they have sufficient upstream bandwidth to video sites like YouTube and Netflix?
"Verizon has chosen to sell its customers a product [Netflix] that they hope those customers don't actually use," Schaeffer said. "And when customers use it and request movies, they have not ensured there is adequate connectivity to get that video content back to their customers."
netflix  youtube  streaming  video  isps  net-neutrality  peering  comcast  bandwidth  upstream 
july 2013 by jm
Announcing Zuul: Edge Service in the Cloud
Netflix' library to implement "edge services" -- ie. a front end to their API, web servers, and streaming servers. Some interesting features: dynamic filtering using Groovy scripts; Hystrix for software load balancing, fault tolerance, and error handling for originated HTTP requests; fine-grained service metrics; Archaius for configuration; and canary requests to detect overload risks. Pretty complex though
edge-services  api  netflix  zuul  archaius  canary-requests  http  groovy  hystrix  load-balancing  fault-tolerance  error-handling  configuration 
june 2013 by jm
Netflix ISP Speed Index for Ireland
Via Mulley. Magnet doing well, with UPC coming second; UPC have dropped a fair bit in the past month. Would love to see it broken down by region...
upc  ireland  isps  speed  bandwidth  netflix  broadband  magnet  eircom 
april 2013 by jm
Netflix Curator
a high-level API that greatly simplifies using ZooKeeper. It adds many features that build on ZooKeeper and handles the complexity of managing connections to the ZooKeeper cluster and retrying operations. Some of the features are:

Automatic connection management: There are potential error cases that require ZooKeeper clients to recreate a connection and/or retry operations. Curator automatically and transparently (mostly) handles these cases.

Cleaner API: simplifies the raw ZooKeeper methods, events, etc.; provides a modern, fluent interface

Recipe implementations (see Recipes): Leader election, Shared lock, Path cache and watcher, Distributed Queue, Distributed Priority Queue
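The "retry transparently" part is the bit people tend to hand-roll badly. A rough sketch of retry with exponential backoff, written in Python rather than against Curator's real Java API; the `ConnectionLost` exception and the function signature are assumptions for illustration only.

```python
import random
import time


class ConnectionLost(Exception):
    """Stand-in for a recoverable connection error."""


def retry_with_backoff(operation, max_retries=3, base_sleep=0.1,
                       sleep=time.sleep, rng=random.random):
    """Run `operation`, retrying on ConnectionLost with exponential
    backoff plus jitter. Re-raises once `max_retries` is exhausted."""
    attempt = 0
    while True:
        try:
            return operation()
        except ConnectionLost:
            if attempt >= max_retries:
                raise
            # Double the wait each time; jitter spreads out retry storms.
            sleep(base_sleep * (2 ** attempt) * (1 + rng()))
            attempt += 1
```

`sleep` and `rng` are injectable so the behaviour can be tested without real delays.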
zookeeper  java  netflix  distcomp  libraries  oss  open-source  distributed 
march 2013 by jm
Netflix Queue: Data migration for a high volume web application
There will come a time in the life of most systems serving data, when there is a need to migrate data to [another] data store while maintaining or improving data consistency, latency and efficiency. This document explains the data migration technique we used at Netflix to migrate the user’s queue data between two different distributed NoSQL storage systems [SimpleDB to Cassandra].
cassandra  netflix  migrations  data  schema  simpledb  storage 
march 2013 by jm
Denominator: A Multi-Vendor Interface for DNS
the latest good stuff from Netflix.

Denominator is a portable Java library for manipulating DNS clouds. Denominator has pluggable back-ends, initially including AWS Route53, Neustar Ultra, DynECT, and a mock for testing. We also ship a command line version so it's easy for anyone to try it out.
The reason we built Denominator is that we are working on multi-region failover and traffic sharing patterns to provide higher availability for the streaming service during regional outages caused by our own bugs and AWS issues. To do this we need to directly control the DNS configuration that routes users to each region and each zone. When we looked at the features and vendors in this space we found that we were already using AWS Route53, which has a nice API but is missing some advanced features; Neustar UltraDNS, which has a SOAP based API; and DynECT, which has a REST API that uses a quite different pseudo-transactional model. We couldn’t find a Java based API that grouped together common set of capabilities that we are interested in, so we created one. The idea is that any feature that is supported by more than one vendor API is the highest common denominator, and that functionality can be switched between vendors as needed, or in the event of a DNS vendor outage.
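The "supported by more than one vendor" rule in that last sentence is simple to express. A sketch with an invented feature matrix; the vendor and feature names are placeholders and don't reflect what Route53, UltraDNS or DynECT actually support.

```python
from collections import Counter


def portable_features(vendor_features, min_vendors=2):
    """Features offered by at least `min_vendors` vendors, i.e. ones
    that could be switched between vendors if needed."""
    counts = Counter(
        feature
        for features in vendor_features.values()
        for feature in set(features)
    )
    return {feature for feature, n in counts.items() if n >= min_vendors}


def vendors_supporting(vendor_features, feature):
    """Which vendors could take over a given feature."""
    return sorted(v for v, feats in vendor_features.items() if feature in feats)


# Hypothetical vendors and capabilities.
VENDORS = {
    "vendor-a": ["basic-records", "weighted-routing"],
    "vendor-b": ["basic-records", "geo-routing", "weighted-routing"],
    "vendor-c": ["basic-records", "geo-routing", "vendor-c-only-extra"],
}

common = portable_features(VENDORS)
```

Anything outside `common` is vendor lock-in by definition, since there's nowhere to fail over to.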
dns  netflix  java  tools  ops  route53  aws  ultradns  dynect 
march 2013 by jm
Big Data Analytics at Netflix. Interview with Christos Kalantzis and Jason Brown.
Good interview with the Cassandra guys at Netflix, and some top Mongo-bashing in the comments
cassandra  netflix  user-stories  testimonials  nosql  storage  ec2  mongodb 
february 2013 by jm
UnoDNS
'Watch Netflix USA, Hulu, Pandora, BBC iPlayer, and more in [sic] anywhere you live!' -- seems to use similar techniques to tunlr.net, looks like it works for my Netflix
netflix  dns  tv  tunnelling  drm  networking  spotify  hulu 
february 2013 by jm
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Amazing stuff from Adrian Cockcroft at last week's QCon. Faceted object model, lots of Cassandra automation
cassandra  api  design  oo  object-model  java  adrian-cockroft  slides  qcon  scaling  aws  netflix 
march 2012 by jm
Fault Tolerance in a High Volume, Distributed System
Netflix's "DependencyCommand", a resiliency system for SOA inter-service network calls, offering built-in support for threadpools, timeouts, retries and graceful failover. Very nice
netflix  architecture  concurrency  distributed  failover  ha  resiliency  fail-fast  failsafe  soa  fault-tolerance 
march 2012 by jm
Benchmarking Cassandra Scalability on AWS - Over a million writes per second
Netflix's benchmarks -- impressively detailed. '48, 96, 144 and 288 instances', across 3 EC2 AZs in us-east, successfully scaling linearly
ec2  aws  cassandra  scaling  benchmarks  netflix  performance 
november 2011 by jm
Building with Legos
Netflix tech blog on how they deploy their services. Notably, they avoid the Puppet/Chef approach, citing these reasons: 'One is that it eliminates a number of dependencies in the production environment: a master control server, package repository and client scripts on the servers, network permissions to talk to all of these. Another is that it guarantees that what we test in the test environment is the EXACT same thing that is deployed in production; there is very little chance of configuration or other creep/bit rot. Finally, it means that there is no way for people to change or install things in the production environment (this may seem like a really harsh restriction, but if you can build a new AMI fast enough it doesn't really make a difference).'
devops  cloud  aws  netflix  puppet  chef  deployment 
august 2011 by jm
Netflix Beats BitTorrent’s Bandwidth
'For perhaps the first time in the internet’s history, the largest percentage of the net’s traffic is content that is paid for.' A great demo of how *good*, legit, for-pay services, can beat out less usable, dodgy, but free ones (via Waxy)
via:waxy  piracy  bandwidth  bittorrent  internet  netflix  filesharing 
may 2011 by jm
Netflix: Dev and Ops internals
extensive details on the innards of Netflix' move to AWS, from the legendary Adrian Cockcroft
adrian-cockcroft  aws  netflix  ops  cloud  from delicious
november 2010 by jm

