jm + ops   273

sjk
a command line tool for JVM diagnostic troubleshooting and profiling.
java  jvm  monitoring  commandline  jmx  sjk  tools  ops 
12 days ago by jm
CFSSL
Cloudflare's open source CA/PKI infrastructure app
cloudflare  pki  ca  ssl  tls  ops 
13 days ago by jm
Docker at Shopify: From This-Looks-Fun to Production
Pragmatic evolution story, adding Docker as a packaging/deploy format for an existing production Capistrano/Rails fleet
docker  ops  deployment  packaging  shopify  slides 
14 days ago by jm
Google Cloud Platform announces new Container Registry
Yay. Sensible Docker registry pricing at last. Given the high prices, rough edges and slow performance of the other registry offerings, I'm quite happy to see this.
Google Container Registry helps make it easy for you to store your container images in a private and encrypted registry, built on Cloud Platform. Pricing for storing images in Container Registry is simple: you only pay Google Cloud Storage costs. Pushing images is free, and pulling Docker images within a Google Cloud Platform region is free (Cloud Storage egress cost when outside of a region).

Container Registry is now ready for production use:

* Encrypted and Authenticated - Your container images are encrypted at rest, and access is authenticated using Cloud Platform OAuth and transmitted over SSL
* Fast - Container Registry is fast and can handle the demands of your application, because it is built on Cloud Storage and Cloud Networking.
* Simple - If you’re using Docker, just tag your image with a gcr.io tag and push it to the registry to get started.  Manage your images in the Google Developers Console.
* Local - If your cluster runs in Asia or Europe, you can now store your images in ASIA or EU specific repositories using asia.gcr.io and eu.gcr.io tags.
docker  registry  google  gcp  containers  cloud-storage  ops  deployment 
14 days ago by jm
Automated Nginx Reverse Proxy for Docker
Nice hack. An automated nginx reverse proxy which regenerates as the Docker containers update
nginx  reverse-proxy  proxies  web  http  ops  docker 
20 days ago by jm
Google Cloud Platform Blog: A look inside Google’s Data Center Networks
We used three key principles in designing our datacenter networks:
* We arrange our network around a Clos topology, a network configuration where a collection of smaller (cheaper) switches are arranged to provide the properties of a much larger logical switch.
* We use a centralized software control stack to manage thousands of switches within the data center, making them effectively act as one large fabric.
* We build our own software and hardware using silicon from vendors, relying less on standard Internet protocols and more on custom protocols tailored to the data center.
clos-networks  google  data-centers  networking  sdn  gcp  ops 
20 days ago by jm
Why I dislike systemd
Good post, and hard to disagree.
One of the "features" of systemd is that it allows you to boot a system without needing a shell at all. This seems like such a senseless manoeuvre that I can't help but think of it as a knee-jerk reaction to the perception of Too Much Shell in sysv init scripts.
In exactly which universe is it reasonable to assume that you have a running D-Bus service (or kdbus) and a filesystem containing unit files, all the binaries they refer to, all the libraries they link against, and all the configuration files any of them reference, but that you lack that most ubiquitous of UNIX binaries, /bin/sh?
history  linux  unix  systemd  bsd  system-v  init  ops  dbus 
22 days ago by jm
VPC Flow Logs
we are introducing Flow Logs for the Amazon Virtual Private Cloud.  Once enabled for a particular VPC, VPC subnet, or Elastic Network Interface (ENI), relevant network traffic will be logged to CloudWatch Logs for storage and analysis by your own applications or third-party tools.

You can create alarms that will fire if certain types of traffic are detected; you can also create metrics to help you to identify trends and patterns. The information captured includes information about allowed and denied traffic (based on security group and network ACL rules). It also includes source and destination IP addresses, ports, the IANA protocol number, packet and byte counts, a time interval during which the flow was observed, and an action (ACCEPT or REJECT).
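
To make the record fields above concrete, here is a minimal Python sketch that parses one flow log line and filters for rejected traffic. It assumes the default space-separated layout (version, account id, interface id, then the fields listed above, then a log status); the sample values and the `FlowRecord` / `parse_flow_record` / `rejected` names are hypothetical, and records with missing data are not handled.

```python
from typing import NamedTuple


class FlowRecord(NamedTuple):
    srcaddr: str
    dstaddr: str
    srcport: int
    dstport: int
    protocol: int    # IANA protocol number, e.g. 6 = TCP, 17 = UDP
    packets: int
    byte_count: int
    start: int       # start of the observation interval (Unix seconds)
    end: int         # end of the observation interval (Unix seconds)
    action: str      # "ACCEPT" or "REJECT"


def parse_flow_record(line: str) -> FlowRecord:
    """Parse one record in the assumed 14-field default layout."""
    fields = line.split()
    if len(fields) != 14:
        raise ValueError(f"expected 14 fields, got {len(fields)}")
    # fields[0:3] are version, account id and interface id;
    # fields[13] is the log status. Both are skipped in this sketch.
    (srcaddr, dstaddr, srcport, dstport, protocol,
     packets, byte_count, start, end, action) = fields[3:13]
    return FlowRecord(srcaddr, dstaddr, int(srcport), int(dstport),
                      int(protocol), int(packets), int(byte_count),
                      int(start), int(end), action)


def rejected(lines):
    """Yield only the records for traffic that was denied."""
    for line in lines:
        record = parse_flow_record(line)
        if record.action == "REJECT":
            yield record
```

Feeding `rejected()` the lines pulled from a log stream gives a quick list of denied flows by source address and destination port.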
ec2  aws  vpc  logging  tracing  ops  flow-logs  network  tcpdump  packets  packet-capture 
22 days ago by jm
How We Moved Our API From Ruby to Go and Saved Our Sanity
Parse on their ditching-Rails story. I haven't heard a nice thing about Ruby or Rails as an operational, production-quality platform in a long time :(
go  ruby  rails  ops  parse  languages  platforms 
22 days ago by jm
etcd Clustering in AWS
'a fully-automated solution to build auto-scaling etcd clusters in AWS'
aws  cluster  docker  etcd  asg  autoscaling  ops 
25 days ago by jm
1172401 – Add Amazon root certificates
Well, well -- looks like AWS is about to disrupt PKI, and about time too. If they come up with a Plex-style "provision a cert" API, it'll be revolutionary
pki  ssl  tls  amazon  aws  apis  web-services  ops 
29 days ago by jm
Eric Brewer interview on Kubernetes
What is the relationship between Kubernetes, Borg and Omega (the two internal resource-orchestration systems Google has built)?

I would say, kind of by definition, there’s no shared code but there are shared people.

You can think of Kubernetes — especially some of the elements around pods and labels — as being lessons learned from Borg and Omega that are, frankly, significantly better in Kubernetes. There are things that are going to end up being the same as Borg — like the way we use IP addresses is very similar — but other things, like labels, are actually much better than what we did internally.

I would say that’s a lesson we learned the hard way.
google  architecture  kubernetes  docker  containers  borg  omega  deployment  ops 
7 weeks ago by jm
Deploy a registry - Docker Documentation
Looks like it's pretty feasible to run a private Docker registry on every host, backed by S3 (according to the ECS team's AMA). SPOF-free -- handy
docker  registry  ops  deployment  s3 
8 weeks ago by jm
Migration to, Expectations, and Advanced Tuning of G1GC
Bookmarking for future reference. Recommended by one of the GC experts; I can't recall exactly who ;)
gc  g1gc  jvm  java  tuning  performance  ops  migration 
8 weeks ago by jm
Patterns for building a resilient and scalable microservices platform on AWS
Some good details from Boyan Dimitrov at Hailo on the orchestration, deployment, and provisioning infra they've built
deployment  ops  devops  hailo  microservices  platform  patterns  slides 
8 weeks ago by jm
Why Loggly loves Apache Kafka
Some good factoids about Loggly's Kafka usage and scale
scalability  logging  loggly  kafka  queueing  ops  reliabilty 
8 weeks ago by jm
Cassandra moving to using G1 as the default recommended GC implementation
This is a big indicator that G1 is ready for primetime. CMS has long been the go-to GC for production usage, but requires careful, complex hand-tuning -- if G1 is getting to a stage where it's just a case of giving it enough RAM, that'd be great.

Also, looks like it'll be the JDK9 default: https://twitter.com/shipilev/status/593175793255219200
cassandra  tuning  ops  g1gc  cms  gc  java  jvm  production  performance  memory 
9 weeks ago by jm
OWASP KeyBox
a web-based SSH console that centrally manages administrative access to systems. Web-based administration is combined with management and distribution of user's public SSH keys. Key management and administration is based on profiles assigned to defined users.

Administrators can log in using two-factor authentication with FreeOTP or Google Authenticator. From there they can create and manage public SSH keys or connect to their assigned systems through a web-shell. Commands can be shared across shells to make patching easier and eliminate redundant command execution.
keybox  owasp  security  ssh  tls  ssl  ops 
10 weeks ago by jm
StackShare
'Discover and discuss the best dev tools and cloud infrastructure services' -- fun!
stackshare  architecture  stack  ops  software  ranking  open-source 
10 weeks ago by jm
Kubernetes compared to Borg
'Here are four Kubernetes features that came from our experiences with Borg.'
google  ops  kubernetes  borg  containers  docker  networking 
10 weeks ago by jm
Cluster-Based Architectures Using Docker and Amazon EC2 Container Service
In this post, we’re going to take a deeper dive into the architectural concepts underlying cluster computing using container management frameworks such as ECS. We will show how these frameworks effectively abstract the low-level resources such as CPU, memory, and storage, allowing for highly efficient usage of the nodes in a compute cluster. Building on some of the concepts detailed in the earlier posts, we will discover why containers are such a good fit for this type of abstraction, and how the Amazon EC2 Container Service fits into the larger ecosystem of cluster management frameworks.
docker  aws  ecs  ec2  ops  hosting  containers  mesos  clusters 
10 weeks ago by jm
Amazon EC2 Container Service team AmA
a few answers here. Mostly people pointing out shortcomings and the team asking them to start a thread on their forum though :(
ec2  ecs  docker  aws  ops  ama  reddit 
10 weeks ago by jm
Etsy's Release Management process
Good info on how Etsy use their Deployinator tool, end-to-end.

Slide 11: git SHA is visible for each env, allowing easy verification of what code is deployed.

Slide 14: Code is deployed to "princess" staging env while CI tests are running; no need to wait for unit/CI tests to complete.

Slide 23: smoke tests of pre-prod "princess" (complete after 8 mins elapsed).

Slide 31: dashboard link for deployed code is posted during deploy; post-release prod smoke tests are run by Jenkins. (short ones! they complete in 42 seconds)
deployment  etsy  deploy  deployinator  princess  staging  ops  testing  devops  smoke-tests  production  jenkins 
10 weeks ago by jm
'Continuous Deployment: The Dirty Details'
Good slide deck from Etsy's Mike Brittain regarding their CD setup. Some interesting little-known details:

Slide 41: database schema changes are not CD'd -- they go out on "Schema change Thursdays".

Slide 44: only the webapp is CD'd -- PHP, Apache, memcache components (Etsy.com, support and back-office tools, developer API, gearman async worker queues). The external "services" are not -- databases, Solr/JVM search (rolling restarts), photo storage (filters, proxy cache, S3), payments (PCI-DSS, controlled access).

They avoid schema changes and breaking changes using an approach they call "non-breaking expansions" -- expose new version in a service interface; support multiple versions in the consumer. Example from slides 50-63, based around a database schema migration.

Slide 66: "dev flags" (rollout oriented) are promoted to "feature flags" (long lived degradation control).

Slide 71: some architectural philosophies: deploying is cheap; releasing is cheap; gathering data should be cheap too; treat first iterations as experiments.

Slide 102: "Canary pools". They have multiple pools of users for testing in production -- the staff pool, users who have opted in to see prototypes/beta stuff, 0-100% gradual phased rollout.
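
A rough Python sketch of how the "canary pools" idea could work: staff and opted-in users always see the feature, and everyone else is admitted by a stable hash bucket as the rollout percentage climbs from 0 to 100. This is an illustration under my own assumptions, not Etsy's implementation; the function names and the hashing scheme are hypothetical.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in 0..99 for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def sees_feature(user_id: str, feature: str, percent: int,
                 staff: frozenset = frozenset(),
                 opted_in: frozenset = frozenset()) -> bool:
    """Staff and opt-in users always see the feature; everyone else is
    admitted once the rollout percentage exceeds their bucket."""
    if user_id in staff or user_id in opted_in:
        return True
    return rollout_bucket(user_id, feature) < percent
```

Because the bucket depends only on the feature and user id, a user who sees the feature at 10% keeps seeing it at 50%, which is what makes the rollout gradual rather than random per request.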
cd  deploy  etsy  slides  migrations  database  schema  ops  ci  version-control  feature-flags 
10 weeks ago by jm
Internet Scale Services Checklist
good aspirational checklist, inspired heavily by James Hamilton's seminal 2007 paper, "On Designing And Deploying Internet-Scale Services"
james-hamilton  checklists  ops  internet-scale  architecture  operability  monitoring  reliability  availability  uptime  aspirations 
11 weeks ago by jm
Pinball
Pinterest's Hadoop workflow manager; 'scalable, reliable, simple, extensible' apparently. Hopefully it allows upgrades of a workflow component without breaking an existing run in progress, which is what happens with LinkedIn's Azkaban :(
python  pinterest  hadoop  workflows  ops  pinball  big-data  scheduling 
11 weeks ago by jm
Keywhiz
'a secret management and distribution service [from Square] that is now available for everyone. Keywhiz helps us with infrastructure secrets, including TLS certificates and keys, GPG keyrings, symmetric keys, database credentials, API tokens, and SSH keys for external services — and even some non-secrets like TLS trust stores. Automation with Keywhiz allows us to seamlessly distribute and generate the necessary secrets for our services, which provides a consistent and secure environment, and ultimately helps us ship faster. [...]

Keywhiz has been extremely useful to Square. It’s supported both widespread internal use of cryptography and a dynamic microservice architecture. Initially, Keywhiz use decoupled many amalgamations of configuration from secret content, which made secrets more secure and configuration more accessible. Over time, improvements have led to engineers not even realizing Keywhiz is there. It just works. Please check it out.'
square  security  ops  keys  pki  key-distribution  key-rotation  fuse  linux  deployment  secrets  keywhiz 
11 weeks ago by jm
Yelp Product & Engineering Blog | True Zero Downtime HAProxy Reloads
Using tc and qdisc to delay SYNs while haproxy restarts. Definitely feels like on-host NAT between 2 haproxy processes would be cleaner and easier though!
linux  networking  hacks  yelp  haproxy  uptime  reliability  tcp  tc  qdisc  ops 
12 weeks ago by jm
Optimizing Java CMS garbage collections, its difficulties, and using JTune as a solution | LinkedIn Engineering
I like the sound of this -- automated Java CMS GC tuning, kind of like a free version of JClarity's Censum (via Miguel Ángel Pastor)
java  jvm  tuning  gc  cms  linkedin  performance  ops 
12 weeks ago by jm
outbrain/gruffalo
an asynchronous Netty based graphite proxy. It protects Graphite from the herds of clients by minimizing context switches and interrupts; by batching and aggregating metrics. Gruffalo also allows you to replicate metrics between Graphite installations for DR scenarios, for example.

Gruffalo can easily handle a massive amount of traffic, and thus increase your metrics delivery system availability. At Outbrain, we currently handle over 1700 concurrent connections, and over 2M metrics per minute per instance.
graphite  backpressure  metrics  outbrain  netty  proxies  gruffalo  ops 
12 weeks ago by jm
Introducing Vector: Netflix's On-Host Performance Monitoring Tool
It gives pinpoint real-time performance metric visibility to engineers working on specific hosts -- basically sending back system-level performance data to their browser, where a client-side renderer turns it into a usable dashboard. Essentially the idea is to replace having to ssh onto instances, run "top", sysstat, iostat, and so on.
vector  netflix  performance  monitoring  sysstat  top  iostat  netstat  metrics  ops  dashboards  real-time  linux 
april 2015 by jm
Gil Tene's "usual suspects" to reduce system-level hiccups/latency jitters in a Linux system
Based on empirical evidence (across many tens of sites thus far) and note-comparing with others, I use a list of "usual suspects" that I blame whenever they are not set to my liking and system-level hiccups are detected. Getting these settings right from the start often saves a bunch of playing around (and no, there is no "priority" to this - you should set them all right before looking for more advice...).
performance  latency  hiccups  gil-tene  tuning  mechanical-sympathy  hyperthreading  linux  ops 
april 2015 by jm
Outages, PostMortems, and Human Error 101
Good basic pres from John Allspaw, covering the basics of tier-one tech incident response -- defining the 5 severity levels; root cause analysis techniques (to Five-Whys or not); and the importance of service metrics
devops  monitoring  ops  five-whys  allspaw  slides  etsy  codeascraft  incident-response  incidents  severity  root-cause  postmortems  outages  reliability  techops  tier-one-support 
april 2015 by jm
Cassandra remote code execution hole (CVE-2015-0225)
Ah now lads.
Under its default configuration, Cassandra binds an unauthenticated JMX/RMI interface to all network interfaces. As RMI is an API for the transport and remote execution of serialized Java, anyone with access to this interface can execute arbitrary code as the running user.
cassandra  jmx  rmi  java  ops  security 
april 2015 by jm
How We Scale VividCortex's Backend Systems - High Scalability
Excellent post from Baron Schwartz about their large-scale, 1-second-granularity time series database storage system
time-series  tsd  storage  mysql  sql  baron-schwartz  ops  performance  scalability  scaling  go 
march 2015 by jm
The Four Month Bug: JVM statistics cause garbage collection pauses (evanjones.ca)
Ugh, tying GC safepoints to disk I/O? bad idea:
The JVM by default exports statistics by mmap-ing a file in /tmp (hsperfdata). On Linux, modifying a mmap-ed file can block until disk I/O completes, which can be hundreds of milliseconds. Since the JVM modifies these statistics during garbage collection and safepoints, this causes pauses that are hundreds of milliseconds long. To reduce worst-case pause latencies, add the -XX:+PerfDisableSharedMem JVM flag to disable this feature. This will break tools that read this file, like jstat.
bugs  gc  java  jvm  disk  mmap  latency  ops  jstat 
march 2015 by jm
Transparent huge pages implicated in Redis OOM
A nasty real-world prod error scenario worsened by THPs:
jemalloc(3) extensively uses madvise(2) to notify the operating system that it's done with a range of memory which it had previously malloc'ed. The page size on this machine is 2MB because transparent huge pages are in use. As such, a lot of the memory which is being marked with madvise(..., MADV_DONTNEED) is within substantially smaller ranges than 2MB. This means that the operating system never was able to evict pages which had ranges marked as MADV_DONTNEED because the entire page has to be unneeded to allow a page to be reused. Despite initially looking like a leak, the operating system itself was unable to free memory because of madvise(2) and transparent huge pages. This led to sustained memory pressure on the machine and redis-server eventually getting OOM killed.
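
The arithmetic behind that explanation can be sketched in a few lines of Python. This models only the claim in the quote (a page is reusable only once all of it is marked unneeded), not the kernel's actual accounting; the 4KB base page size and the 64KB range size are my assumptions for illustration.

```python
HUGE_PAGE = 2 * 1024 * 1024   # 2MB page size, as in the quote
SMALL_PAGE = 4 * 1024         # 4KB base page (assumed, typical on x86-64)


def reclaimable_bytes(freed_ranges, page_size):
    """Bytes that can be handed back if a page is only reusable once
    every byte of it has been marked as unneeded.

    freed_ranges is a list of non-overlapping (start, length) byte ranges.
    """
    covered = {}  # page index -> number of freed bytes inside that page
    for start, length in freed_ranges:
        pos, end = start, start + length
        while pos < end:
            page = pos // page_size
            chunk_end = min(end, (page + 1) * page_size)
            covered[page] = covered.get(page, 0) + (chunk_end - pos)
            pos = chunk_end
    full_pages = sum(1 for n in covered.values() if n == page_size)
    return full_pages * page_size


# 512 freed ranges of 64KB each, one at the start of each 2MB region.
ranges = [(i * HUGE_PAGE, 64 * 1024) for i in range(512)]
small = reclaimable_bytes(ranges, SMALL_PAGE)   # every 4KB page is fully freed
huge = reclaimable_bytes(ranges, HUGE_PAGE)     # no 2MB page is fully freed
```

In this model the same 32MB of freed ranges is fully reclaimable with 4KB pages and not reclaimable at all with 2MB pages, which matches the "looks like a leak" symptom described.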
oom-killer  oom  linux  ops  thp  jemalloc  huge-pages  madvise  redis  memory 
march 2015 by jm
tcpcopy
"tees" all TCP traffic from one server to another. "widely used by companies in China"!
testing  benchmarking  performance  tcp  ip  tcpcopy  tee  china  regression-testing  stress-testing  ops 
march 2015 by jm
Heka
an open source stream processing software system developed by Mozilla. Heka is a “Swiss Army Knife” type tool for data processing, useful for a wide variety of different tasks, such as:

* Loading and parsing log files from a file system.
* Accepting statsd type metrics data for aggregation and forwarding to upstream time series data stores such as graphite or InfluxDB.
* Launching external processes to gather operational data from the local system.
* Performing real time analysis, graphing, and anomaly detection on any data flowing through the Heka pipeline.
* Shipping data from one location to another via the use of an external transport (such as AMQP) or directly (via TCP).
* Delivering processed data to one or more persistent data stores.


Via feylya on twitter. Looks potentially nifty
heka  mozilla  monitoring  metrics  via:feylya  ops  statsd  graphite  stream-processing 
march 2015 by jm
soundcloud/lhm
The Large Hadron Migrator is a tool to perform live database migrations in a Rails app without locking.

The basic idea is to perform the migration online while the system is live, without locking the table. In contrast to OAK and the facebook tool, we only use a copy table and triggers. The Large Hadron is a test driven Ruby solution which can easily be dropped into an ActiveRecord or DataMapper migration. It presumes a single auto incremented numerical primary key called id as per the Rails convention. Unlike the twitter solution, it does not require the presence of an indexed updated_at column.
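
The copy-table-plus-triggers idea can be sketched with Python's built-in sqlite3 module. This is only an illustration of the general technique, not LHM's code (LHM targets MySQL from Ruby); the `users` table, the added `email` column, and the `on_chunk` hook used to simulate live writes are all hypothetical.

```python
import sqlite3


def migrate_with_copy_table(conn, chunk_size=1000, on_chunk=None):
    """Add an `email` column to `users` via a copy table and triggers."""
    cur = conn.cursor()
    # 1. Create the copy table with the new schema.
    cur.execute("CREATE TABLE users_new "
                "(id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
    # 2. Triggers replay writes made to the live table while the copy runs.
    cur.execute("""CREATE TRIGGER users_ins AFTER INSERT ON users BEGIN
        INSERT OR REPLACE INTO users_new (id, name)
        VALUES (NEW.id, NEW.name); END""")
    cur.execute("""CREATE TRIGGER users_upd AFTER UPDATE ON users BEGIN
        INSERT OR REPLACE INTO users_new (id, name)
        VALUES (NEW.id, NEW.name); END""")
    cur.execute("""CREATE TRIGGER users_del AFTER DELETE ON users BEGIN
        DELETE FROM users_new WHERE id = OLD.id; END""")
    # 3. Copy existing rows in id-ordered chunks. INSERT OR IGNORE keeps
    #    any fresher row that a trigger has already written.
    last_id = 0
    while True:
        rows = cur.execute(
            "SELECT id FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size)).fetchall()
        if not rows:
            break
        upper = rows[-1][0]
        cur.execute(
            "INSERT OR IGNORE INTO users_new (id, name) "
            "SELECT id, name FROM users WHERE id > ? AND id <= ?",
            (last_id, upper))
        if on_chunk is not None:
            on_chunk(conn)  # hypothetical hook: simulate live traffic
        conn.commit()
        last_id = upper
    # 4. Remove the triggers and swap the tables.
    for name in ("users_ins", "users_upd", "users_del"):
        cur.execute(f"DROP TRIGGER {name}")
    cur.execute("ALTER TABLE users RENAME TO users_old")
    cur.execute("ALTER TABLE users_new RENAME TO users")
    cur.execute("DROP TABLE users_old")
    conn.commit()
```

Chunked copying keeps each write short, and the triggers mean rows changed mid-copy still arrive in the new table; the final swap here is not atomic, which a production tool would need to handle.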
migrations  database  sql  ops  mysql  rails  ruby  lhm  soundcloud  activerecord 
march 2015 by jm
uselessd
A project to reduce systemd to a base initd, process supervisor and transactional dependency system, while minimizing intrusiveness and isolationism. Basically, it’s systemd with the superfluous stuff cut out, a (relatively) coherent idea of what it wants to be, support for non-glibc platforms and an approach that aims to minimize complicated design. uselessd is still in its early stages and it is not recommended for regular use or system integration.


This may be the best option to evade the horrors of systemd.
init  linux  systemd  unix  ops  uselessd 
march 2015 by jm
Ubuntu To Officially Switch To systemd Next Monday - Slashdot
Jesus. This is going to be the biggest shitfest in the history of Linux...
linux  slashdot  ubuntu  systemd  init  unix  ops 
march 2015 by jm
What Color Is Your Xen?
What a mess.
What's faster: PV, HVM, HVM with PV drivers, PVHVM, or PVH? Cloud computing providers using Xen can offer different virtualization "modes", based on paravirtualization (PV), hardware virtual machine (HVM), or a hybrid of them. As a customer, you may be required to choose one of these. So, which one?
ec2  linux  performance  aws  ops  pv  hvm  xen  virtualization 
february 2015 by jm
ssls.com
"Cheap SSL certs from $4.99/yr" -- apparently recommended for cheap, low-end SSL certs
ssl  certs  security  https  ops 
february 2015 by jm
Performance Co-Pilot
System performance metrics framework, plugged by Netflix, open-source for ages
open-source  pcp  performance  system  metrics  ops  red-hat  netflix 
february 2015 by jm
pcp2graphite
A gateway script, now included in PCP
pcp2graphite  pcp  graphite  ops  metrics  system 
february 2015 by jm
Duplicate SSH Keys Everywhere
Poor hardware imaging practices, basically:
It looks like all devices with the fingerprint are Dropbear SSH instances that have been deployed by Telefonica de Espana. It appears that some of their networking equipment comes setup with SSH by default, and the manufacturer decided to re-use the same operating system image across all devices.
crypto  ssh  security  telefonica  imaging  ops  shodan 
february 2015 by jm
yahoo/kafka-manager
A tool for managing Apache Kafka. It supports the following:

* Manage multiple clusters;
* Easy inspection of cluster state (topics, brokers, replica distribution, partition distribution);
* Run preferred replica election;
* Generate partition assignments (based on current state of cluster);
* Run reassignment of partition (based on generated assignments)
yahoo  kafka  ops  tools 
february 2015 by jm
0x74696d | Falling In And Out Of Love with DynamoDB, Part II
Good DynamoDB real-world experience post, via Mitch Garnaat. We should write up ours, although it's pretty scary-stuff-free by comparison
aws  dynamodb  storage  databases  architecture  ops 
february 2015 by jm
TL;DR: Cassandra Java Huge Pages
Al Tobey does some trial runs of -XX:+AlwaysPreTouch and -XX:+UseHugePages
jvm  performance  tuning  huge-pages  vm  ops  cassandra  java 
february 2015 by jm
NA Server Roadmap Update: PoPs, Peering, and the North Bridge
League of Legends has set up private network links to a variety of major US ISPs to avoid internet weather (via Nelson)
via:nelson  peering  games  networks  internet  ops  networking 
january 2015 by jm
How TCP backlog works in Linux
good description of the process
ip  linux  tcp  networking  backlog  ops 
january 2015 by jm
huptime
Nice trick -- wrap servers with a libc wrapper to intercept bind(2) and accept(2) calls, so that transparent restarts become possible
linux  ops  servers  uptime  restarting  libc  bind  accept  sockets 
january 2015 by jm
Maintaining performance in distributed systems [slides]
Great slide deck from Elasticsearch on JVM/dist-sys performance optimization
performance  elasticsearch  java  jvm  ops  tuning 
january 2015 by jm
carbon-c-relay
A much better carbon-relay, written in C rather than Python. Linking as we've been using it in production for quite a while with no problems.
The main reason to build a replacement is performance and configurability. Carbon is single threaded, and sending metrics to multiple consistent-hash clusters requires chaining of relays. This project provides a multithreaded relay which can address multiple targets and clusters for each and every metric based on pattern matches.
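
To illustrate what "multiple targets and clusters for each and every metric based on pattern matches" might look like, here is a small Python sketch: each rule pairs a regex with a consistent-hash ring, and a metric goes to one server in every cluster whose pattern matches. It is not carbon-c-relay's config syntax or hash function; the `HashRing` class, `route` function, and the SHA-256 ring are assumptions for illustration.

```python
import bisect
import hashlib
import re


class HashRing:
    """A small consistent-hash ring over a fixed list of servers."""

    def __init__(self, servers, replicas=100):
        # Each server gets several points on the ring to even out load.
        self._points = sorted(
            (self._hash(f"{server}:{i}"), server)
            for server in servers for i in range(replicas))
        self._keys = [point for point, _ in self._points]

    @staticmethod
    def _hash(value):
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def server_for(self, metric):
        """Pick the first ring point at or after the metric's hash."""
        idx = bisect.bisect(self._keys, self._hash(metric)) % len(self._keys)
        return self._points[idx][1]


def route(metric, rules):
    """Return a (cluster name, server) pair for every matching rule."""
    return [(name, ring.server_for(metric))
            for pattern, name, ring in rules if pattern.search(metric)]
```

With rules evaluated in one pass like this, a single relay can fan a metric out to several clusters, rather than chaining one relay per cluster.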
graphite  carbon  c  python  ops  metrics 
january 2015 by jm
Facette
Really nice time series dashboarding app. Might consider replacing graphitus with this...
time-series  data  visualisation  graphs  ops  dashboards  facette 
january 2015 by jm
AWS Tips I Wish I'd Known Before I Started
Some good advice and guidelines (although some are just silly).
aws  ops  tips  advice  ec2  s3 
january 2015 by jm
Personalization at Spotify using Cassandra
Lots and lots of good detail into the Spotify C* setup (via Bill de hOra)
via:dehora  spotify  cassandra  replication  storage  ops 
january 2015 by jm
Why we don't use a CDN: A story about SPDY and SSL
All of our assets loaded via the CDN [to our client in Australia] in just under 5 seconds. It only took ~2.7s to get those same assets to our friends down under with SPDY. The performance with no CDN blew the CDN performance out of the water. It is just no comparison. In our case, it really seems that the advantages of SPDY greatly outweigh that of a CDN when it comes to speed.
cdn  spdy  nginx  performance  web  ssl  tls  optimization  multiplexing  tcp  ops 
january 2015 by jm
Secure Secure Shell
How to secure SSH, disabling insecure ciphers etc. (via Padraig)
via:pixelbeat  crypto  security  ssh  ops 
january 2015 by jm
EC2 Container Service Hands On
Sounds like a good start, but this isn't great:
There is no native integration with Autoscaling or ELBs.
ec2  containers  docker  ecs  ops 
december 2014 by jm
'Machine Learning: The High-Interest Credit Card of Technical Debt' [PDF]
Oh god yes. This is absolutely spot on, as you would expect from a Google paper -- at this stage they probably have accumulated more real-world ML-at-scale experience than anywhere else.

'Machine learning offers a fantastically powerful toolkit for building complex systems quickly. This paper argues that it is dangerous to think of these quick wins as coming for free. Using the framework of technical debt, we note that it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning. The goal of this paper is highlight several machine learning specific risk factors and design patterns to be avoided or refactored where possible. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, changes in the external world, and a variety of system-level anti-patterns.

[....]

'In this paper, we focus on the system-level interaction between machine learning code and larger systems as an area where hidden technical debt may rapidly accumulate. At a system-level, a machine learning model may subtly erode abstraction boundaries. It may be tempting to re-use input signals in ways that create unintended tight coupling of otherwise disjoint systems. Machine learning packages may often be treated as black boxes, resulting in large masses of “glue code” or calibration layers that can lock in assumptions. Changes in the external world may make models or input signals change behavior in unintended ways, ratcheting up maintenance cost and the burden of any debt. Even monitoring that the system as a whole is operating as intended may be difficult without careful design.

Indeed, a remarkable portion of real-world “machine learning” work is devoted to tackling issues of this form. Paying down technical debt may initially appear less glamorous than research results usually reported in academic ML conferences. But it is critical for long-term system health and enables algorithmic advances and other cutting-edge improvements.'
machine-learning  ml  systems  ops  tech-debt  maintainance  google  papers  hidden-costs  development 
december 2014 by jm
