jm + production   21

Why did Apple, Amazon, Google stocks crash to the same price today?
Nasdaq said in a statement that "certain third parties improperly propagated test data that was distributed as part of the normal evening test procedures."

"For July 3, 2017, all production data was completed by 5:16 PM as expected per the early close of the markets," the statement continued. "Any data messages received post 5:16 PM should be deemed as test data and purged from direct data recipient's databases. UTP (Unlisted Trading Privileges) is asking all third parties to revert to Nasdaq Official Closing Prices effective at 5:16 PM."
testing  fail  stock-markets  nasdaq  test-data  test  production  integration-testing  test-in-prod 
july 2017 by jm
'Rules of Machine Learning: Best Practices for ML Engineering' from Martin Zinkevich
'This document is intended to help those with a basic knowledge of machine learning get the benefit of best practices in machine learning from around Google. It presents a style for machine learning, similar to the Google C++ Style Guide and other popular guides to practical programming. If you have taken a class in machine learning, or built or worked on a machine­-learned model, then you have the necessary background to read this document.'

Full of good tips, if you wind up using ML in a production service.
machine-learning  ml  google  production  coding  best-practices 
january 2017 by jm
Etsy Debriefing Facilitation Guide
by John Allspaw, Morgan Evans and Daniel Schauenberg; the Etsy blameless postmortem style crystallized into a detailed 27-page PDF ebook
etsy  postmortems  blameless  ops  production  debriefing  ebooks 
november 2016 by jm
Commodore 64C going strong after over 25 years in production
C64C used in Polish auto repair shop to balance driveshafts, working non-stop for over a quarter of a century. I'd like to see a Spectrum do _this_ ;)
c64  c64c  commodore  history  poland  auto-repair  production 
september 2016 by jm
QA Instability Implies Production Instability
Invariably, when I see a lot of developer effort in production support I also find an unreliable QA environment. It is both unreliable in that it is frequently not available for testing, and unreliable in the sense that the system’s behavior in QA is not a good predictor of its behavior in production.
qa  testing  architecture  patterns  systems  production 
july 2016 by jm
Ruby in Production: Lessons Learned — Medium
Based on the pain we've had trying to bring our Rails services up to the quality levels required, this looks pretty accurate in many respects. I'd augment this advice by saying: avoid RVM; use Docker.
rvm  docker  ruby  production  rails  ops 
march 2016 by jm
"Hidden Technical Debt in Machine-Learning Systems" [pdf]
Another great paper from Google, discussing the tradeoffs that must be considered in practice when running a complex ML system in production over the long term.
technical-debt  ml  machine-learning  ops  software  production  papers  pdf  google 
december 2015 by jm
You're probably wrong about caching
Excellent cut-out-and-keep guide to whether (and when) you should add a caching layer. I've been following this practice for the past few years, after I realised that #6 (recovering from a failed cache is hard) is a killer -- I've seen a few large-scale outages where a production system had gained enough scale that it required a cache to operate, and once that cache was damaged, bringing the system back online required a painful rewarming protocol. Better to design for the non-cached case if possible.
architecture  caching  coding  design  caches  ops  production  scalability 
september 2015 by jm
Implementing Efficient and Reliable Producers with the Amazon Kinesis Producer Library - AWS Big Data Blog
Good advice on production-quality, decent-scale usage of Kinesis in Java with the official library: batching, retries, partial failures, backoff, and monitoring. (Also, jaysus, the AWS Cloudwatch API is awful, looking at this!)
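
A rough sketch of the kind of usage described, assuming the standard KPL classes (KinesisProducerConfiguration, KinesisProducer, addUserRecord); the stream name, region, partition key and buffering values are placeholders, and retry/backoff is left to the KPL's own config:

    import com.amazonaws.services.kinesis.producer.KinesisProducer;
    import com.amazonaws.services.kinesis.producer.KinesisProducerConfiguration;
    import com.amazonaws.services.kinesis.producer.UserRecordResult;
    import com.google.common.util.concurrent.FutureCallback;
    import com.google.common.util.concurrent.Futures;
    import com.google.common.util.concurrent.ListenableFuture;
    import com.google.common.util.concurrent.MoreExecutors;
    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class KplSketch {
        public static void main(String[] args) {
            // Batching, aggregation and buffering are handled inside the KPL itself.
            KinesisProducerConfiguration config = new KinesisProducerConfiguration()
                    .setRegion("us-east-1")            // placeholder region
                    .setRecordMaxBufferedTime(100)     // ms to buffer before flushing a batch
                    .setAggregationEnabled(true);      // pack many user records per Kinesis record
            KinesisProducer producer = new KinesisProducer(config);

            ByteBuffer data = ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8));
            // "my-stream" and the partition key are placeholders.
            ListenableFuture<UserRecordResult> f =
                    producer.addUserRecord("my-stream", "partition-key-1", data);

            Futures.addCallback(f, new FutureCallback<UserRecordResult>() {
                @Override public void onSuccess(UserRecordResult result) {
                    // result.getAttempts() records the retries the KPL made internally.
                    System.out.println("put to shard " + result.getShardId());
                }
                @Override public void onFailure(Throwable t) {
                    // Partial failures surface here; emit a metric and decide whether to re-enqueue.
                    System.err.println("record failed: " + t);
                }
            }, MoreExecutors.directExecutor());

            producer.flushSync();   // block until buffered records have been sent
            producer.destroy();
        }
    }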
kpl  aws  kinesis  tips  java  batching  streaming  production  cloudwatch  monitoring  coding 
august 2015 by jm
Cassandra moving to using G1 as the default recommended GC implementation
This is a big indicator that G1 is ready for primetime. CMS has long been the go-to GC for production usage, but requires careful, complex hand-tuning -- if G1 is getting to a stage where it's just a case of giving it enough RAM, that'd be great.

Also, looks like it'll be the JDK9 default: https://twitter.com/shipilev/status/593175793255219200
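
For context, the "just give it enough RAM" style of setup amounts to a handful of JVM options rather than the pile of CMS tuning flags; a minimal sketch, where the heap size and pause goal are illustrative values rather than recommendations:

    -XX:+UseG1GC
    -Xms16G -Xmx16G              # fixed heap sized for the workload (illustrative)
    -XX:MaxGCPauseMillis=500     # pause-time goal that G1 tunes itself against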
cassandra  tuning  ops  g1gc  cms  gc  java  jvm  production  performance  memory 
april 2015 by jm
Etsy's Release Management process
Good info on how Etsy use their Deployinator tool, end-to-end.

Slide 11: git SHA is visible for each env, allowing easy verification of what code is deployed.

Slide 14: Code is deployed to "princess" staging env while CI tests are running; no need to wait for unit/CI tests to complete.

Slide 23: smoke tests of pre-prod "princess" (complete after 8 mins elapsed).

Slide 31: dashboard link for deployed code is posted during deploy; post-release prod smoke tests are run by Jenkins. (short ones! they complete in 42 seconds)
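
As an illustration of the slide-11 and slide-31 ideas (not Etsy's actual tooling): a post-deploy smoke test can hit a version endpoint on each environment and compare the reported git SHA against the one just pushed. The /version endpoint and its response format below are assumptions; a minimal sketch:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class DeploySmokeTest {
        public static void main(String[] args) throws Exception {
            String env = args[0];          // e.g. "https://princess.example.com" (hypothetical)
            String expectedSha = args[1];  // the git SHA we just deployed

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest req = HttpRequest.newBuilder()
                    .uri(URI.create(env + "/version"))   // assumed endpoint exposing the deployed SHA
                    .build();
            String deployedSha = client.send(req, HttpResponse.BodyHandlers.ofString()).body().trim();

            if (!deployedSha.startsWith(expectedSha)) {
                System.err.println("FAIL: " + env + " is running " + deployedSha
                        + ", expected " + expectedSha);
                System.exit(1);   // let Jenkins mark the deploy red
            }
            System.out.println("OK: " + env + " is on " + expectedSha);
        }
    }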
deployment  etsy  deploy  deployinator  princess  staging  ops  testing  devops  smoke-tests  production  jenkins 
april 2015 by jm
Is Docker ready for production? Feedbacks of a 2 weeks hands on
I have to agree with this assessment -- there are a lot of loose ends still for production use of Docker in a SOA stack environment:
From my point of view, Docker is probably the best thing I've seen in ages for automating a build. It lets you pre-build and reuse shared dependencies, ensuring they're up to date and reducing your build time. It saves you from either polluting your Jenkins environment or booting a costly, slow VirtualBox virtual machine via Vagrant. But I don't feel it's production-ready in a complex environment, because it adds too much complexity. And I'm not even sure that's what it was designed for.
docker  complexity  devops  ops  production  deployment  soa  web-services  provisioning  networking  logging 
october 2014 by jm
Netflix release new code to production before completing tests
Interesting -- I hadn't heard of this being an official practice anywhere before (although we actually did it ourselves this week)...
If a build has made it [past the 'integration test' phase], it is ready to be deployed to one or more internal environments for user-acceptance testing. Users could be UI developers implementing a new feature using the API, UI Testers performing end-to-end testing or automated UI regression tests. As far as possible, we strive to not have user-acceptance tests be a gating factor for our deployments. We do this by wrapping functionality in Feature Flags so that it is turned off in Production while testing is happening in other environments. 
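
The feature-flag wrapping they describe is conceptually just a guarded code path driven by per-environment configuration. A minimal hand-rolled sketch -- the flag name and code paths are invented, and Netflix's real system is dynamic and far richer:

    import java.util.Map;

    // Minimal feature-flag gate: the new code path ships to production "dark" and is
    // only exercised in environments where the flag is switched on.
    public class FeatureFlagSketch {

        static class FeatureFlags {
            private final Map<String, Boolean> flags;   // in reality: dynamic per-env config, not a static map
            FeatureFlags(Map<String, Boolean> flags) { this.flags = flags; }
            boolean isEnabled(String name) {
                return flags.getOrDefault(name, false); // default off: safe in production
            }
        }

        static String recommendationsFor(FeatureFlags flags, String userId) {
            if (flags.isEnabled("new-recs-algorithm")) {  // on in test/staging, off in prod
                return "new:" + userId;                   // the not-yet-accepted code path
            }
            return "old:" + userId;                       // untouched production path
        }

        public static void main(String[] args) {
            FeatureFlags prod = new FeatureFlags(Map.of("new-recs-algorithm", false));
            FeatureFlags staging = new FeatureFlags(Map.of("new-recs-algorithm", true));
            System.out.println("prod:    " + recommendationsFor(prod, "user-1"));
            System.out.println("staging: " + recommendationsFor(staging, "user-1"));
        }
    }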
devops  deployment  feature-flags  release  testing  integration-tests  uat  qa  production  ops  gating  netflix 
october 2014 by jm
Twitter tech talk video: "Profiling Java In Production"
In this talk Kaushik Srenevasan describes a new, low overhead, full-stack tool (based on the Linux perf profiler and infrastructure built into the Hotspot JVM) we've built at Twitter to solve the problem of dynamically profiling and tracing the behavior of applications (including managed runtimes) in production.


Looks very interesting. Haven't watched it yet, though.
twitter  tech-talks  video  presentations  java  jvm  profiling  testing  monitoring  service-metrics  performance  production  hotspot  perf 
december 2013 by jm
The first pillar of agile sysadmin: We alert on what we draw
'One of [the] purposes of monitoring systems was to provide data to allow us, as engineers, to detect patterns, and predict issues before they become production impacting. In order to do this, we need to be capturing data and storing it somewhere which allows us to analyse it. If we care about it - if the data could provide the kind of engineering insight which helps us to understand our systems and give early warning - we should be capturing it.'

'There are a couple of weaknesses in [Nagios' design]. Assuming we’ve agreed that if we care about a metric enough to want to alert on it then we should be gathering that data for analysis, and graphing it, then we already have the data upon which to base our check. Furthermore, this data is not on the machine we’re monitoring, so our checks don’t in any way add further stress to that machine.'

I would add that if we are alerting on a different set of data from what we collect for graphing, then using the graphs to investigate an alarm may run into problems if they don't sync up.
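
A sketch of the pattern, assuming a Graphite-style render API: the check queries the metrics store the graphs are drawn from, not the monitored host, so the alert and the graph can never disagree. The Graphite URL, metric name and threshold are illustrative:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Alert on what we draw: threshold-check the same series the dashboards graph,
    // by querying the metrics store (Graphite here) instead of the host itself.
    public class GraphBasedCheck {
        public static void main(String[] args) throws Exception {
            String url = "http://graphite.example.com/render"
                    + "?target=collectd.web01.load.load.shortterm"   // illustrative metric path
                    + "&from=-5min&format=raw";     // raw format: "target,start,end,step|v1,v2,..."
            double threshold = 8.0;                 // illustrative alert threshold

            HttpClient client = HttpClient.newHttpClient();
            String body = client.send(HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString()).body().trim();

            String[] values = body.substring(body.indexOf('|') + 1).split(",");
            String last = values[values.length - 1];
            double current = "None".equals(last) ? 0.0 : Double.parseDouble(last);

            if (current > threshold) {
                System.out.println("CRITICAL: load is " + current + " (threshold " + threshold + ")");
                System.exit(2);   // Nagios-style exit code for CRITICAL
            }
            System.out.println("OK: load is " + current);
            System.exit(0);
        }
    }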
devops  monitoring  deployment  production  sysadmin  ops  alerting  metrics 
march 2013 by jm
SpaceX software dev practices
Metrics rule the roost -- I guess there's been a long history of telemetry in space applications.

To make software more visible, you need to know what it is doing, he said, which means creating "metrics on everything you can think of".... Those metrics should cover areas like performance, network utilization, CPU load, and so on.

The metrics gathered, whether from testing or real-world use, should be stored as it is "incredibly valuable" to be able to go back through them, he said. For his systems, telemetry data is stored with the program metrics, as is the version of all of the code running so that everything can be reproduced if needed.

SpaceX has programs to parse the metrics data and raise an alarm when "something goes bad". It is important to automate that, Rose said, because forcing a human to do it "would suck". The same programs run on the data whether it is generated from a developer's test, from a run on the spacecraft, or from a mission. Any failures should be seen as an opportunity to add new metrics. It takes a while to "get into the rhythm" of doing so, but it is "very useful". He likes to "geek out on error reporting", using tools like libSegFault and ftrace.

Automation is important, and continuous integration is "very valuable", Rose said. He suggested building for every platform all of the time, even for "things you don't use any more". SpaceX does that and has found interesting problems when building unused code. Unit tests are run from the continuous integration system any time the code changes. "Everyone here has 100% unit test coverage", he joked, but running whatever tests are available, and creating new ones is useful. When he worked on video games, they had a test to just "warp" the character to random locations in a level and had it look in the four directions, which regularly found problems.

"Automate process processes", he said. Things like coding standards, static analysis, spaces vs. tabs, or detecting the use of Emacs should be done automatically. SpaceX has a complicated process where changes cannot be made without tickets, code review, signoffs, and so forth, but all of that is checked automatically. If static analysis is part of the workflow, make it such that the code will not build unless it passes that analysis step.

When the build fails, it should "fail loudly" with a "monitor that starts flashing red" and email to everyone on the team. When that happens, you should "respond immediately" to fix the problem. In his team, they have a full-size Justin Bieber cutout that gets placed facing the team member who broke the build. They found that "100% of software engineers don't like Justin Bieber", and will work quickly to fix the build problem.
spacex  dev  coding  metrics  deployment  production  space  justin-bieber 
march 2013 by jm
Percona Playback's tcpdump plugin
Capture MySQL traffic via tcpdump and tee it over the network to replay against a second database. It even reproduces query execution times and the pauses between queries, so the replay exercises the database at the same load level.
tcpdump  production  load-testing  testing  staging  tee  networking  netcat  percona  replay  mysql 
march 2013 by jm
GitHub outage post-mortem
Their continuous-integration system was accidentally run against the production DB; result: the entire production database got wiped. Ouuuuch.
ouch  github  outages  post-mortem  databases  testing  c-i  production  firewalls 
november 2010 by jm
