jm + sysadmin   45

My Philosophy on Alerting
'based on my observations while I was a Site Reliability Engineer at Google', courtesy of Rob Ewaschuk. Seems pretty reasonable
monitoring  sysadmin  alerting  alerts  nagios  pager  ops  sre  rob-ewaschuk 
july 2016 by jm
USE Method: Linux Performance Checklist
Really late in bookmarking this, but has some up-to-date sample commandlines for sar, mpstat and iostat on linux
linux  sar  iostat  mpstat  cli  ops  sysadmin  performance  tuning  use  metrics 
june 2016 by jm
A curated list of Docker resources.
linux  sysadmin  docker  ops  devops  containers  hosting 
november 2014 by jm
curl | sh
'People telling people to execute arbitrary code over the network. Run code from our servers as root. But HTTPS, so it’s no biggie.'

humor  sysadmin  ops  security  curl  bash  npm  rvm  chef 
november 2014 by jm
a simple, lightweight HTTP server for storing and distributing custom Debian packages around your organisation. It is designed to make it as easy as possible to use Debian packages for code deployments and to ease other system administration tasks.
debian  apt  sysadmin  linux  ops  packaging 
october 2014 by jm
'a system for allowing servers with encrypted root file systems to reboot unattended and/or remotely.' (via Tony Finch)
via:fanf  mandos  encryption  security  server  ops  sysadmin  linux 
october 2014 by jm
Applying cardiac alarm management techniques to your on-call
An ops-focused take on a recent story about alarm fatigue, and how a Boston hospital dealt with it. When I was at Amazon, many of the teams in our division had a target to reduce false-positive pages, with a definite monetary value attached to it, since many teams had "time off in lieu" payments for out-of-hours pages to the on-call staff. As a result, reducing false-positive pages was reasonably high priority and we dealt with this problem very proactively, with a well-developed sense of how to do so. It's interesting to see how the outside world is only just starting to look into its amelioration. (Another benefit of a TOIL policy ;)
ops  monitoring  sysadmin  alerts  alarms  nagios  alarm-fatigue  false-positives  pages 
september 2014 by jm
Dead Man's Snitch
a cron job monitoring tool that keeps an eye on your periodic processes and notifies you when something doesn't happen. Daily backups, monthly emails, or cron jobs you need to monitor? Dead Man's Snitch has you covered. Know immediately when one of these processes doesn't work.

via Marc.
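The underlying "dead man's switch" idea is easy to sketch in Python: the job checks in on success, and the monitor complains when a check-in is overdue. This is only an illustration of the pattern, not Dead Man's Snitch's actual API; the class and method names are made up.

```python
import time


class DeadMansSwitch:
    """Flags a periodic job as overdue when it has not checked in recently.

    Hypothetical helper for illustration only.
    """

    def __init__(self, max_interval_seconds, clock=time.time):
        self.max_interval_seconds = max_interval_seconds
        self.clock = clock  # injectable, so the logic can be tested without sleeping
        self.last_check_in = clock()

    def check_in(self):
        """Called by the job (e.g. as the last step of a cron entry) on success."""
        self.last_check_in = self.clock()

    def is_overdue(self):
        """True when no check-in has arrived within the allowed interval."""
        return self.clock() - self.last_check_in > self.max_interval_seconds
```

For a daily backup, something like DeadMansSwitch(max_interval_seconds=25 * 3600) would leave an hour of slack before is_overdue() turns true.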
alerts  cron  monitoring  sysadmin  ops  backups  alarms 
april 2014 by jm
The little ssh that (sometimes) couldn't - Mina Naguib
A good demonstration of what it looks like when network-level packet corruption occurs on a TCP connection
ssh  sysadmin  networking  tcp  bugs  bit-flips  cosmic-rays  corruption  packet 
april 2014 by jm
Load Balancer Testing with a Honeypot Daemon
nice post on writing BDD unit tests for infrastructure, in this case specifically a load balancer (via Devops Weekly)
load-balancers  ops  devops  sysadmin  testing  unit-tests  networking  honeypot  infrastructure  bdd 
december 2013 by jm
Docker: Git for deployment
Docker is to deployment as Git is to development.

Developers are able to leverage Git's performance and flexibility when building applications. Git encourages experiments and doesn't punish you when things go wrong: start your experiments in a branch, if things fall down, just git rebase or git reset. It's easy to start a branch and fast to push it.

Docker encourages experimentation for operations. Containers start quickly. Building images is a snap. Using another image as a base image is easy. Deploying whole images is fast, and last but not least, it's not painful to roll back.

Fast + flexible = deployments are about to become a lot more enjoyable.
docker  deployment  sysadmin  ops  devops  vms  vagrant  virtualization  containers  linux  git 
august 2013 by jm
How to configure ntpd so it will not move time backwards
The "-x" switch will expand the step/slew boundary from 128ms to 600 seconds, ensuring the time is slewed (drifted slowly towards the correct time at a max of 0.5ms per second) rather than "stepped" (a sudden jump, potentially backwards). Since slewing only ever adjusts the clock rate, by at most 0.5ms per second, time never "jumps backwards" for offsets under that 600-second boundary, which is important to avoid some major application bugs (particularly in Java timers).
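A rough Python sketch of what that trade-off costs in practice. It assumes the 500 ppm (0.5ms per second) slew cap that the ntpd documentation gives for typical Unix kernels; the rate is a parameter, so plug in a different one if your platform differs. The function names are made up for illustration.

```python
def slew_duration_seconds(offset_seconds, max_slew_ppm=500):
    """Seconds of wall-clock time needed to slew away an offset.

    max_slew_ppm=500 means 0.5ms of correction per elapsed second
    (assumed default; check your platform).
    """
    return abs(offset_seconds) * 1_000_000 / max_slew_ppm


def will_step(offset_seconds, step_threshold_seconds=0.128):
    """True if the offset exceeds the step threshold.

    The threshold is 0.128s by default and 600s when ntpd runs with -x.
    """
    return abs(offset_seconds) > step_threshold_seconds
```

Under those assumptions a 1-second offset takes 2000 seconds to slew away, and an offset right at the 600-second boundary would take 1,200,000 seconds (nearly 14 days), so -x is a choice of monotonic time over fast convergence.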
ntpd  time  ntp  ops  sysadmin  slew  stepping  time-synchronization  linux  unix  java  bugs 
august 2013 by jm
metric collectors for various stuff not (or poorly) handled by other monitoring daemons

Core of the project is a simple daemon (harvestd), which collects metric values and sends them to graphite carbon daemon (and/or other configured destinations) once per interval. Includes separate data collection components ("collectors") for processing of:

/proc/slabinfo for useful-to-watch values, not everything (configurable).
/proc/vmstat and /proc/meminfo in a consistent way.
/proc/stat for irq, softirq, forks.
/proc/buddyinfo and /proc/pagetypeinfo (memory fragmentation).
/proc/interrupts and /proc/softirqs.
Cron log to produce start/finish events and duration for each job into separate metrics, adapts jobs to metric names with regexes.
Per-system-service accounting using systemd and its cgroups.
sysstat data from sadc logs (use something like sadc -F -L -S DISK -S XDISK -S POWER 60 to have more stuff logged there) via sadf binary and its JSON export (sadf -j, supported since sysstat-10.0.something, iirc).
iptables rule "hits" packet and byte counters, taken from ip{,6}tables-save, mapped via separate "table chain_name rule_no metric_name" file, which should be generated along with firewall rules (I use this script to do that).
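
To give a feel for what one such collector boils down to, here is a minimal Python sketch that parses /proc/vmstat-style text ("name value" per line) and formats it in carbon's plaintext line format ("path value timestamp"). It is not harvestd's code; the function names and the metric prefix are made up for illustration.

```python
import time


def parse_key_value_counters(text):
    """Parse '/proc/vmstat'-style text: one 'name value' pair per line.

    Lines that do not look like 'name integer' are skipped.
    """
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].lstrip("-").isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def to_graphite_lines(counters, prefix, timestamp=None):
    """Format counters as carbon plaintext lines: '<path> <value> <timestamp>'."""
    if timestamp is None:
        timestamp = int(time.time())
    return [
        f"{prefix}.{name} {value} {timestamp}"
        for name, value in sorted(counters.items())
    ]
```

Each line, newline-terminated, would then be written to carbon's plaintext listener (port 2003 in a default install) once per interval.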

Pretty exhaustive list of system metrics -- could have some interesting ideas for Linux OS-level metrics to monitor in future.
graphite  monitoring  metrics  unix  linux  ops  vm  iptables  sysadmin 
june 2013 by jm
Care and Feeding of Large Scale Graphite Installations [slides]
good docs for large-scale graphite use: 'Tips and tricks of using and scaling graphite. First presented at DevOpsDays Austin Texas 2013-05-01'
graphite  devops  ops  metrics  dashboards  sysadmin 
june 2013 by jm
Measuring & Optimizing I/O Performance
Another good writeup on iostat and EBS, from Ilya Grigorik
io  optimization  sysadmin  performance  iostat  ebs  aws  ops 
may 2013 by jm
TCP Tune
These notes are intended to help users and system administrators maximize TCP/IP performance on their computer systems. They summarize all of the end-system (computer system) network tuning issues including a tutorial on TCP tuning, easy configuration checks for non-experts, and a repository of operating system specific instructions for getting the best possible network performance on these platforms.

Some tips for maximizing HPC network performance for the intra-DC case; recommended by the LinkedIn Kafka operations page.
tuning  network  tcp  sysadmin  performance  ops  kafka  ec2 
april 2013 by jm
The first pillar of agile sysadmin: We alert on what we draw
'One of [the] purposes of monitoring systems was to provide data to allow us, as engineers, to detect patterns, and predict issues before they become production impacting. In order to do this, we need to be capturing data and storing it somewhere which allows us to analyse it. If we care about it - if the data could provide the kind of engineering insight which helps us to understand our systems and give early warning - we should be capturing it.' ... 'There are a couple of weaknesses in [Nagios' design]. Assuming we’ve agreed that if we care about a metric enough to want to alert on it then we should be gathering that data for analysis, and graphing it, then we already have the data upon which to base our check. Furthermore, this data is not on the machine we’re monitoring, so our checks don’t in any way add further stress to that machine.' I would add that if we are alerting on a different set of data from what we collect for graphing, then using the graphs to investigate an alarm may run into problems if they don't sync up.
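A small Python sketch of "alerting on what we draw": the check runs against the same datapoints the graphs are drawn from, rather than probing the monitored host. It assumes the data arrives in the shape of Graphite's render output with format=json (a list of series, each with a "target" and [value, timestamp] datapoints); the function name and the thresholds are made up for illustration.

```python
def targets_over_threshold(render_response, threshold, min_points=3):
    """Return targets whose latest `min_points` non-null values all exceed threshold.

    `render_response` is the parsed JSON of a Graphite render call:
    [{"target": ..., "datapoints": [[value, timestamp], ...]}, ...]
    Requiring several consecutive points avoids paging on a single spike.
    """
    alerting = []
    for series in render_response:
        values = [v for v, _ts in series["datapoints"] if v is not None]
        recent = values[-min_points:]
        if len(recent) == min_points and all(v > threshold for v in recent):
            alerting.append(series["target"])
    return alerting
```

Because the check and the graph share one data source, whatever the on-call person sees on the dashboard is exactly what triggered the page.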
devops  monitoring  deployment  production  sysadmin  ops  alerting  metrics 
march 2013 by jm
First 5 Minutes Troubleshooting A Server
quite a good checklist of first steps for troubleshooting. Worth bookmarking for "dstat --top-io --top-bio" alone, which is an absolutely excellent tool and new to me
dstat  server  io  disks  hardware  performance  linux  sysadmin  ops  troubleshooting  checklists  root-cause 
march 2013 by jm
Test-Driven Infrastructure with Chef
Interesting idea.
The book introduces “Infrastructure as Code,” test-driven development, Chef, and cucumber-chef, and then proceeds to a simple example using Chef to provision a shared Linux server. The recipes for the server are developed test-first, demonstrating both the technique and the workflow.
tdd  chef  server  provisioning  build  deploy  linux  coding  ops  sysadmin 
march 2013 by jm
AWS Advent 2012
'an annual exploration of Amazon Web Services.' Some great hacks here
aws  amazon  advent  sysadmin  s3  ec2  chef  puppet  ops 
december 2012 by jm
Shell Scripts Are Like Gremlins
Shell Scripts are like Gremlins. You start out with one adorably cute shell script. You commented it and it does one thing really well. It’s easy to read, everyone can use it. It’s awesome! Then you accidentally spill some water on it, or feed it late one night and omgwtf is happening!?

+1. I have to wean myself off the habit of automating with shell scripts where a clean, well-unit-tested piece of code would work better.
shell-scripts  scripting  coding  automation  sysadmin  devops  chef  deployment 
december 2012 by jm
'SSH-Based Configuration Management & Deployment'. deploy via SSH; no target-side daemons required. GPLv3 licensed, unfortunately :(
ansible  devops  configuration  deployment  sysadmin  python  ssh 
july 2012 by jm
how to restore from iCloud backup
the trick: don't try and do it through iTunes, it won't give you the option, apparently. I have a carrier unlock, and apparently need to wipe the phone for it to take place; this scares the crap out of me
backup  iphone  restore  sysadmin  phones  icloud  apple  howto 
june 2012 by jm
Autometrics: Self-service metrics collection
how LinkedIn built a service-metrics collection and graphing infrastructure using Kafka and Zookeeper, writing to RRD files, handling 8.8k metrics per datacenter per second
kafka  zookeeper  linkedin  sysadmin  service-metrics 
february 2012 by jm
Taming the OOM killer
hmm, I never knew about oom_adj, useful (via Peter Blair)
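For reference, a minimal Python sketch of using it: oom_adj is a per-process file under /proc taking values from -17 (never kill this process) to 15 (kill it first). The helper name is made up, and proc_root is only a parameter so the function can be tried against a scratch directory. Lowering the value normally needs root, and newer kernels deprecate oom_adj in favour of oom_score_adj (range -1000 to 1000).

```python
import os


def set_oom_adj(pid, value, proc_root="/proc"):
    """Write an oom_adj value for a process and return the path written.

    -17 exempts the process from the OOM killer; 15 makes it the preferred victim.
    Hypothetical helper for illustration only.
    """
    if not -17 <= value <= 15:
        raise ValueError("oom_adj must be between -17 and 15")
    path = os.path.join(proc_root, str(pid), "oom_adj")
    with open(path, "w") as f:
        f.write(str(value))
    return path
```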
via:petermblair  oom  linux  memory  oom-killer  sysadmin  lwn  from delicious
january 2011 by jm
'Free open source self-hosted log management and exception tracking', loggly-style.  Basically, a nifty web data-mining UI on your syslogs (via adulau)
logging  syslog  sysadmin  mongodb  opensource  via:adulau  logs  web  ui  data-mining  from delicious
january 2011 by jm
MySQL/PostgreSQL admin helper tools -- check replication status, archive, analyse logs, find deadlocks
sysadmin  db  mysql  replication  maatkit  dba  from delicious
october 2010 by jm
wraps strace(1) to summarise and aggregate I/O ops performed by a Linux process. looks pretty nifty (via Jeremy Zawodny)
via:jzawodny  io  strace  linux  monitoring  debugging  performance  profiling  sysadmin  ioprofile  unix  tools  from delicious
october 2010 by jm
Foursquare MongoDB outage post mortem
MongoDB was set up to write to RAM if possible, omitting immediate writes to disk -- but then the db size exceeded RAM size, the disk was hit, imposing a massive slowdown and creating a huge backlog immediately, bringing the site down (via Nelson)
via:nelson  mongodb  sharding  nosql  ouch  outage  foursquare  sysadmin  ops  from delicious
october 2010 by jm
Mongrel2 Says, "Goodbye Python"
Linux distros ship ancient Python interpreters, hence it's impossible to rely on recent language features because they won't be there, making it useless to write code in Python. We have similar problems in perl-land, but it's easy enough to get by without the latest-and-greatest; maybe Python is different in that regard? ... or is it Zed?
zed-shaw  python  mongrel  distros  linux  sysadmin  packaging  from delicious
september 2010 by jm
Mac OS X command-line tricks
not quite up to par with modern Ubuntu, but still a few interesting ones here for when I'm stuck using the missus' laptop ;)
apple  bash  cli  osx  mac  sysadmin  shell  tricks  command-line  from delicious
july 2010 by jm
practical Linux commands quick-ref sheet
from Padraig Brady. lots of nice one-liners I wasn't familiar with
padraig-brady  bash  cli  linux  reference  sysadmin  tips  commands  from delicious
june 2010 by jm
pwnat - NAT to NAT client-server communication
'a proxy server that works behind a NAT, even when the client is behind a NAT, without any 3rd party'. nifty, by Samy "MySpace worm" Kamkar
samy-kamkar  apps  firewall  ip  nat  networking  pwnat  stun  traversal  tcp  sysadmin  tunneling  udp  from delicious
march 2010 by jm
'Lsyncd uses rsync to synchronize local directories with a remote machine running rsyncd. Lsyncd watches multiple directory trees through inotify. The first step after adding the watches is to rsync all directories with the remote host, and then sync single files by collecting the inotify events. So lsyncd is a light-weight live mirror solution that should be easy to install and use while blending well with your system.' (via adulau)
via:adulau  lsyncd  mirroring  linux  inotify  backup  sysadmin  synchronization  sync  dropbox  from delicious
december 2009 by jm
Postfix - (almost) a satellite system
how to keep a small number of user accounts (ie. root) delivering locally while the rest are delivered to a smarthost
postfix  sysadmin  unix  mail  mta  smtp  from delicious
september 2009 by jm
A short history of btrfs
wow, sounds good! looking forward to this hitting production-ready status
btrfs  history  zfs  linux  open-source  licensing  storage  sysadmin  b-trees  b+trees  algorithms  fs  filesystems 
august 2009 by jm
Public SSL Server Database
'an online service that enables you to look up the configuration of any public SSL web server. The configuration of known public SSL web servers will be periodically inspected and the results recorded. This service relies on the SSL Server Rating guide for the assessment'
ssl  grades  security  tls  https  servers  sysadmin  ssl-labs 
july 2009 by jm
Infrastructures.Org: Best Practices in Automated Systems Administration and Infrastructure Architecture: Gold Server
well-written, and it's good to see version control listed right at the top of the list. But quite dead; interesting for historical reasons only at this stage
via:fanf  deployment  sysadmin  unix  rsync  ssh  cvs  infrastructure  cfengine 
july 2009 by jm
glTail.rb - realtime logfile visualization
'View real-time data and statistics from any logfile on any server with SSH, in an intuitive and entertaining way', supporting postfix/spamd/clamd logs among loads of others. very cool if a little silly
dataviz  visualization  tail  gltail  opengl  linux  apache  spamd  spamassassin  logs  statistics  sysadmin  analytics  animation  analysis  server  ruby  monitoring  logging  logfiles 
july 2009 by jm
