sre   933


Preliminary Analysis of the Site Reliability Engineer Survey
If the response takes too long to get to your phone, the system might as well be "unavailable":

'If a page takes too long to load, a user will consider it to be unavailable. I realized after the fact that the nuances of this were not considered in the phrasing of one of our questions. We asked “What service level indicators are most important for your services?” Three of the options were end-user response time, latency, and availability. I view availability as the system being up or down, latency as the delay before a response is generated, and end-user response time as how long it takes before the user receives the information they wanted. If an error message appears or the page fails to load, an application is unavailable. If a page takes 10 seconds to load, it’s available but incredibly frustrating to use. For SREs, availability means more than whether a system is up or down. If the response time or latency exceeds a certain threshold, the application is considered unavailable.'
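[A minimal sketch of the distinction drawn in the quote, not code from the survey or the post. The `Request` type, the `availability` function, and the 2000 ms threshold are all hypothetical names and values chosen for illustration. It compares the plain "up or down" view of availability with one where a response slower than a threshold also counts as unavailable.]

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool           # False if an error appeared or the page failed to load
    latency_ms: float  # time until the user received the response


def availability(requests, threshold_ms=None):
    """Return the fraction of requests counted as available.

    With threshold_ms=None, only errors count against availability
    (the "up or down" view). With a threshold, a successful response
    slower than the threshold is also counted as unavailable.
    """
    if not requests:
        raise ValueError("no requests to evaluate")
    good = sum(
        1
        for r in requests
        if r.ok and (threshold_ms is None or r.latency_ms <= threshold_ms)
    )
    return good / len(requests)


# Illustrative data only: one fast page, one 10-second page, one error, one fast page.
sample = [
    Request(ok=True, latency_ms=120),
    Request(ok=True, latency_ms=10_000),
    Request(ok=False, latency_ms=50),
    Request(ok=True, latency_ms=300),
]

up_or_down = availability(sample)                        # errors only
latency_aware = availability(sample, threshold_ms=2000)  # assumed threshold
print(up_or_down, latency_aware)
```

[On this sample the 10-second page counts as available in the first measurement and as unavailable in the second, which is the gap between the two definitions that the quote describes.]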
sre  monitoring  metrics  itmanagement  availability  SLAs  suveys 
6 days ago by cote
Monitoring SRE's Golden Signals
Lists out how to get the metrics from various systems and software.
sre  monitoring  metrics  itmanagement 
6 days ago by cote
How to Monitor the SRE Golden Signals
[Summary of the metrics to use, from the post:]

Rate — Request rate, in requests/sec
Errors — Error rate, in errors/sec
Latency — Response time, including queue/wait time, in milliseconds.
Saturation — How overloaded something is, which is related to utilization but more directly measured by things like queue depth (or sometimes concurrency). As a queue measurement, this becomes non-zero when you are saturated, often not much before. Usually a counter.
Utilization — How busy the resource or system is. Usually expressed 0–100% and most useful for predictions (as Saturation is probably more useful). Note we are not using the Utilization Law to get this (~Rate x Service Time / Workers), but instead looking for more familiar direct measurements.
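[A rough sketch of how the five signals above might be computed for one observation window. This is not taken from the post: `Sample`, `golden_signals`, `utilization_law`, and all input values are hypothetical. Queue depth and busy-worker counts are assumed to come from direct measurements, as the summary suggests. The Utilization Law estimate is included only for comparison.]

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Sample:
    latency_ms: float  # response time, including queue/wait time
    error: bool        # True if the request failed


def golden_signals(samples, window_s, queue_depth, busy_workers, total_workers):
    """Summarize one observation window into the five signals."""
    if window_s <= 0 or total_workers <= 0:
        raise ValueError("window_s and total_workers must be positive")
    errors = sum(1 for s in samples if s.error)
    return {
        "rate_per_s": len(samples) / window_s,
        "errors_per_s": errors / window_s,
        # Median is used here for brevity; percentiles are a common alternative.
        "latency_ms": median(s.latency_ms for s in samples) if samples else None,
        # Saturation as a direct queue measurement.
        "saturation_queue_depth": queue_depth,
        # Utilization as a direct measurement, expressed 0-100%.
        "utilization_pct": 100.0 * busy_workers / total_workers,
    }


def utilization_law(rate_per_s, service_time_s, workers):
    """Estimate utilization as Rate x Service Time / Workers (a 0-1 fraction)."""
    return rate_per_s * service_time_s / workers


# Illustrative data only: four requests in a 2-second window, one of them an error.
window = [
    Sample(latency_ms=100.0, error=False),
    Sample(latency_ms=200.0, error=False),
    Sample(latency_ms=300.0, error=True),
    Sample(latency_ms=400.0, error=False),
]
signals = golden_signals(
    window, window_s=2.0, queue_depth=3, busy_workers=6, total_workers=8
)
print(signals)
print(utilization_law(rate_per_s=2.0, service_time_s=3.0, workers=8))
```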
devops  metrics  sre  monitoring 
6 days ago by cote
DevOps Chat: SRE w/ Stig Sorensen of Bloomberg
Stig Sorensen of Bloomberg talks to us about how SRE is allowing Bloomberg to stay ahead in the news and information world.
devops  sre  bloomberg 
11 days ago by tekbuddha
How Complex Web Systems Fail — Part 1 – Production Ready – Medium
There’s this one paper that keeps popping up on my radar. I think it’s about time I give it the attention it deserves. I’m talking about How Complex Systems Fail by Richard Cook. This seminal paper…
chaos-engineering  systems  sre  webdev  distributed_systems 
11 days ago by sbellef
