The Two Metrics You Need
When interviewing candidates for Instacart’s first site reliability engineer, I volunteered to cover monitoring as one of my topics. I’d start by asking “What metrics should we be monitoring?”
One candidate gave an answer that astounded me. He said,
There are only two things I care about: errors and latency
- Errors: the count of responses with a 5xx status code
- Latency: measured across all requests, as an average or a 95th percentile
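Both must be measured at the load balancer, because it sees every request, including those that fail before reaching the application. Errors therefore include those generated by the application and those generated by the load balancer itself.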
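To make the two definitions concrete, here is a minimal Python sketch that computes both from a list of request records, such as a load balancer log might provide. The `Request` record, its field names, and the nearest-rank percentile method are my own assumptions for illustration; the candidate did not specify any of them.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record for one request as seen by the load balancer.
    status: int        # HTTP status code returned to the client
    latency_ms: float  # time taken to serve the request, in milliseconds


def error_count(requests):
    """Count responses whose status code is in the 5xx range."""
    return sum(1 for r in requests if 500 <= r.status <= 599)


def average_latency(requests):
    """Mean latency across all requests (assumes at least one request)."""
    return sum(r.latency_ms for r in requests) / len(requests)


def percentile_latency(requests, pct=95):
    """Nearest-rank percentile of latency (assumes at least one request)."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]
```

The average is cheap to compute but hides slow outliers, which is why the 95th percentile is offered as an alternative.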
Place alerts on these two metrics to detect problems with the health of your site. Alerting on them is significantly more effective than relying only on services that monitor a few endpoints, though you should use those as well.
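An alert on these metrics can be as simple as a pair of threshold checks per evaluation window. The sketch below is one way to express that in Python; the function name and the default limits (10 errors, 500 ms) are placeholders I made up, and sensible values depend entirely on your traffic.

```python
def check_alerts(errors, latency_ms, max_errors=10, max_latency_ms=500.0):
    """Return alert messages for one evaluation window.

    The default thresholds are illustrative placeholders, not recommendations.
    """
    alerts = []
    if errors > max_errors:
        alerts.append(f"errors: {errors} 5xx responses (limit {max_errors})")
    if latency_ms > max_latency_ms:
        alerts.append(
            f"latency: {latency_ms:.0f} ms (limit {max_latency_ms:.0f} ms)"
        )
    return alerts
```

An empty list means the window is healthy; anything else is what you would send to your paging or chat tool.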
Here’s how to get them on a few services.
CloudWatch provides both for free:
- Errors = Sum ELB 5XXs + Sum HTTP 5XXs
- Latency = Average Latency
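If you would rather pull these numbers programmatically than read them off the console, a sketch along the following lines should work with `boto3`. I am assuming a Classic Load Balancer, whose metrics live in the `AWS/ELB` namespace as `HTTPCode_ELB_5XX`, `HTTPCode_Backend_5XX`, and `Latency` (reported in seconds); mapping "HTTP 5XXs" to the backend metric is my reading of the console label. The helper names and the five-minute window are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def metric_query(load_balancer, metric_name, statistic, minutes=5):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ELB",  # assumes a Classic Load Balancer
        "MetricName": metric_name,
        "Dimensions": [{"Name": "LoadBalancerName", "Value": load_balancer}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # one datapoint per minute
        "Statistics": [statistic],
    }


def fetch(load_balancer, metric_name, statistic):
    """Return the metric's datapoint values, oldest first."""
    import boto3  # third-party; needs AWS credentials configured

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **metric_query(load_balancer, metric_name, statistic)
    )
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [p[statistic] for p in points]


def errors_and_latency(load_balancer):
    """Total 5xx count and per-minute average latency (seconds)."""
    errors = sum(fetch(load_balancer, "HTTPCode_ELB_5XX", "Sum")) + sum(
        fetch(load_balancer, "HTTPCode_Backend_5XX", "Sum")
    )
    latency_seconds = fetch(load_balancer, "Latency", "Average")
    return errors, latency_seconds
```

Calling `errors_and_latency("my-load-balancer")` (a made-up name) returns the error total for the window and a list of per-minute latency averages. The 5xx metrics only report datapoints when errors occur, so an empty result simply sums to zero.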
heroku addons:create librato:development
- Errors = Sum of
- Latency = Sum of