The Two Metrics You Need
When interviewing candidates to be Instacart’s first site reliability engineer, I volunteered to cover monitoring as one of my topics. I’d start by asking “What metrics should we be monitoring?”
One candidate gave an answer that astounded me. He said,
“There are only two things I care about: errors and latency.”
More specifically:
- errors: the number of responses with a 5xx status code
- latency: the average or 95th percentile across all requests
Both must be measured at the load balancer. Errors include those generated by the application and by the load balancer.
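To make the two definitions concrete, here is a minimal sketch of how they could be computed from a batch of requests seen at the load balancer. The `Request` record and the function names are my own illustration, not part of any service's API, and the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One request as seen by the load balancer (hypothetical record shape)."""
    status: int         # HTTP status code returned to the client
    latency_ms: float   # total time to serve the request, in milliseconds


def error_count(requests):
    """Count 5xx responses, whether the app or the load balancer produced them."""
    return sum(1 for r in requests if 500 <= r.status <= 599)


def average_latency(requests):
    """Mean latency across all requests (assumes a non-empty batch)."""
    return sum(r.latency_ms for r in requests) / len(requests)


def percentile_latency(requests, pct=95):
    """Nearest-rank percentile: the smallest latency that at least
    pct% of requests are at or below (assumes a non-empty batch)."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]
```

Note that latency is taken over every request, including the failed ones, which matches the "across all requests" wording above.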
Place alerts on these metrics to detect problems with the health of your site. This is significantly more effective than relying only on services that monitor a few endpoints (though you should use those as well).
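An alert on these two metrics can be as simple as a pair of threshold checks. The sketch below is illustrative only: the `Thresholds` defaults and the `alerts_for` function are hypothetical, and the right limits depend entirely on your own traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Both limits are hypothetical placeholders; tune them to your traffic.
    max_errors_per_minute: int = 10
    max_latency_ms: float = 1000.0


def alerts_for(errors_per_minute, latency_ms, limits=Thresholds()):
    """Return one message for every threshold that is breached."""
    alerts = []
    if errors_per_minute > limits.max_errors_per_minute:
        alerts.append(
            f"5xx errors: {errors_per_minute}/min exceeds "
            f"{limits.max_errors_per_minute}/min"
        )
    if latency_ms > limits.max_latency_ms:
        alerts.append(
            f"latency: {latency_ms:.0f} ms exceeds {limits.max_latency_ms:.0f} ms"
        )
    return alerts
```

An empty list means the site looks healthy; anything else is worth paging someone about.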
Here’s how to get both metrics on a few services.
Amazon ELB
CloudWatch provides both for free.
- Errors = Sum ELB 5XXs + Sum HTTP 5XXs
- Latency = Average Latency
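If you would rather pull these numbers programmatically, the sketch below builds the parameters for CloudWatch's GetMetricStatistics call. I'm assuming a Classic Load Balancer, where the two error counts above correspond to the `HTTPCode_ELB_5XX` and `HTTPCode_Backend_5XX` metrics and latency to `Latency` in the `AWS/ELB` namespace. The `build_queries` helper and the load balancer name are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Classic ELB metrics live in the AWS/ELB namespace (assumption: Classic ELB).
NAMESPACE = "AWS/ELB"

# (metric name, statistic) pairs matching the two definitions above.
ERROR_METRICS = [("HTTPCode_ELB_5XX", "Sum"), ("HTTPCode_Backend_5XX", "Sum")]
LATENCY_METRIC = ("Latency", "Average")


def build_queries(load_balancer_name, end=None, minutes=5):
    """Build one set of GetMetricStatistics parameters per metric."""
    end = end or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    return [
        {
            "Namespace": NAMESPACE,
            "MetricName": metric,
            "Dimensions": [
                {"Name": "LoadBalancerName", "Value": load_balancer_name}
            ],
            "StartTime": start,
            "EndTime": end,
            "Period": 60,  # one datapoint per minute
            "Statistics": [statistic],
        }
        for metric, statistic in ERROR_METRICS + [LATENCY_METRIC]
    ]
```

With boto3 installed and credentials configured, each dictionary can be passed along as `boto3.client("cloudwatch").get_metric_statistics(**query)`. Add the `Sum` values of the two error metrics together to get Errors. Keep in mind that ELB reports `Latency` in seconds, and, as far as I know, the 5XX metrics only produce datapoints when they are non-zero, so a missing datapoint should be read as zero.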
Heroku
Add Librato.
heroku addons:create librato:development
- Errors = Sum of router.status.5xx
- Latency = Sum of router.service.perc95 and router.connect.perc95