We recently introduced a new component - or rather set of components - to our infrastructure to improve our ability to monitor the operations of Way to Health. Prometheus and Grafana give us clear visual metrics for how things are working under the hood.
Before I dive into the details of these new tools, it's worth summarizing some of the infrastructure they're built upon.
Business logic in w2h runs in one of four contexts - synchronously in a web request, as a scheduled task, as a daemon, or via a queue.
As a quick example of that last context: requesting a data export pushes an exportDataJob onto a queue, which will get processed in a first-in-first-out fashion by a queue worker.

That fourth category (of queue jobs) operates via beanstalkd using the Laravel queue component. Beanstalkd is a simple and rock-solid queue manager. It doesn't have all the fancy features of Amazon SQS or RabbitMQ, but it does the job and is simple to run.
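For a sense of what that looks like in code, here's a minimal sketch of queueing a job with Laravel (the ExportDataJob class and its constructor argument are illustrative stand-ins, not our actual job classes):

```php
<?php

use App\Jobs\ExportDataJob;

// Serialize the job and push it onto the default beanstalkd tube;
// a queue worker will reserve and run it first-in-first-out.
dispatch(new ExportDataJob($exportRequest));

// Jobs can also target a specific tube (a "queue" in Laravel terms),
// such as the exports tube that appears in the metrics below.
dispatch(new ExportDataJob($exportRequest))->onQueue('exports');
```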
Prior to introducing Grafana and Prometheus, our application monitoring centered primarily around health.json. Our application exposes a /health.json endpoint which contains information about each of the underlying components of our system - backend microservices, scheduled tasks, daemons, queues, and so on. Each component has configured thresholds for pass/warn/fail - for example, if a task scheduled to run hourly hasn't run in an hour and a half it warns, and after 3 hours it fails. Our default queue warns if it exceeds 1000 jobs waiting, or if jobs have been waiting for more than 15 minutes. These thresholds are chosen and adjusted manually, so they produce some false positives and negatives.
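As a rough illustration of the idea (the exact field names and layout of our health.json differ; this is just the shape of the thing):

```json
{
  "status": "warn",
  "components": {
    "task:hourly-sync": {
      "status": "pass",
      "last_run_minutes_ago": 42,
      "warn_after_minutes": 90,
      "fail_after_minutes": 180
    },
    "queue:default": {
      "status": "warn",
      "jobs_ready": 1220,
      "oldest_job_minutes": 18,
      "warn_jobs_ready": 1000,
      "warn_oldest_minutes": 15
    }
  }
}
```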
We run a monitoring tool (Sensu) which checks that endpoint every 5 minutes and sends a message to Slack if it returns a status of fail.
This alerts us quickly to errors, but it has a few limitations: the thresholds are hand-picked and prone to false positives and negatives, the pass/warn/fail statuses discard the underlying numbers, and a point-in-time check gives us no history for spotting trends.

We chose Prometheus because it focuses on quantitative time-series data rather than forcing us too quickly into fail/warn/pass categorization. (And because it's easy to run via Docker.)
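As a sketch of what "easy to run via Docker" means in practice, a minimal docker-compose setup along these lines would work (service names, ports, and volume paths here are the stock defaults for the official images, not necessarily our exact configuration):

```yaml
version: "3"
services:
  prometheus:
    image: prom/prometheus
    volumes:
      # Scrape configuration; see the snippet later in this post.
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"   # Prometheus UI and API

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"   # Grafana UI
```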
Both Grafana and Prometheus can do alerting; the general recommendation seems to be to set up alerts within Grafana rather than Prometheus. To date (July 2021), we're not using either for alerts yet, just for visualizing metrics.
Prometheus expects to get its metrics in a specific text-based format. By convention it's served at http://some-hostname/metrics, and it needs to look something like the example below. For apps that don't speak this language by default, you typically use an "exporter" that pulls data from the application and returns it in this format.
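On the Prometheus side, pointing the server at an exporter is a short scrape config. This is standard prometheus.yml syntax; the target address assumes the exporter runs as a container named beanstalkd-exporter, and the port will depend on how the exporter is configured:

```yaml
scrape_configs:
  - job_name: beanstalkd
    scrape_interval: 15s
    static_configs:
      # Hostname and port are assumptions based on a dockerized setup.
      - targets: ['beanstalkd-exporter:8080']
```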
Below is a shortened version of what's returned by beanstalkd_exporter. In it, you see a mix of:

- Current stats (e.g. current_jobs_ready), which you'd graph directly as a time series.
- Cumulative counts (e.g. cmd_delete, total_jobs), which you'd typically graph as a rate of change rather than directly.
# HELP cmd_delete is the cumulative number of delete commands.
# TYPE cmd_delete gauge
cmd_delete{instance="beanstalkd:11300"} 888629
# HELP current_jobs_ready is the number of jobs in the ready queue.
# TYPE current_jobs_ready gauge
current_jobs_ready{instance="beanstalkd:11300"} 836
# HELP total_jobs is the cumulative count of jobs created.
# TYPE total_jobs gauge
total_jobs{instance="beanstalkd:11300"} 885380
# HELP tube_cmd_delete is the cumulative number of delete commands for this tube.
# TYPE tube_cmd_delete gauge
tube_cmd_delete{instance="beanstalkd:11300",tube="backfill"} 0
tube_cmd_delete{instance="beanstalkd:11300",tube="default"} 32390
tube_cmd_delete{instance="beanstalkd:11300",tube="events0"} 63914
tube_cmd_delete{instance="beanstalkd:11300",tube="events_progressive0"} 0
tube_cmd_delete{instance="beanstalkd:11300",tube="exports"} 0
tube_cmd_delete{instance="beanstalkd:11300",tube="high"} 13
tube_cmd_delete{instance="beanstalkd:11300",tube="low"} 16
tube_cmd_delete{instance="beanstalkd:11300",tube="sync"} 14938
tube_cmd_delete{instance="beanstalkd:11300",tube="sync_fitbit"} 51560
tube_cmd_delete{instance="beanstalkd:11300",tube="sync_fitbit2"} 52609
tube_cmd_delete{instance="beanstalkd:11300",tube="sync_fitbit3"} 51171
tube_cmd_delete{instance="beanstalkd:11300",tube="sync_fitbit4"} 49166
tube_cmd_delete{instance="beanstalkd:11300",tube="sync_fitbit5"} 51103
# HELP tube_current_jobs_ready is the number of jobs in the ready queue in this tube.
# TYPE tube_current_jobs_ready gauge
tube_current_jobs_ready{instance="beanstalkd:11300",tube="backfill"} 0
tube_current_jobs_ready{instance="beanstalkd:11300",tube="default"} 11
tube_current_jobs_ready{instance="beanstalkd:11300",tube="events0"} 744
...
Looking at the graph of what jobs wait in the queue, you see lots of fitbit jobs waiting, with occasional spikes of events to be processed (usually on or shortly after the hour). Based on this, one might assume these make up the vast majority of the jobs going through our queue. The second graph below shows otherwise - the fitbit jobs are being processed constantly at a low rate, but there are lots of default jobs that don't show up in the top graph at all.
Estimating job volume based on the first graph is like trying to estimate airport volume by photographing the check-in or security lines. You might conclude, "I almost never see any first class passengers waiting in line to check in - there must not be very many of them". In truth, there are lots of first class passengers, but the first class check-in line is staffed to process them right away. The "fitbit" jobs are analogous to the coach passengers, who clear the line just in time for the next flight's passengers to arrive.