Behind the Scenes at Jungle Disk - Autoscaling SQS Workers
TL;DR
Check out the code repo for an example Terraform config that uses Ansible in the user data script to set up a worker that terminates itself.
Autoscale up
AWS autoscaling is an excellent tool for managing a dynamic pool of workers. For stateless work, e.g. serving a website, letting autoscaling both expand and contract the worker pool works very well. For stateful work such as processing jobs off a queue, however, letting autoscaling contract the pool is undesirable, since it may terminate a worker that is in the middle of a job. Instead, the workers themselves should manage their own removal from the autoscaling group.
This gist shows a snippet that watches a queue and adds a worker when the queue has contained at least one message for more than 5 minutes.
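A rough sketch of such a watcher, assuming the AWS CLI is configured on the host; the queue URL and group name are placeholders:

```bash
#!/bin/bash
# Hypothetical scale-up watcher: if the queue stays non-empty for
# five consecutive one-minute checks (> 5 min), add one worker.
QUEUE_URL="https://sqs.us-east-1.amazonaws.com/123456789012/job-queue"  # placeholder
ASG_NAME="worker-pool"                                                  # placeholder
streak=0

while true; do
  count=$(aws sqs get-queue-attributes \
    --queue-url "$QUEUE_URL" \
    --attribute-names ApproximateNumberOfMessages \
    --query 'Attributes.ApproximateNumberOfMessages' --output text)

  if [ "$count" -gt 0 ]; then
    streak=$((streak + 1))
  else
    streak=0
  fi

  if [ "$streak" -ge 5 ]; then
    desired=$(aws autoscaling describe-auto-scaling-groups \
      --auto-scaling-group-names "$ASG_NAME" \
      --query 'AutoScalingGroups[0].DesiredCapacity' --output text)
    aws autoscaling set-desired-capacity \
      --auto-scaling-group-name "$ASG_NAME" \
      --desired-capacity $((desired + 1))
    streak=0
  fi
  sleep 60
done
```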
Ansible on deploy
By embedding an Ansible playbook in the Terraform config, instances created by the autoscaling policy can deploy themselves automatically.
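In outline, the wiring might look like this sketch; the resource names, AMI, and bucket are placeholders, and the real config lives in the repo:

```hcl
# Upload the playbook so new instances can fetch it at boot.
resource "aws_s3_bucket_object" "playbook" {
  bucket = "example-deploy-bucket"
  key    = "worker/playbook.yml"
  source = "playbook.yml"
}

resource "aws_launch_configuration" "worker" {
  name_prefix          = "worker-"
  image_id             = "ami-00000000" # placeholder AMI
  instance_type        = "t2.micro"
  iam_instance_profile = aws_iam_instance_profile.worker.name
  user_data            = file("user_data.sh") # installs Ansible, runs the playbook
}
```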
The instance_profile uses an IAM config granting the SQS, S3, and autoscaling permissions the scripts below rely on.
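A minimal policy sketch with placeholder ARNs (the actual policy is in the code repo):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"],
      "Resource": "arn:aws:sqs:us-east-1:123456789012:job-queue"
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-deploy-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": "autoscaling:TerminateInstanceInAutoScalingGroup",
      "Resource": "*"
    }
  ]
}
```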
NOTE: See the code repo for the ssh_keys auto-deployment.
The user_data.sh script installs Ansible and copies down the playbook from the S3 bucket. It then calls Ansible locally to configure the instance.
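A sketch of what that boot script could look like, assuming yum-based packaging and the same placeholder bucket as above:

```bash
#!/bin/bash
# user_data.sh runs once at first boot.
set -euo pipefail

# Install Ansible and the AWS CLI (package names vary by distro).
yum install -y ansible awscli

# Pull down the playbook that Terraform uploaded to S3.
mkdir -p /opt/deploy
aws s3 cp s3://example-deploy-bucket/worker/playbook.yml /opt/deploy/playbook.yml

# Run the playbook against this instance only.
ansible-playbook --connection=local --inventory 127.0.0.1, /opt/deploy/playbook.yml
```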
The Ansible play gathers some facts from the local environment and configures a set of systemd services and scripts.
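In outline, the play might resemble the following; the task names and file layout are assumptions, not the repo's actual contents:

```yaml
- hosts: all
  become: true
  tasks:
    - name: Install the worker scripts
      copy:
        src: "{{ item }}"
        dest: "/usr/local/bin/{{ item }}"
        mode: "0755"
      loop: [wrapper.sh, queue-manager.sh, seppuku.sh]

    - name: Install the systemd units
      copy:
        src: "{{ item }}"
        dest: "/etc/systemd/system/{{ item }}"
      loop: [worker.service, seppuku.service]

    - name: Start the worker service now and on every boot
      systemd:
        name: worker.service
        state: started
        enabled: true
        daemon_reload: true

    - name: Enable (but do not start) the seppuku service
      systemd:
        name: seppuku.service
        enabled: true
```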
The wrapper.sh script simply calls a worker program with a job_id. This script is a good place to put any environment variables needed for the worker program to function.
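A sketch of the wrapper; the worker binary path and the environment variables are stand-ins:

```bash
#!/bin/bash
# wrapper.sh <job_id> -- run the worker for a single job.
set -euo pipefail

# Environment the worker needs lives here, in one place (placeholders).
export DATABASE_URL="postgres://example"
export LOG_LEVEL="info"

exec /usr/local/bin/worker "$1"
```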
The queue-manager.sh script long polls the SQS job queue, pulls out the job_id, and passes it to the wrapper.sh script. After the worker process has exited, it removes the job from the queue.
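A sketch of a single pass, assuming the job_id is the raw message body, jq is installed, and the same placeholder queue URL as above:

```bash
#!/bin/bash
# queue-manager.sh -- process at most one job, then exit.
# systemd restarts the service, so each run is one attempt.
set -euo pipefail
QUEUE_URL="https://sqs.us-east-1.amazonaws.com/123456789012/job-queue"

# Long poll: wait up to 20 seconds for a message to arrive.
msg=$(aws sqs receive-message \
  --queue-url "$QUEUE_URL" \
  --wait-time-seconds 20 \
  --max-number-of-messages 1 \
  --query 'Messages[0]' --output json)

[ "$msg" = "null" ] && exit 0 # queue was empty this round

job_id=$(echo "$msg" | jq -r '.Body')
receipt=$(echo "$msg" | jq -r '.ReceiptHandle')

# Run the worker; set -e means the delete below is skipped on failure,
# so the message becomes visible again after its visibility timeout.
/usr/local/bin/wrapper.sh "$job_id"
aws sqs delete-message --queue-url "$QUEUE_URL" --receipt-handle "$receipt"
```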
The seppuku.sh script discovers its instance_id from the metadata service and then terminates itself via the autoscaling group, decrementing the desired capacity. This logic serves to “scale in” the group when jobs are scarce.
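A sketch of the script, assuming the CLI's region is configured on the instance:

```bash
#!/bin/bash
# seppuku.sh -- remove this instance from its autoscaling group.
set -euo pipefail

# Ask the EC2 metadata service who we are.
instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

# Terminate this instance and shrink the group's desired capacity by one,
# so autoscaling does not immediately launch a replacement.
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id "$instance_id" \
  --should-decrement-desired-capacity
```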
How the seppuku.sh script gets triggered abuses a bit of systemd magic. The seppuku.service is enabled, but not started, by Ansible. This means that the next time the server boots up, it will start the seppuku.service.
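The unit itself can be tiny; this sketch assumes the script lives in /usr/local/bin:

```ini
# /etc/systemd/system/seppuku.service
# Enabled but never started, so it only fires on the next boot.
[Unit]
Description=Terminate this instance via the autoscaling group
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/seppuku.sh

[Install]
WantedBy=multi-user.target
```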
The worker.service relies on the fact that the queue-manager.sh script always exits after one attempt to process the queue. It is set up to restart the service without delay, but if the service restarts more than StartLimitBurst times within StartLimitIntervalSec seconds, systemd reboots the server, causing the seppuku.service to run, destroying the instance and “scaling in” the worker pool.
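Put together, the unit might look like this sketch; the concrete limits match the numbers discussed below:

```ini
# /etc/systemd/system/worker.service
[Unit]
Description=Process jobs from the SQS queue
# 75 rapid restarts within 50 minutes means the queue has gone quiet;
# reboot so seppuku.service can retire the instance.
StartLimitIntervalSec=3000
StartLimitBurst=75
StartLimitAction=reboot

[Service]
ExecStart=/usr/local/bin/queue-manager.sh
Restart=always
RestartSec=0

[Install]
WantedBy=multi-user.target
```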
Since AWS charges for a full hour even for a fractional hour, it is most cost-effective for the workers to work as much of each hour as possible. The Ansible variables therefore configure the service to restart up to 75 times in 50 minutes, which leaves about 10 minutes for the server to reboot and remove itself at the end of the first hour.