Behind the Scenes at Jungle Disk - Autoscaling SQS Workers
Check out the code repo for an example Terraform config that uses Ansible in the user-data script to set up a worker that terminates itself.
AWS autoscaling is an excellent tool for managing a dynamic pool of workers. For stateless work, e.g. serving a website, allowing autoscaling to expand and contract the worker pool works very well. However, for stateful work such as processing jobs off a queue, allowing autoscaling to contract the worker pool is undesirable, since it may terminate a worker that is currently processing a job. Instead, the workers themselves should manage removing themselves from the autoscaling group.
This gist shows a snippet that watches a queue and adds a worker when the queue has contained at least one message for more than 5 minutes.
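The gist has the real code; the idea can be sketched as follows. The queue URL, group name, and one-minute check interval are assumptions, not the gist's actual values:

```shell
#!/usr/bin/env bash
# Sketch of the scale-out watcher (QUEUE_URL, ASG_NAME, and the check
# interval are assumed values).
set -euo pipefail

QUEUE_URL="${QUEUE_URL:-https://sqs.us-east-1.amazonaws.com/123456789012/jobs}"
ASG_NAME="${ASG_NAME:-worker-pool}"
THRESHOLD_CHECKS=5   # five one-minute checks = non-empty for > 5 minutes

# Pure decision helper: scale out once the queue has stayed non-empty
# for THRESHOLD_CHECKS consecutive checks.
should_scale_out() {
  [ "$1" -ge "$THRESHOLD_CHECKS" ]
}

watch_queue() {
  local consecutive=0
  while true; do
    local msgs
    msgs=$(aws sqs get-queue-attributes \
      --queue-url "$QUEUE_URL" \
      --attribute-names ApproximateNumberOfMessages \
      --query 'Attributes.ApproximateNumberOfMessages' --output text)
    if [ "$msgs" -gt 0 ]; then
      consecutive=$((consecutive + 1))
    else
      consecutive=0
    fi
    if should_scale_out "$consecutive"; then
      # Add one worker by bumping the desired capacity; the workers
      # handle removing themselves.
      local desired
      desired=$(aws autoscaling describe-auto-scaling-groups \
        --auto-scaling-group-names "$ASG_NAME" \
        --query 'AutoScalingGroups[0].DesiredCapacity' --output text)
      aws autoscaling set-desired-capacity \
        --auto-scaling-group-name "$ASG_NAME" \
        --desired-capacity $((desired + 1))
      consecutive=0
    fi
    sleep 60
  done
}
# watch_queue   # left inert in this sketch
```

Note the watcher only ever scales out; scaling in is left entirely to the workers.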
Ansible on deploy
By embedding an Ansible playbook in the Terraform config, instances created by the autoscaling policy can auto-deploy themselves:
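A rough sketch of the Terraform wiring, with assumed resource names and values (the AMI variable, instance type, and zone are placeholders):

```hcl
# Sketch: attach the bootstrap script to every instance the group launches.
resource "aws_launch_configuration" "worker" {
  name_prefix          = "worker-"
  image_id             = var.worker_ami            # assumed variable
  instance_type        = "c5.large"                # assumed size
  iam_instance_profile = aws_iam_instance_profile.worker.name
  user_data            = file("${path.module}/user_data.sh")

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "worker" {
  name                 = "worker-pool"
  launch_configuration = aws_launch_configuration.worker.name
  availability_zones   = ["us-east-1a"]            # assumed zone
  min_size             = 0
  max_size             = 10
}
```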
The instance_profile uses the following IAM config:
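The exact policy lives in the repo; a plausible sketch of the permissions the worker needs is below. The queue ARN and bucket name are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": "arn:aws:sqs:us-east-1:123456789012:jobs"
    },
    {
      "Effect": "Allow",
      "Action": "autoscaling:TerminateInstanceInAutoScalingGroup",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::deploy-bucket/*"
    }
  ]
}
```

The autoscaling permission is what lets the instance remove itself from the group later on.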
NOTE: See the code repo for the ssh_keys auto-deployment.
The user_data.sh script installs Ansible and copies down the playbook from the S3 bucket. It then calls Ansible locally to configure the instance.
The Ansible play gathers some facts from the local environment and configures a set of systemd services and scripts.
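The shape of such a play might look like the following sketch; the script names match the sections below, but the paths and unit names are assumptions:

```yaml
# Sketch of the local play (paths and unit names assumed).
- hosts: localhost
  become: true
  tasks:
    - name: Install worker scripts
      copy:
        src: "{{ item }}"
        dest: "/opt/worker/{{ item }}"
        mode: "0755"
      loop: [wrapper.sh, queue-manager.sh, seppuku.sh]

    - name: Enable and start the worker service
      systemd:
        name: worker.service
        enabled: true
        state: started
        daemon_reload: true

    - name: Enable seppuku.service but do not start it
      systemd:
        name: seppuku.service
        enabled: true
        state: stopped
```

The last task is the important one: enabled-but-stopped is what arms the self-termination trick described below.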
The wrapper.sh script simply calls a worker program with a job_id. This script is a good place to put any environment variables needed for the worker program to function.
The queue-manager.sh script long polls the SQS job queue, pulls out the job_id, and passes it to the wrapper.sh script. After the worker process has exited, it removes the job from the queue.
The seppuku.sh script discovers its instance_id from the metadata service and then terminates itself via the autoscaling group, decrementing the desired capacity. This logic serves to “scale in” the group when jobs are no longer available.
How the seppuku.sh script gets triggered abuses a bit of systemd magic. The seppuku.service is enabled, but not started, by Ansible. This means that the next time the server boots up, systemd will start the seppuku.service.
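Such a unit might look like the following sketch (unit description and script path are assumptions):

```ini
# Sketch of seppuku.service: enabled but never started by Ansible, so it
# only ever runs at the next boot.
[Unit]
Description=Terminate this instance via the autoscaling group
After=network-online.target

[Service]
Type=oneshot
ExecStart=/opt/worker/seppuku.sh

[Install]
WantedBy=multi-user.target
```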
The worker.service relies on the fact that the queue-manager.sh script always exits after one attempt to process the queue. It is set up to always restart the service without delay, but if it restarts the service more than the StartLimitBurst value in StartLimitIntervalSec seconds, it will reboot the server, causing the seppuku.service to run, destroying the instance and “scaling in” the worker pool.
Since AWS charges for a full hour even for a fractional hour, it is in the best interest of cost for the workers to work as much of the hour as possible. Therefore, the Ansible variables configure the service to restart up to 75 times in 50 minutes. This leaves about 10 minutes for the server to reboot and remove itself at the end of the first hour.
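Assuming those numbers, the worker unit might look like this sketch (the description and script path are placeholders; the start-limit settings encode 75 restarts in 50 minutes):

```ini
# Sketch of worker.service: restart freely, but once the queue has been
# empty long enough to trip the start limit, reboot into seppuku.service.
[Unit]
Description=Process one SQS job per run
StartLimitIntervalSec=3000
StartLimitBurst=75
StartLimitAction=reboot

[Service]
Type=simple
ExecStart=/opt/worker/queue-manager.sh
Restart=always
RestartSec=0

[Install]
WantedBy=multi-user.target
```

`StartLimitAction=reboot` is the piece that converts “too many quick restarts” into the reboot that triggers the enabled-but-stopped seppuku.service.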