Behind the Scenes at Jungle Disk - Autoscaling SQS Workers


Check out the code repo for an example Terraform config that uses Ansible in the user-script to set up a worker that terminates itself.

Autoscale up

AWS autoscaling is an excellent tool for managing a dynamic pool of workers. For stateless work, e.g. serving a website, letting autoscaling expand and contract the worker pool works very well. However, for stateful work such as processing jobs off a queue, letting autoscaling contract the worker pool is undesirable, since it may terminate a worker that is in the middle of a job. Instead, the workers themselves should manage removing themselves from the autoscaling group.

This gist shows a snippet that watches a queue and adds a worker when the queue has contained at least one message for more than 5 minutes.
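As a rough illustration of the scale-up trigger described above (not the gist itself), the alarm can be expressed as the keyword arguments for CloudWatch's `put_metric_alarm` call. The alarm name, the choice of the `Minimum` statistic over one 300-second period, and the `scale_up_policy_arn` parameter are assumptions made for this sketch.

```python
def queue_backlog_alarm(queue_name, scale_up_policy_arn):
    """Build kwargs for CloudWatch put_metric_alarm (sketch, assumptions noted)."""
    return {
        # Hypothetical naming scheme, not taken from the original config.
        "AlarmName": f"{queue_name}-backlog",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        # Assumption: "Minimum >= 1" over the whole window approximates
        # "the queue held at least one message for 5 minutes".
        "Statistic": "Minimum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # The scaling policy that adds one worker when the alarm fires.
        "AlarmActions": [scale_up_policy_arn],
    }


# Possible usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**queue_backlog_alarm(name, arn))
```

Note that only a scale-up policy is attached here; scaling in is left to the workers themselves, as the section above argues.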

Ansible on deploy

By embedding an Ansible playbook in the Terraform config, instances created by the autoscaling policy can deploy themselves automatically:

The instance_profile uses the following IAM config:

NOTE: See the code repo for the ssh_keys auto deployment.

The script installs Ansible and copies down the playbook from the S3 bucket. It then calls Ansible locally to configure the instance.
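The three bootstrap steps just described could look roughly like the following. This is a Python sketch for illustration only; the real user script is most likely shell. The `pip` install method, the bucket and key arguments, and the `/tmp/playbook.yml` destination are all assumptions.

```python
import subprocess


def bootstrap_commands(bucket, key, dest="/tmp/playbook.yml"):
    """Return the commands for a self-deploying instance (hypothetical paths)."""
    return [
        # Assumption: Ansible is installed via pip; a distro package would also work.
        ["pip", "install", "ansible"],
        # Copy the playbook down from S3 using the instance profile's credentials.
        ["aws", "s3", "cp", f"s3://{bucket}/{key}", dest],
        # Run the playbook against the local machine only (no SSH).
        ["ansible-playbook", "--connection", "local",
         "--inventory", "localhost,", dest],
    ]


def bootstrap(bucket, key, run=subprocess.check_call):
    """Run each step in order; check_call stops at the first failing command."""
    for cmd in bootstrap_commands(bucket, key):
        run(cmd)
```

The trailing comma in `localhost,` tells Ansible to treat the value as an inline host list rather than an inventory file path.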

The Ansible play gathers some facts from the local environment and configures a set of systemd services and scripts.

The script simply calls a worker program with a job_id. This script is a good place to put any environment variables needed for the worker program to function.

The script long-polls the SQS job queue, pulls out the job_id, and passes it to the script. After the worker process has exited, it removes the job from the queue.
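A minimal sketch of that single poll-and-process attempt, assuming a boto3-style SQS client is passed in and that the message body is the bare job_id (the actual message format is not shown in this post). The function name `process_one` and the `run_job` callback are hypothetical.

```python
def process_one(sqs, queue_url, run_job):
    """Make one attempt to take a job off the queue; return its job_id or None."""
    # WaitTimeSeconds > 0 enables long polling; 20 seconds is the SQS maximum.
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,
    )
    messages = resp.get("Messages", [])
    if not messages:
        return None  # queue was empty for the whole poll

    msg = messages[0]
    job_id = msg["Body"]  # assumption: the body contains only the job_id

    # run_job blocks until the worker process has exited.
    run_job(job_id)

    # Only now is the job removed from the queue.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    return job_id


# Possible usage, assuming boto3 and a hypothetical worker wrapper script:
#   import boto3, subprocess
#   process_one(boto3.client("sqs"), queue_url,
#               lambda job_id: subprocess.call(["/usr/local/bin/run-worker", job_id]))
```

In this sketch, if `run_job` raises, the message is never deleted and becomes visible again after the queue's visibility timeout, so another worker can retry it.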

The script discovers its instance_id from the metadata service and then terminates itself via the autoscaling group, decrementing the desired capacity. This logic serves to “scale-in” the group when jobs are scarce.
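The self-termination step can be sketched as below, assuming a boto3-style Auto Scaling client is passed in and that the instance allows unauthenticated (IMDSv1) metadata requests; instances that enforce IMDSv2 would need a session token first. The function names are hypothetical.

```python
import urllib.request

# Standard EC2 instance metadata endpoint for the instance's own ID.
METADATA_URL = "http://169.254.169.254/latest/meta-data/instance-id"


def fetch_instance_id(url=METADATA_URL, timeout=2):
    """Ask the metadata service which instance this code is running on."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("ascii").strip()


def terminate_self(autoscaling, get_instance_id=fetch_instance_id):
    """Terminate this instance and shrink the group's desired capacity by one."""
    instance_id = get_instance_id()
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        # Without this flag the group would launch a replacement instance.
        ShouldDecrementDesiredCapacity=True,
    )
    return instance_id


# Possible usage, assuming boto3 is installed:
#   import boto3
#   terminate_self(boto3.client("autoscaling"))
```

Decrementing the desired capacity is what makes this a scale-in rather than a restart: the group ends up one worker smaller instead of replacing the terminated instance.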

How the script gets triggered abuses a bit of systemd magic.

The seppuku.service is enabled, but not started by Ansible. This means that the next time the server boots up, it will start the seppuku.service.

The worker.service relies on the fact that the script always exits after one attempt to process the queue. It is set up to always restart the service without delay, but if it restarts the service more than the StartLimitBurst value in StartLimitIntervalSec seconds, it will reboot the server, causing the seppuku.service to run, destroying the instance and “scaling-in” the worker pool.

Since AWS charges a full hour for any fractional hour, it is most cost-effective for the workers to work as much as possible. Therefore, the Ansible variables configure the service to restart up to 75 times in 50 minutes. This leaves about 10 minutes for the server to reboot and remove itself at the end of the first hour.
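To make the arithmetic concrete, the sketch below renders the restart-limit settings for a unit like worker.service from the two numbers above. The directive names are real systemd options, but the exact layout of the original unit file and the `ExecStart` path are assumptions.

```python
def restart_limit_unit(exec_start, max_restarts=75, window_minutes=50):
    """Render a systemd unit that reboots the host once the start limit is hit."""
    interval_sec = window_minutes * 60  # 50 minutes -> 3000 seconds
    return "\n".join([
        "[Unit]",
        f"StartLimitIntervalSec={interval_sec}",
        f"StartLimitBurst={max_restarts}",
        # Rebooting is what lets the enabled-but-not-started service run next.
        "StartLimitAction=reboot",
        "",
        "[Service]",
        f"ExecStart={exec_start}",  # hypothetical path supplied by the caller
        "Restart=always",
        "RestartSec=0",
    ])


print(restart_limit_unit("/usr/local/bin/worker-loop"))
```

With these values the limit trips only when 75 starts fit inside 3000 seconds, i.e. when attempts average under 40 seconds each, which is what happens when the queue is empty and each long poll returns quickly.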
