Behind the Scenes at Jungle Disk - Gateway Review
By: Jonathan Robertson
Back in October 2016, we had high expectations for the redesign of our Gateway infrastructure (originally discussed here).
The lab tests were looking good, and the updated design was solid, but the true test for any software service is how it performs over an extended period of time in production.
Here’s a breakdown of the before & after for design, stats, and interesting info over the last 6 months.
Old design:
- C# (.NET 2.0)
- Client state (information, connectivity, etc.) was stored on individual servers
- All clients in an account had to connect to the same server to communicate with each other, resulting in unbalanced load
- Which server a client connected to was determined by a formula based on account information and the number of Gateway servers online
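The old assignment scheme can be sketched as a deterministic hash. The exact formula was never published, so the account key and the FNV hash below are illustrative stand-ins, but they show why every client in an account piled onto the same server and why changing the server count reshuffled everyone:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// assignServer mimics the old scheme: a deterministic formula maps an
// account to exactly one Gateway, so every client in that account lands
// on the same server regardless of how loaded it is. FNV here is an
// illustrative stand-in for the real (unpublished) formula.
func assignServer(accountID string, numServers int) int {
	h := fnv.New32a()
	h.Write([]byte(accountID))
	return int(h.Sum32()) % numServers
}

func main() {
	// Every client in the account computes the same server index...
	fmt.Println(assignServer("acct-1234", 15))
	// ...but changing the server count remaps every account, which is
	// why scaling required restarting every customer's service.
	fmt.Println(assignServer("acct-1234", 16))
}
```

Because the mapping depends on the number of servers online, adding a server silently changed every account's answer, which matches the painful scaling story below.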
New design:
- Go 1.7
- Client state is stored in a Redis cluster; servers communicate with each other via Redis pub/sub
- Load balancing performed by an AWS Application Load Balancer
- Clients in the same account can be connected to any server and still maintain communication with each other
- Clients are distributed across servers based on load
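The core idea of the new design can be shown with a toy in-process model: a shared registry plays the role of the Redis cluster holding client state, and a relay plays the role of Redis pub/sub fanning a message out to whichever server holds the recipient. All names and structure here are illustrative assumptions, not Jungle Disk's actual code:

```go
package main

import "fmt"

// Server is one Gateway; Inbox collects messages delivered to its clients.
type Server struct {
	ID    int
	Inbox []string
}

// Relay stands in for the shared Redis state plus pub/sub fan-out.
type Relay struct {
	servers  map[int]*Server
	location map[string]int // clientID -> server ID (the "Redis" state)
}

func NewRelay() *Relay {
	return &Relay{servers: map[int]*Server{}, location: map[string]int{}}
}

func (r *Relay) AddServer(id int) *Server {
	s := &Server{ID: id}
	r.servers[id] = s
	return s
}

// Connect records which server a client landed on; with a load balancer
// in front, clients in the same account may land anywhere.
func (r *Relay) Connect(clientID string, serverID int) {
	r.location[clientID] = serverID
}

// Send looks up the recipient's server in shared state and "publishes"
// the message there, so cross-server communication just works.
func (r *Relay) Send(to, msg string) bool {
	sid, ok := r.location[to]
	if !ok {
		return false
	}
	r.servers[sid].Inbox = append(r.servers[sid].Inbox, msg)
	return true
}

func main() {
	r := NewRelay()
	r.AddServer(1)
	s2 := r.AddServer(2)
	// Two clients from the same account on different servers:
	r.Connect("alice", 1)
	r.Connect("bob", 2)
	r.Send("bob", "hello from alice")
	fmt.Println(s2.Inbox) // bob's message arrived on server 2
}
```

Because no server owns an account, any server can accept any client, which is what lets an off-the-shelf load balancer spread connections by load.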
Old stats:
- 15 server pairs (a Linux server managed the SSL connection due to an issue with the language/libraries in the Windows code, while a Windows server handled the remaining responsibilities)
- ‘Acceptable’ load up to 15,000 clients (usually crashed within a month), with stability degrading sharply as load approached 20,000 (usually crashed within a week)
- Scaling up the number of servers required modifying a state file and restarting every customer’s Jungle Disk service
New stats:
- 3 Linux servers (plus 1 small Linux server running a lightweight manager service), a Redis cluster (3 instances) to manage state, and an Application Load Balancer in AWS
- Rated as fully stable (no expectation of crashes) up to 35,000 clients per server in lab testing, and likely fine far beyond that number; lab clients submitted maxed-out message sizes nonstop and were far more active than any real Jungle Disk client could ever be
- Scaling up the number of servers is transparent to users and can be done at any time with zero risk of client disconnection
Old behavior:
- When overloaded, the old Gateways would crash
- If an old Gateway crashed, all clients in accounts ‘assigned’ to that Gateway lost their connections, and the Jungle Disk service had to be restarted manually on each user’s computer even after the Gateway was brought back online
- At least one Gateway crashed each month, though sometimes we had several Gateways crash in a single month
New behavior:
- In lab testing of the new Gateways, overload prevented the Server List from populating in the Jungle Disk Server Edition Management Client software, but did not cause a crash or disconnect existing clients
- If a new Gateway does crash, clients are automatically rebalanced to another server within a minute or two without any action on the users’ side
- We haven’t had a single new Gateway server crash since we deployed them to production (6 months at this point)
At the end of the day, this change has resulted in a faster, more reliable Gateway service with 100% uptime for our customers. It has performed as well as or better than we expected, and we’re incredibly happy with the outcome.