Behind the Scenes at Jungle Disk: Object-based Storage
This article discusses Jungle Disk’s approach to object naming/structuring and some of the reasons why we chose to approach it this way.
What do you mean by “object-based” storage?
We use object-based cloud storage for recording customer data and making it available on the Internet. This kind of storage is organized into containers, often called buckets. Inside a bucket, you can keep as many objects as you like.
What do you mean by “objects”?
Objects are most similar to ‘files’ on a local file system and are composed of 3 parts:
The object’s name (called a key by some providers) must uniquely identify the object, because object storage has no true nested hierarchy of the kind that folders provide. Many storage providers do, however, suggest a standard naming convention for keys that simulates a folder structure.
Looking up an object by that name/key allows you to download the data contained in that object.
Objects also have a way to store and access metadata (like file size, last update time, and custom fields that you can specify with most providers).
So how does Jungle Disk do things in object-based storage?
We encountered a couple of issues early on that led us in some interesting directions for structured object storage. We decided to have 2 types of objects: ‘dir’ and ‘file.’ This led us to the current structure, which looks like this:
Format: [parentGUID]/[thisGUID]/[type]/[name]/[size(bytes)]/[partNumber]/[metadata]
Example: 5ca3c457120881b629b15a3d85aecaa6/76a3c457a257b3d0b5af9b0d2db81aa9/file/screenshot.png/26769/0/mtime-1472504694-md5-52056e7aa1f9d1e89367aceafcb9ed36-wattr-32
Files & Directories (folders)
parentGUID: the globally unique ID of the containing folder (could be
thisGUID: the globally unique ID for this object
type: ‘file’ or ‘dir’, indicating if this is a directory or normal file
name: the original name representing the file or folder
size: the size in bytes of the file’s data
partNumber: used for multi-part files
metadata: holds info such as modified time, OS-provided attributes, and a hash (used to help confirm whether the file data has changed)
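To make the layout concrete, here is a minimal Python sketch that splits a key of this shape into its seven fields and joins them back together. The `ObjectKey` class and `parse_key` function are hypothetical names invented for this illustration, not Jungle Disk code, and the sketch assumes that names never contain a `/` (the article does not say how such names are handled).

```python
from dataclasses import dataclass

# The example key from the article, split across lines for readability.
EXAMPLE_KEY = (
    "5ca3c457120881b629b15a3d85aecaa6/"
    "76a3c457a257b3d0b5af9b0d2db81aa9/"
    "file/screenshot.png/26769/0/"
    "mtime-1472504694-md5-52056e7aa1f9d1e89367aceafcb9ed36-wattr-32"
)


@dataclass(frozen=True)
class ObjectKey:
    """Hypothetical container for the seven slash-separated key fields."""
    parent_guid: str
    this_guid: str
    kind: str          # 'file' or 'dir'
    name: str
    size: int          # size of the file's data in bytes
    part_number: int   # used for multi-part files
    metadata: str      # kept as an opaque string in this sketch

    def to_key(self) -> str:
        """Join the fields back into a single object key."""
        fields = [self.parent_guid, self.this_guid, self.kind, self.name,
                  str(self.size), str(self.part_number), self.metadata]
        return "/".join(fields)


def parse_key(key: str) -> ObjectKey:
    """Split a key into its fields (assumes no '/' inside the name)."""
    parts = key.split("/")
    if len(parts) != 7:
        raise ValueError(f"expected 7 segments, got {len(parts)}")
    parent_guid, this_guid, kind, name, size, part_number, metadata = parts
    if kind not in ("file", "dir"):
        raise ValueError(f"unknown type: {kind!r}")
    return ObjectKey(parent_guid, this_guid, kind, name,
                     int(size), int(part_number), metadata)


if __name__ == "__main__":
    parsed = parse_key(EXAMPLE_KEY)
    print(parsed.name, parsed.size, parsed.kind)
```

Parsing the example key and calling `to_key()` on the result should reproduce the original string, which is a handy sanity check for any code that builds keys.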
What are the benefits of this?
Storage providers offer a method to list only the objects matching a prefix. Placing our folder’s GUID at the start of an object’s key allows us to leverage this feature so we can limit what we’re listing to just that folder.
In other words, let’s say you have Folders A and B. Folder A has 10 million objects stored inside, while Folder B has 1,000. Receiving the list for your bucket would usually require you to load all 10,001,000 objects before continuing on, and transferring that much data through the Internet can take a while (several seconds, or maybe even a few minutes). This is a real bummer if we just want something from Folder B.
However, if we prefix our request with Folder B’s GUID, we can limit the scope of our listing request to just those 1,000 objects and probably see our object keys within a second or two.
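The effect of a prefix-limited listing can be sketched with a small in-memory stand-in for a bucket. The `list_keys` and `list_folder` functions below are hypothetical helpers written for this example. A real provider applies the filter on its side, so only the matching keys cross the network (in Amazon S3, for instance, this is the `Prefix` parameter of the list operation). The folder sizes are scaled down from the scenario above.

```python
from typing import Iterable, List


def list_keys(all_keys: Iterable[str], prefix: str = "") -> List[str]:
    """Stand-in for a provider's List call: return keys starting with prefix."""
    return [key for key in all_keys if key.startswith(prefix)]


def list_folder(all_keys: Iterable[str], folder_guid: str) -> List[str]:
    """List only the objects whose parentGUID is folder_guid."""
    # The trailing slash keeps one GUID from matching another GUID
    # that merely begins with the same characters.
    return list_keys(all_keys, prefix=folder_guid + "/")


# Two made-up folder GUIDs and a tiny bucket: 5 objects in A, 2 in B.
FOLDER_A = "a" * 32
FOLDER_B = "b" * 32
bucket = [f"{FOLDER_A}/{i:032x}/file/a{i}.txt/10/0/mtime-0" for i in range(5)]
bucket += [f"{FOLDER_B}/{i:032x}/file/b{i}.txt/10/0/mtime-0" for i in range(2)]

if __name__ == "__main__":
    for key in list_folder(bucket, FOLDER_B):
        print(key)
```

Because the folder's GUID is the first segment of every child's key, a single prefix is enough to select one folder's contents, however large its siblings are.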
Metadata as Part of the Key
Some storage providers don’t return the metadata in List operations. Since List is necessary for getting object keys, we have to wait on that to return before we can make requests to get each object’s metadata.
By storing the metadata we care about inside the object’s key, we receive all the information we need from the List command that we have to run anyway.
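As a rough illustration, the metadata segment of the example key can be decoded straight from a listing result, with no follow-up request per object. This sketch assumes the segment is a flat run of hyphen-separated field/value pairs, which is inferred from the single example in this article; the real format may be richer. `parse_metadata` and `metadata_from_key` are hypothetical names.

```python
from typing import Dict


def parse_metadata(segment: str) -> Dict[str, str]:
    """Decode 'field-value-field-value...' into a dict (assumed format)."""
    tokens = segment.split("-")
    if len(tokens) % 2 != 0:
        raise ValueError(f"unpaired metadata token in {segment!r}")
    # Even positions are field names, odd positions are their values.
    return dict(zip(tokens[0::2], tokens[1::2]))


def metadata_from_key(key: str) -> Dict[str, str]:
    """Pull the metadata out of the last slash-separated segment of a key."""
    return parse_metadata(key.rsplit("/", 1)[-1])


EXAMPLE_KEY = (
    "5ca3c457120881b629b15a3d85aecaa6/"
    "76a3c457a257b3d0b5af9b0d2db81aa9/"
    "file/screenshot.png/26769/0/"
    "mtime-1472504694-md5-52056e7aa1f9d1e89367aceafcb9ed36-wattr-32"
)

if __name__ == "__main__":
    for field, value in metadata_from_key(EXAMPLE_KEY).items():
        print(field, value)
```

Values are left as strings here, since the article does not spell out how each field is encoded.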
Separating the Name (key) and Data
This one is a little weird, but has some history to it. When uploading a file, we create a GUID for the file object. This gets loaded into the structure shown above ([parentGUID]/[thisGUID]/[type]...) and also loaded into
Since renaming isn’t a feature of object storage, doing so used to require one to download the object, rename it locally, upload the object, and delete the old object by name. Oh, and because we have the parentGUID (ID for the containing folder) in the object’s key, renaming would actually be necessary for a file/folder move as well.
Unfortunately, this takes time (especially for larger files) and money (storage providers charge for outgoing bandwidth). By reducing the ‘name’ to an object of 0 size and using part of that object’s key (thisGUID) to look up the object’s data, the cost in time and money of such a process could be reduced dramatically.
Nowadays, however, a better option exists: Copy. This provides a way to request object duplication on the provider’s side (on Amazon’s system, for example) without having to download/upload anything. With this approach, you’d issue a copy from one object to another (with the new key/name), then delete the old object. In the future, this is a change we’ll likely implement since the Copy feature is supported by nearly all storage providers now.
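The copy-then-delete rename can be sketched against a toy in-memory store. `InMemoryStore` and `rename` are hypothetical stand-ins written for this example; with a real provider, the copy and delete calls would be requests to its API, and the data would never leave the provider.

```python
from typing import Dict, List


class InMemoryStore:
    """Toy object store with the same operations a provider exposes."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def copy(self, src_key: str, dst_key: str) -> None:
        # A real provider duplicates the object on its own side.
        self._objects[dst_key] = self._objects[src_key]

    def delete(self, key: str) -> None:
        del self._objects[key]

    def keys(self) -> List[str]:
        return sorted(self._objects)


def rename(store: InMemoryStore, old_key: str, new_key: str) -> None:
    """Rename (or move) an object by copying it, then deleting the original."""
    if old_key == new_key:
        return  # nothing to do; copy-then-delete would destroy the object
    store.copy(old_key, new_key)
    # Delete only after the copy succeeds, so a failure in between
    # leaves a duplicate behind rather than losing the data.
    store.delete(old_key)


if __name__ == "__main__":
    store = InMemoryStore()
    store.put("folderA/guid1/file/old.png", b"image bytes")
    rename(store, "folderA/guid1/file/old.png", "folderB/guid1/file/new.png")
    print(store.keys())
```

The same function covers a move between folders, since in this scheme a move is just a rename that changes the parentGUID at the front of the key.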
This isn’t the only solution.
Many design choices will involve trade-offs. For example, our approach would be a bad solution if we wanted to use the storage providers’ versioning systems. These systems depend on the key staying the same, but our approach generates a new key for every object change/update.
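A small sketch shows why the key cannot stay stable. It assumes the `md5` field is the MD5 digest of the file's contents and uses a hypothetical `key_for` helper with shortened, made-up GUIDs; any edit to the data changes the size, hash, or modified time, and therefore the key.

```python
import hashlib


def key_for(parent_guid: str, this_guid: str, name: str,
            data: bytes, mtime: int) -> str:
    """Build a key in the article's format for a single-part file (sketch)."""
    md5 = hashlib.md5(data).hexdigest()  # assumed to hash the file contents
    return (f"{parent_guid}/{this_guid}/file/{name}/"
            f"{len(data)}/0/mtime-{mtime}-md5-{md5}")


# Same file, same GUIDs, same name: only the contents and mtime differ.
before = key_for("parent01", "object01", "notes.txt", b"first draft", 1472504694)
after = key_for("parent01", "object01", "notes.txt", b"second draft", 1472504700)

if __name__ == "__main__":
    print(before)
    print(after)
    print("keys match:", before == after)
```

The first four segments stay the same across the update, but a provider's versioning feature compares whole keys, so it would treat these as two unrelated objects.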
The great thing about object-based storage is that it’s so easy to manipulate. You have a blank canvas for the structure of your key, and you can configure it to meet your needs.