Behind the Scenes at Jungle Disk - Object-based Storage and the Structure of Objects

by Jonathan Robertson / Behind the Scenes, Product, Technical / Mar 13, 2017

This article discusses Jungle Disk’s approach to object naming/structuring and some of the reasons why we chose to approach it this way.

What do you mean by “object-based” storage?

We use object-based cloud storage to store customer data and make it available on the Internet. This kind of storage is organized into buckets (called containers by some providers). Inside a bucket, you can keep as many objects as you’d like.

What do you mean by “objects”?

Objects are most similar to ‘files’ on a local file system and are composed of 3 parts:

  1. A name (called a key by some providers). The name must uniquely identify the object, because object storage has no true nested hierarchy of the kind that folders would provide. Many storage providers do suggest a standard way of naming keys that simulates a folder structure, however.

  2. The data. Looking up an object by its name/key allows you to download the data contained in that object.

  3. Metadata, such as file size, last update time, and (with most providers) custom fields that you can specify.

So how does Jungle Disk do things in object-based storage?

We encountered a couple of issues early on that led us in some interesting directions for structured object storage. We decided to have 2 types of objects: ‘dir’ and ‘file.’ This led us to the current structure, which looks like this:

Format:		[parentGUID]/[thisGUID]/[type]/[name]/[size(bytes)]/[partNumber]/[metadata]
Example:	5ca3c457120881b629b15a3d85aecaa6/76a3c457a257b3d0b5af9b0d2db81aa9/file/screenshot.png/26769/0/mtime-1472504694-md5-52056e7aa1f9d1e89367aceafcb9ed36-wattr-32

Files & Directories (folders)

  • parentGUID: the globally unique ID of the containing folder (could be root)
  • thisGUID: the globally unique ID for this object
  • type: ‘file’ or ‘dir’, indicating if this is a directory or normal file
  • name: the original name representing the file or folder

Files Only

  • size: the size in bytes of the file’s data
  • partNumber: used for multi-part files
  • metadata: holds information such as modified time, OS-provided attributes, and a hash (used to help confirm whether file data has changed)
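
To make the field layout concrete, here is a minimal Python sketch that splits a ‘file’ key into the seven fields listed above and joins them back together. The names FileNameKey, parse_file_key, and build_file_key are hypothetical, and the sketch assumes that no field contains a “/” (this article does not say how such names are escaped). It is an illustration of the format, not our actual client code.

```python
from dataclasses import dataclass

# The example key from this article.
EXAMPLE_KEY = (
    "5ca3c457120881b629b15a3d85aecaa6/76a3c457a257b3d0b5af9b0d2db81aa9/"
    "file/screenshot.png/26769/0/"
    "mtime-1472504694-md5-52056e7aa1f9d1e89367aceafcb9ed36-wattr-32"
)


@dataclass(frozen=True)
class FileNameKey:
    """Fields of a 'file' key, in the order the format lists them."""
    parent_guid: str
    this_guid: str
    type: str
    name: str
    size: int         # size of the file's data, in bytes
    part_number: int
    metadata: str     # kept as the raw segment in this sketch


def parse_file_key(key: str) -> FileNameKey:
    # Assumption: no field contains "/", so a plain split yields 7 segments.
    parts = key.split("/")
    if len(parts) != 7:
        raise ValueError(f"expected 7 segments, got {len(parts)}")
    parent_guid, this_guid, kind, name, size, part_number, metadata = parts
    if kind != "file":
        raise ValueError(f"not a file key: type is {kind!r}")
    return FileNameKey(parent_guid, this_guid, kind, name,
                       int(size), int(part_number), metadata)


def build_file_key(fields: FileNameKey) -> str:
    # Inverse of parse_file_key: join the fields in the documented order.
    return "/".join([
        fields.parent_guid, fields.this_guid, fields.type, fields.name,
        str(fields.size), str(fields.part_number), fields.metadata,
    ])


parsed = parse_file_key(EXAMPLE_KEY)
rebuilt = build_file_key(parsed)
```

Parsing the example key and rebuilding it should give back the same string, which is a quick way to check that no field was lost along the way.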

What are the benefits of this?

Directory Listing

Storage providers offer a method to list only the objects matching a prefix. Placing our folder’s GUID at the start of an object’s key allows us to leverage this feature so we can limit what we’re listing to just that folder.

In other words, let’s say you have Folders A and B. Folder A has 10 million objects stored inside, while Folder B has 1,000. Receiving the list for your whole bucket would usually require you to load all 10,001,000 object keys before continuing on, and transferring that much data through the Internet can take a while (several seconds, or maybe even a few minutes). This is a real bummer if we just want something from Folder B.

However, if we prefix our request with Folder B’s GUID, we can limit the scope of our listing request to just those 1,000 objects and probably see our object keys within a second or two.
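
The Folder A / Folder B scenario can be sketched in a few lines of Python. This is only an in-memory stand-in: the function list_keys and the GUID values are made up for illustration, and Folder A is scaled down to 10,000 keys so the sketch runs quickly. With a real provider the filtering happens on the server, so only the matching keys cross the network.

```python
def list_keys(bucket_keys, prefix=""):
    """Hypothetical stand-in for a provider's List call with a prefix filter."""
    return [key for key in bucket_keys if key.startswith(prefix)]


# Made-up 32-character GUIDs for the two folders.
FOLDER_A = "a" * 32
FOLDER_B = "b" * 32

# Folder A is scaled down from 10 million to 10,000 objects for this sketch.
bucket = [f"{FOLDER_A}/{i:032x}/file/a{i}.txt/0/0/mtime-0" for i in range(10_000)]
bucket += [f"{FOLDER_B}/{i:032x}/file/b{i}.txt/0/0/mtime-0" for i in range(1_000)]

# Unscoped listing: every key in the bucket comes back.
everything = list_keys(bucket)

# Scoped listing: the trailing "/" keeps the match on the whole parentGUID.
folder_b_only = list_keys(bucket, prefix=FOLDER_B + "/")
```

Here everything holds all 11,000 keys, while folder_b_only holds just the 1,000 keys whose parentGUID is Folder B.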

Metadata as Part of the Key

Some storage providers don’t return the metadata in List operations. Since List is necessary for getting object keys, we have to wait on that to return before we can make requests to get each object’s metadata.

By storing the metadata we care about inside of the object’s key, we receive all the information we need from the List command that we have to run anyway.
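
As a rough illustration, the metadata segment of the example key shown earlier (mtime-1472504694-md5-52056e7aa1f9d1e89367aceafcb9ed36-wattr-32) can be read straight out of a listed key with no extra request. The Python sketch below assumes the segment is a series of alternating field names and values separated by “-”, which is inferred from that one example rather than from a full specification; parse_metadata is a hypothetical helper name.

```python
def parse_metadata(segment: str) -> dict:
    """Split a metadata segment into {field: value} pairs.

    Assumption: the segment alternates field name, value, field name, value,
    all joined by "-", and no value contains "-" itself.
    """
    tokens = segment.split("-")
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of '-'-separated tokens")
    # Even positions are field names, odd positions are their values.
    return dict(zip(tokens[0::2], tokens[1::2]))


# The last "/"-separated segment of a listed key is the metadata.
listed_key = (
    "5ca3c457120881b629b15a3d85aecaa6/76a3c457a257b3d0b5af9b0d2db81aa9/"
    "file/screenshot.png/26769/0/"
    "mtime-1472504694-md5-52056e7aa1f9d1e89367aceafcb9ed36-wattr-32"
)
metadata = parse_metadata(listed_key.rsplit("/", 1)[1])
```

Values are left as strings here; how each one should be interpreted (for example, whether mtime is a Unix timestamp) is not stated in this article.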

Separating the Name (key) and Data

This one is a little weird, but has some history to it. When uploading a file, we create a GUID for the file object. This GUID gets loaded into the structure shown above ([parentGUID]/[thisGUID]/[type]...), and the file’s data is stored under FILES/[thisGUID]/[partNumber].

Since renaming isn’t a feature of object storage, doing so used to require one to download the object, rename it locally, upload the object, and delete the old object by name. Oh, and because we have the parentGUID (ID for the containing folder) in the object’s key, renaming would actually be necessary for a file/folder move as well.

Unfortunately, this takes time (especially for larger files) and money (storage providers charge for outgoing bandwidth). By reducing the ‘name’ to a zero-size object and using part of that object’s key (thisGUID) to look up the object’s data, the cost in time and money of a rename or move could be reduced dramatically.
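
Here is a minimal sketch of that idea in Python, using a plain dict as a stand-in for a bucket. The helpers data_key and move_or_rename are hypothetical names, and the sketch assumes a seven-segment ‘file’ key with no “/” inside any field. The point to notice is that only the zero-size name object is rewritten, while the data under FILES/ is never downloaded or re-uploaded.

```python
def data_key(this_guid: str, part_number: int = 0) -> str:
    # Where the file's bytes live, per the FILES/[thisGUID]/[partNumber] layout.
    return f"FILES/{this_guid}/{part_number}"


def move_or_rename(store: dict, old_name_key: str,
                   new_parent_guid=None, new_name=None) -> str:
    """Replace the zero-size name object; leave the data object alone."""
    if old_name_key not in store:
        raise KeyError(old_name_key)
    parts = old_name_key.split("/")
    if len(parts) != 7:
        raise ValueError("expected a 7-segment file key")
    if new_parent_guid is not None:
        parts[0] = new_parent_guid   # a move changes the parentGUID
    if new_name is not None:
        parts[3] = new_name          # a rename changes the name
    new_name_key = "/".join(parts)
    if new_name_key == old_name_key:
        return new_name_key          # nothing to do; keep the only name object
    store[new_name_key] = b""        # upload a new zero-size name object
    del store[old_name_key]          # delete the old one
    return new_name_key


# Made-up GUIDs and contents for illustration.
PARENT, OTHER_PARENT, THIS = "p" * 32, "q" * 32, "t" * 32
old_key = f"{PARENT}/{THIS}/file/report.txt/11/0/mtime-0"
store = {old_key: b"", data_key(THIS): b"hello world"}

new_key = move_or_rename(store, old_key,
                         new_parent_guid=OTHER_PARENT, new_name="final.txt")
```

After the call, the data object is untouched and can still be found from the thisGUID segment of the new key.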

Nowadays, however, a better option exists: COPY. This provides a way to request object duplication on Amazon’s system without having to download/upload anything. With this approach, you’d issue a copy from one object to another (with the new key/name), then delete the old object. In the future, this is a change we’ll likely implement since the Copy feature is supported by nearly all storage providers now.
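
The copy-then-delete sequence can be sketched in Python as follows. InMemoryStore is a made-up stand-in for a bucket so the example runs on its own; it is not a real client library. With Amazon S3 through boto3, the two steps would correspond to the client’s copy_object and delete_object calls, with the copy carried out on the provider’s side.

```python
class InMemoryStore:
    """Hypothetical stand-in for a bucket that supports server-side copy."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def copy(self, source_key: str, dest_key: str) -> None:
        # The provider duplicates the object; no bytes pass through the client.
        self._objects[dest_key] = self._objects[source_key]

    def delete(self, key: str) -> None:
        del self._objects[key]

    def keys(self):
        return sorted(self._objects)


def rename_with_copy(store: InMemoryStore, old_key: str, new_key: str) -> None:
    """Rename an object by copying it to the new key, then deleting the old one."""
    if old_key == new_key:
        return  # copying onto itself and then deleting would lose the object
    store.copy(old_key, new_key)
    store.delete(old_key)


bucket = InMemoryStore()
bucket.put("folder-a/screenshot.png", b"\x89PNG...")
rename_with_copy(bucket, "folder-a/screenshot.png", "folder-b/screenshot.png")
```

Doing the copy first means the object always exists under at least one key, even if the delete step fails and has to be retried.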

This isn’t the only solution.

Many design choices will involve trade-offs. For example, our approach would be a bad solution if we wanted to use the storage providers’ versioning systems. These systems depend on the key staying the same, but our approach generates a new key for every object change/update.

The great thing about object-based storage is that it’s so easy to manipulate. You have a blank canvas for the structure of your key, and you can configure it to meet your needs.

Jungle Disk Team