Sharding in Distributed Computing

First, you have to understand what a distributed database system actually means.

There are several types of “distribution”:

  • Every instance is a clone of every other, with only minimal inconsistency tolerated during broadcasting of updates to all the instances. This is how Blockchains are implemented, as well as a few other types of distributed db’s. In terms of storage, the main purpose of this type of distribution is redundancy and optimizing read performance in the case of widely distributed db’s such as blockchain, as you can read an instance “near you” in terms of networking.
  • Each instance, or each instance group (more typically the latter), has a distinct set of data. Instances within groups may be clones of each other, or have data-slices that are in common in a sort of RAID-ish fashion (ie, Cassandra rings), but in most types of distribution, a particular piece of data may reside in only one group.
  • There are a few others as well, but they’re less common than the above.

If you have any sort of distribution where each instance or group isn’t a complete copy, you need some sort of rules for sending a particular piece of data to a particular storage pool in your distributed world.

This is where sharding comes in.

You can have sharding that’s basically hidden from applications, although it’s never completely hidden from schema designers, as the choice of shard key has rather profound impacts on overall performance (even in NoSQL distributed worlds).

Some relational DB’s have tooling that to an extent hides sharding from applications, or have the ability to use SQL “proxies” that implement sharding rules. But the worlds I’ve seen use policy or application-level sharding, particularly for relational DB worlds, such as the several hundred MySQL shard stacks we have at Fitbit.

Policy sharding would be something like using “Mod(UserID, ShardCount)” to get a “ShardID”, which would map to a particular instance using some type of lookup table. Or something as simple as having every enterprise customer’s data in a separate DB instance stack, so you have a world that is de facto sharded by Customer. You can get arbitrarily complex with this approach, but the main idea is that shard awareness isn’t at the DB level.

Author: Aditya Bhuyan

I am an IT Professional with close to two decades of experience. I mostly work in open source application development and cloud technologies. I have expertise in Java, Spring and Cloud Foundry.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s