First, you have to understand what a distributed database system actually means.
There are several types of “distribution”:
- Every instance is a clone of every other, with only minimal inconsistency tolerated during broadcasting of updates to all the instances. This is how are implemented, as well as a few other types of distributed db’s. In terms of storage, the main purpose of this type of distribution is redundancy and optimizing read performance in the case of widely distributed db’s such as blockchain, as you can read an instance “near you” in terms of networking.
- Each instance, or each instance group (more typically the latter), has a distinct set of data. Instances within groups may be clones of each other, or have data-slices that are in common in a sort of -ish fashion (ie, Cassandra rings), but in most types of distribution, a particular piece of data may reside in only one group.
- There are a few others as well, but they’re less common than the above.
If you have any sort of distribution where each instance or group isn’t a complete copy, you need some sort of rules for sending a particular piece of data to a particular storage pool in your distributed world.
This is where sharding comes in.
You can have sharding that’s basically hidden from applications, although it’s never completely hidden from schema designers, as the choice of shard key has rather profound impacts on overall performance (even in NoSQL distributed worlds).
Some relational DB’s have tooling that to an extent hides sharding from applications, or have the ability to use SQL “proxies” that implement sharding rules. But the worlds I’ve seen use policy or application-level sharding, particularly for relational DB worlds, such as the several hundred MySQL shard stacks we have at Fitbit.
Policy sharding would be something like using “Mod(UserID, ShardCount)” to get a “ShardID”, which would map to a particular instance using some type of lookup table. Or something as simple as having every enterprise customer’s data in a separate DB instance stack, so you have a world that is de facto sharded by Customer. You can get arbitrarily complex with this approach, but the main idea is that shard awareness isn’t at the DB level.