Improving SQL Scalability: Understanding Data Sharding

Jul 21, 2021

Introduction:-

Business organizations of all sizes are embracing rapid modernization of user-facing apps as part of their broader digital transformation strategy. The relational database infrastructure on which such apps rely suddenly needs to support much larger data sizes and transaction volumes. However, a monolithic RDBMS tends to get overloaded quickly in such scenarios.

One of the most common architectural patterns used to scale an RDBMS is to “shard” the data. This blog gives a brief overview of what sharding is and how it can be used to scale SQL databases. We will also review the pros and cons of common sharding architectures and explore how sharding is implemented in distributed SQL databases.

What is Data Sharding?

Sharding is the process of breaking up large tables into smaller chunks called shards that are spread across multiple servers. A shard is essentially a horizontal data partition that contains a subset of the total data set and hence is responsible for serving a portion of the overall workload.

The idea is to distribute data that can’t fit on a single node onto a cluster of database nodes. Sharding is also referred to as horizontal partitioning. The distinction between horizontal and vertical comes from the basic tabular view of a database. A database can be split either vertically, by storing different table columns in separate databases, or horizontally, by storing rows of the same table on multiple database nodes.
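The difference between the two kinds of split can be sketched in a few lines of Python. This is only an illustration: the users table, its columns, and the modulo-based row placement are assumptions made for the example, not part of any particular database.

```python
# Toy table: each dict is one row of a hypothetical "users" table.
users = [
    {"id": 1, "name": "Ana", "email": "ana@example.com", "country": "US"},
    {"id": 2, "name": "Ben", "email": "ben@example.com", "country": "UK"},
    {"id": 3, "name": "Chloe", "email": "chloe@example.com", "country": "DE"},
    {"id": 4, "name": "Dev", "email": "dev@example.com", "country": "US"},
]


def split_vertically(rows, column_groups, key="id"):
    """Vertical partitioning: every partition keeps all rows but only some columns.

    The key column is repeated in each partition so rows can be rejoined later.
    """
    return [
        [{key: row[key], **{col: row[col] for col in group}} for row in rows]
        for group in column_groups
    ]


def split_horizontally(rows, num_shards, key="id"):
    """Horizontal partitioning (sharding): every shard keeps all columns but only some rows."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        # Assumed placement rule for the example: key modulo shard count.
        shards[row[key] % num_shards].append(row)
    return shards


vertical = split_vertically(users, [["name", "email"], ["country"]])
horizontal = split_horizontally(users, num_shards=2)

print(vertical[1])    # all four rows, but only the id and country columns
print(horizontal[0])  # full rows, but only those whose id is even
```

Each vertical partition still holds four rows, while each horizontal shard holds complete rows for only a subset of the ids.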

Why Shard a Database?

Business applications that rely on a monolithic RDBMS hit bottlenecks as they grow. With limited CPU, storage capacity, and memory, query throughput and response times are bound to suffer. When it comes to adding resources to support database operations, vertical scaling has its own set of limits and eventually reaches a point of diminishing returns.

On the other hand, horizontally partitioning a table means more compute capacity to serve incoming queries, so you end up with faster query response times and index builds. By continuously balancing the load and data set over additional nodes, sharding also makes it possible to put newly added capacity to use. Moreover, a network of smaller, cheaper servers may be more cost-effective in the long term than maintaining one big server.

Besides resolving scaling challenges, sharding can potentially alleviate the impact of unplanned outages. All the data in an unsharded database is inaccessible during downtime, which can be disruptive or downright disastrous. When done right, sharding can ensure high availability: even if one or two nodes hosting a few shards are down, the rest of the database is still available for read/write operations as long as the other nodes run in different failure domains. Overall, sharding can increase the total cluster storage capacity, speed up processing, and offer higher availability at a lower cost than vertical scaling.

The Perils Of Manual Sharding:-

Sharding, including day-1 shard creation and day-2 shard rebalancing, can be a boon to high-volume data apps when it is completely automated. Unfortunately, monolithic databases like Oracle, PostgreSQL, and MySQL, and even newer distributed SQL databases like Amazon Aurora, do not support automated sharding.

This means manual sharding at the application layer has to be performed if the user wants to continue to use these databases. The net result is a massive increase in development complexity. The application now needs additional sharding logic to know exactly how the data is distributed and how to fetch it. The user also has to decide which sharding approach to adopt, how many shards to create, and how many nodes to use. The user must also account for changes to the shard key, or even to the sharding approach itself, if business needs change.

One of the most significant challenges with manual sharding is uneven shard allocation. Disproportionate distribution of data could cause shards to become unbalanced, with some overloaded while others remain relatively empty. It’s best to avoid accruing too much data on a shard, as a hotspot can lead to slowdowns and server crashes.

This problem could also arise from a small shard set, which forces data to be spread across too few shards. This is acceptable in development and testing environments, but in production, uneven data distribution, hotspots, and storing data on too few shards can all exhaust shard and server resources.
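How uneven a layout is can be estimated before it causes trouble. The sketch below is a hypothetical example: the workload (1,000 orders, 70% of them from one customer), the helper names, and the modulo placement are all assumptions chosen to make the skew visible.

```python
from collections import Counter


def shard_sizes(keys, num_shards, shard_for):
    """Count how many keys land on each shard under a given placement function."""
    counts = Counter(shard_for(k, num_shards) for k in keys)
    return [counts.get(i, 0) for i in range(num_shards)]


def skew_ratio(sizes):
    """Largest shard size divided by the mean shard size (1.0 means perfectly even)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean if mean else 0.0


def modulo(key, num_shards):
    # Assumed placement rule for the example.
    return key % num_shards


# Hypothetical workload: customer 7 places 70% of all orders.
orders = [
    {"order_id": i, "customer_id": 7 if i % 10 < 7 else i}
    for i in range(1000)
]

by_customer = shard_sizes([o["customer_id"] for o in orders], 4, modulo)
by_order = shard_sizes([o["order_id"] for o in orders], 4, modulo)

print("sharded by customer_id:", by_customer, "skew", skew_ratio(by_customer))
print("sharded by order_id:   ", by_order, "skew", skew_ratio(by_order))
```

Under these assumptions, keying on customer_id puts 800 of the 1,000 rows on a single shard, while keying on order_id spreads them evenly. The trade-off is that all of one customer's orders no longer live on the same shard.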

Finally, manual sharding can complicate operational processes. Backups will now have to be performed for multiple servers separately. Data migration and schema changes must be done carefully and with proper coordination to ensure all shards have the same schema copy. Without enough optimization, database joins across multiple servers could be highly inefficient and difficult to perform.

Common Auto-sharding Architectures:-

Sharding has been around for a long time, and over the years different sharding architectures and implementations have been used to build large-scale systems. In this part, we will go over three of the most common auto-sharding architectures.

Hash Sharding:-

Hash sharding takes a shard key’s value and generates a hash value from it. The hash value is then used to determine in which shard the data should reside. With a uniform hashing algorithm, the hash function can distribute data evenly across servers, reducing the risk of hotspots. With this approach, rows with close shard keys are unlikely to be placed on the same shard. This architecture is thus great for targeted operations on individual keys, while a query over a range of keys may have to touch many shards.
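A minimal sketch of hash-based placement is shown below. The function name hash_shard, the choice of SHA-256, and the "user-N" key format are assumptions for illustration; real databases use their own hash functions and usually map hash ranges to shards rather than taking a plain modulo. A stable hash is used here because Python's built-in hash() for strings is randomized per process.

```python
import hashlib
from collections import Counter


def hash_shard(shard_key: str, num_shards: int) -> int:
    """Map a shard key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo the shard count.
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical keys: sequential user ids, which would all be "close" under range sharding.
keys = [f"user-{i}" for i in range(10_000)]
counts = Counter(hash_shard(k, 4) for k in keys)

for shard in sorted(counts):
    print(f"shard {shard}: {counts[shard]} keys")
```

Each shard should end up with roughly a quarter of the keys even though the keys are sequential. One caveat of the plain modulo used here is that changing the number of shards remaps most keys, which is why production systems tend to use schemes such as consistent hashing or fixed hash ranges instead.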

Range Sharding:-

Range sharding divides data based on ranges of the data value. Shard keys with nearby values are more likely to fall into the same range and onto the same shards. Each shard essentially preserves the same schema from the original database. Sharding becomes as easy as identifying the data’s appropriate range and placing it on the corresponding shard.

Range sharding allows for efficient range queries, that is, queries that read data within a contiguous range of keys. However, without careful shard key selection, data could be unevenly distributed across shards, and some data could be queried far more often than the rest, creating potential bottlenecks in the shards that get the heavier workload.
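The lookup side of range sharding can be sketched with a sorted list of split points and a binary search. The split points, key values, and function names below are hypothetical and only illustrate the idea.

```python
from bisect import bisect_right

# Hypothetical split points: shard 0 holds keys < "g", shard 1 holds "g" <= key < "n",
# shard 2 holds "n" <= key < "t", and shard 3 holds keys >= "t".
SPLIT_POINTS = ["g", "n", "t"]


def range_shard(key, split_points=SPLIT_POINTS):
    """Return the index of the shard whose key range contains `key`."""
    return bisect_right(split_points, key)


def shards_for_range(low, high, split_points=SPLIT_POINTS):
    """Return the shards that must be read for a query with low <= key <= high."""
    first = range_shard(low, split_points)
    last = range_shard(high, split_points)
    return list(range(first, last + 1))


print(range_shard("alice"))         # a key below "g"
print(range_shard("mike"))          # a key between "g" and "n"
print(shards_for_range("h", "k"))   # narrow range: a single shard
print(shards_for_range("e", "p"))   # wider range: several adjacent shards
```

Because neighbouring keys share a shard, a narrow range query is served by one shard, and a wider one only touches adjacent shards. The same property is what concentrates load on one shard when many requests target the same part of the key space.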

The ideal solution to uneven shard sizes is to perform automatic shard splitting and merging. If a shard becomes too large or hosts a frequently accessed row, breaking it into multiple shards and rebalancing them across all the available nodes provides better performance. Similarly, the opposite process can be undertaken when there are too many small shards.
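The splitting half of this idea can be sketched as follows. This is a toy, in-memory model under stated assumptions: the class name RangeShardedIndex, the tiny MAX_KEYS_PER_SHARD threshold, and the split-at-the-median policy are all hypothetical, keys are assumed to be unique, and merging and rebalancing across nodes are left out.

```python
from bisect import bisect_right, insort

MAX_KEYS_PER_SHARD = 4  # hypothetical threshold, kept tiny so splits are easy to see


class RangeShardedIndex:
    """Toy range-sharded key set that splits a shard once it grows too large."""

    def __init__(self):
        self.split_points = []  # sorted lower bounds of shards 1..n
        self.shards = [[]]      # each shard is a sorted list of unique keys

    def insert(self, key):
        idx = bisect_right(self.split_points, key)
        insort(self.shards[idx], key)
        if len(self.shards[idx]) > MAX_KEYS_PER_SHARD:
            self._split(idx)

    def _split(self, idx):
        keys = self.shards[idx]
        mid = len(keys) // 2
        left, right = keys[:mid], keys[mid:]
        # Replace the oversized shard with its two halves and record the new boundary.
        self.shards[idx:idx + 1] = [left, right]
        self.split_points.insert(idx, right[0])


index = RangeShardedIndex()
for key in range(1, 11):
    index.insert(key)

print(index.split_points)
print(index.shards)
```

Inserting the keys 1 through 10 in order triggers three splits, and every new key lands on the last shard. That is the hotspot pattern that monotonically increasing keys tend to produce under range sharding.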

Geo-Partitioning:-

In geo-based sharding, data is first partitioned according to a user-specified column that maps range shards to specific regions and the nodes in those regions. Inside a given region, data is then sharded using either hash or range sharding. For example, a cluster that runs across three regions in the US, the UK, and the EU can rely on the Country_Code column of the User table to map each user’s row to the nearest region that is in conformance with GDPR rules.
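The two-level placement described above might look like the sketch below. The country-to-region table, the region labels, the number of shards per region, and the function place_user are assumptions for illustration; hash sharding is used inside each region, though range sharding would work equally well.

```python
import hashlib

# Hypothetical mapping from Country_Code values to the three regions in the example.
REGION_BY_COUNTRY = {"US": "us", "GB": "uk", "DE": "eu", "FR": "eu"}
SHARDS_PER_REGION = 4  # assumed to be the same in every region


def place_user(country_code: str, user_id: int):
    """Return (region, shard index within that region) for a row of the User table."""
    # Level 1: the Country_Code column pins the row to a region.
    # An unmapped country raises KeyError rather than silently picking a region.
    region = REGION_BY_COUNTRY[country_code]
    # Level 2: hash sharding inside the chosen region.
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % SHARDS_PER_REGION
    return region, shard


print(place_user("DE", 42))
print(place_user("US", 42))
```

Failing loudly on an unmapped country code is a deliberate choice in this sketch: when placement has compliance implications, a default region is arguably riskier than an error.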

Final words:-

Data sharding is a solution for business applications with large data sets and scale needs. There are various sharding architectures to choose from, and each of them has different capabilities. Before choosing and setting up a sharding architecture, an organization should write down the needs and workload requirements of its applications.

Manual sharding should be avoided in most circumstances, given the significant increase it causes in the complexity of the application’s logic. CodeStore Technologies provides the best and the most efficient auto-shard distributed SQL database support and management solution to clients all over the globe.

If you have any requirements related to data sharding then feel free to share them at sales@codestoresolutions.com.