For most projects you work on, database partitioning might not even be a requirement. That has been the case for most of mine, and I have been writing code for some time now. But one project I worked on needed to split data across multiple tenants easily, and since we were already on PostgreSQL, diving deeper into the topic gave me some good insights. This article is written purely for my own reference, but if it helps someone else, it's a win 💯
Intro
Database partitioning is the process of breaking a large database table into smaller, more manageable pieces. Done well, it improves performance: your queries no longer run against a billion records, only against the relevant subset of the data.
Two key reasons to partition are to simplify the maintenance of large data sets and to potentially improve query performance when you are only interested in a subset of the data.
PostgreSQL has very high hard limits, which give you a lot of wiggle room:
| Item | Upper Limit | Comment |
|---|---|---|
| database size | unlimited | |
| number of databases | 4,294,950,911 | |
| relations per database | 1,431,650,303 | |
| relation size | 32 TB | with the default BLCKSZ of 8192 bytes |
| rows per table | limited by the number of tuples that can fit onto 4,294,967,295 pages | |
| columns per table | 1600 | further limited by tuple size fitting on a single page |
| columns in a result set | 1664 | |
| field size | 1 GB | |
| identifier length | 63 bytes | can be increased by recompiling PostgreSQL |
| indexes per table | unlimited | constrained by maximum relations per database |
| columns per index | 32 | can be increased by recompiling PostgreSQL |
| partition keys | 32 | can be increased by recompiling PostgreSQL |
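The relation-size limit above assumes the default block size. If you want to confirm what your own cluster was built with, you can ask the server directly:

-- Reports the page size this PostgreSQL build uses (8192 bytes by default)
SHOW block_size;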
Recently I had to look into optimizing a relation that would store millions of rows of transactional data. We anticipated this relation would grow exponentially, causing query performance problems within a few months.
While doing my research I looked at sharding as an option, and since this project used PostgreSQL as the primary data store, I investigated what patterns were available to us.
This led me to a really great 2021 talk on YouTube by Jimmy Angelakos. It gave me some great insights into what partitioning is and how it's implemented within PostgreSQL.
Declarative Partitioning
With PostgreSQL, we declare that a table is divided into partitions. The DDL that creates the table specifies the partitioning method and the column(s) that make up the partition key.
PostgreSQL has 3 types of partitioning:

- Range Partitioning
- List Partitioning
- Hash Partitioning
The example code snippets are from the official PostgreSQL documentation - https://www.postgresql.org/docs/15/ddl-partitioning.html
-- Parent table: rows are routed to partitions by the value of logdate
CREATE TABLE measurement (
    city_id int not null,
    logdate date not null,
    peaktemp int,
    unitsales int
) PARTITION BY RANGE (logdate);

-- One partition per month; the upper bound is exclusive
CREATE TABLE measurement_y2006m02 PARTITION OF measurement
    FOR VALUES FROM ('2006-02-01') TO ('2006-03-01');

CREATE TABLE measurement_y2006m03 PARTITION OF measurement
    FOR VALUES FROM ('2006-03-01') TO ('2006-04-01');
In the above example, we are partitioning the measurement table by month. Range partitioning lets you partition data that falls between a range of values; these can be dates or numbers (e.g., city_id between 100 and 1000).
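As a minimal sketch of a numeric range, here is what that might look like (the orders table and its customer_id column are hypothetical, not from the docs):

-- Hypothetical table partitioned on a numeric column
CREATE TABLE orders (
    customer_id int not null,
    total numeric
) PARTITION BY RANGE (customer_id);

-- Holds rows with customer_id >= 100 and < 1000
CREATE TABLE orders_c100_999 PARTITION OF orders
    FOR VALUES FROM (100) TO (1000);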
Two other popular mechanisms are partitioning by a list of values or by hash.
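For reference, here is roughly what the DDL looks like for those two methods (the table and column names below are illustrative, not from the docs):

-- List partitioning: each partition owns an explicit set of key values
CREATE TABLE customers (
    id int not null,
    region text not null
) PARTITION BY LIST (region);

CREATE TABLE customers_emea PARTITION OF customers
    FOR VALUES IN ('EU', 'UK', 'MEA');

-- Hash partitioning: rows are spread across partitions by a hash of the key
CREATE TABLE events (
    id bigint not null
) PARTITION BY HASH (id);

-- One of 4 hash partitions; remainders 1 to 3 would be created the same way
CREATE TABLE events_p0 PARTITION OF events
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);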
One big win from partitions is the ability to move them to a different tablespace, or to detach and drop them, without the overhead of deleting millions of records with the DELETE command, which has a few side effects.
-- Detach the partition from the parent, then drop it as a regular table
ALTER TABLE measurement DETACH PARTITION measurement_y2006m02;
DROP TABLE measurement_y2006m02;

/* This also means that we can add new partitions easily */
CREATE TABLE measurement_y2008m02 PARTITION OF measurement
    FOR VALUES FROM ('2008-02-01') TO ('2008-03-01')
    TABLESPACE fasttablespace;
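The reverse also works: a table with a matching structure (for example, one that was detached earlier, or bulk-loaded separately) can be attached as a partition:

-- Re-attach a previously detached (or separately prepared) table
ALTER TABLE measurement ATTACH PARTITION measurement_y2006m02
    FOR VALUES FROM ('2006-02-01') TO ('2006-03-01');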
However, we need to keep in mind the following limitations of partitioned tables:
- Unique/primary key constraints cannot include expressions or function calls in partition keys and must include all partition key columns (see the sketch after this list).
- Exclusion constraints can only be applied to individual leaf partitions.
- BEFORE ROW triggers on INSERT cannot change the destination partition for a new row.
- Mixing temporary and permanent relations in the same partition tree is not allowed, and temporary relations must belong to the same session.
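As a quick illustration of the first limitation, a primary key on the measurement table from earlier must include logdate, the partition key:

-- Works: the partition key (logdate) is part of the primary key
ALTER TABLE measurement ADD PRIMARY KEY (city_id, logdate);

-- Fails: the partition key column is missing from the constraint
-- ALTER TABLE measurement ADD PRIMARY KEY (city_id);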
Considerations before opting to use Declarative Partitioning:
- Careful design is crucial to avoid negative performance effects on query planning and execution.
- Choosing the partition key based on commonly used WHERE clauses lets the planner prune unneeded partitions (see the sketch after this list).
- Removal of unwanted data can be optimized by placing it in a single partition.
- Choosing the target number of partitions is important to balance index size and query planning time.
- Sub-partitioning and range partitioning with multiple columns can help with uneven data distribution.
- The query planner can handle partition hierarchies with a few thousand partitions if pruning is effective.
- A large number of partitions can result in longer planning times, higher memory consumption, and slower re-partitioning.
- Data warehouse workloads may benefit from a larger number of partitions.
- Simulations can help optimize a partitioning strategy; never assume more partitions are better.
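To see pruning in action against the measurement table from earlier: with a WHERE clause on the partition key, EXPLAIN should show only the matching partition being scanned, not every partition.

-- Pruning is on by default; shown here for completeness
SET enable_partition_pruning = on;

-- The plan should scan only measurement_y2006m02
EXPLAIN SELECT * FROM measurement WHERE logdate = '2006-02-15';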
Under the hood, PostgreSQL uses table inheritance to implement partitioning. If we need more control over the partition key, with more constraints and rules around it, inheritance-based partitioning works best. We will explore that in a different article.