SQL databases (RDMBS) are versatile data storage platforms, but have historically become a scalability bottleneck due to their inability to scale across multiple servers. However, the theory of scaling out SQL is now well-understood and is starting to see successful implementations.
In this talk - based on his years of work on Postgres and at Citus Data working on the Citus distributed database - Marco shows one aspect of scaling a SQL database: Distributing the computation of queries across many servers.
The relational algebra underlying SQL can be extended to represent queries on distributed tables. Commutativity and distributivity rules can be used to optimise the execution plan of a query on a distributed table in order to minimise network traffic and maximise parallelism. The resulting relational algebra tree can be converted back into a set of SQL queries that can be executed on individual shards, in parallel, to give the end result. This means that a single-server SQL database can act as a building block to a distributed SQL database.
PostgreSQL is uniquely positioned for implementing this model because of its extension API, which allows key parts of its query planning and execution pipeline to be overridden. Through an extension, PostgreSQL can simultaneously acts as a distributed SQL engine, and as the underlying data storage and processing layer.
SQL databases allow many different applications to be developed with the same underlying data store, but they are often regarded as a scalability bottleneck due to their inability to scale across multiple servers. In response, a wide range of functionally-limited databases have been developed that scale out a specific data model or use case. This means businesses now have to deploy and operate many complex storage systems— and then also manage data across these different storage systems. If a SQL database could scale, it would make many such systems obsolete. Fortunately, scaling SQL is possible, it's just hard.
Scaling SQL requires us to revisit the relational algebra on which SQL is based and add the notion of distribution. By adding operators such as "Collect", we can construct a relational algebra tree that represents a distributed query plan. Through commutativity and distributivity rules, the relational algebra tree can be optimised by pushing down computational work to the computers that store the data, and limiting the amount of data that flows over the network. To execute the distributed query plan, a key observation is that the relation algebra tree can be transformed back into a set of local SQL queries that could be executed by a single-node SQL database.
Among open source databases, PostgreSQL is surprisingly and uniquely suited for scaling out SQL. It has pluggable interfaces for various stages of query planning and execution, which means that an extension like Citus can add distributed SQL logic to query across different servers, while using PostgreSQL's own logic to query, store, index and manage the data on disk. Combined with a large number of features and rich ecosystem of extensions, PostgreSQL is well-positioned to become a scalable, general-purpose data platform.
Further reading here.