As data enthusiasts, we keep a close eye on new developments in the database world.DuckDB has caught our attention due to its growing popularity for analytics use cases.
In this post, we make an attempt to understand DuckDB, the use cases it is built for, and the advantages it has over existing solutions. We also explore use cases where we could achieve much more with DuckDB when we plug it into a Data API Platform like Hasura.
Let’s quickly go over the what and why of DuckDB before delving into the API explorations.
What is DuckDB
DuckDB is a free, open source, embedded database management system optimized for data analytics and online analytical processing (OLAP). It supports standard SQL to run queries, aggregations, joins, and other SQL functions on the data. The data can be imported from JSON, Parquet, and CSV files.
DuckDB has a number of extensions available for use. The extensions allow reading data from database systems like PostgreSQL and SQLite. Like SQLite, it's a simple, file-based database so there's no separate server installation required. You just include the library in your application. DuckDB specifically caters to users from the analytical processing niche, where a full-fledged database would be overkill.
Why DuckDB
DuckDB is a columnar database. The data is organized by field, keeping all the data associated with a field next to each other. This structure helps since most OLAP queries typically access only a subset of the column but a large number of rows for those columns.
It utilizes vectorized data processing. Large batches of values are processed in a single operation. Vectorized execution leads to better performance in analytical calculations, which are often performed over all values in a particular column.
It uses multi-version concurrency control to become ACID compliant. This allows for modifying tables in parallel to running analytical queries.
“While DuckDB is first in a new class of data management systems, none of DuckDB’s components is revolutionary in its own regard. Instead, we combined methods and algorithms from the state of the art that were best suited for our use cases.”
SQLite for OLAP: Popular use cases
We can immediately relate to the statement that mentioned DuckDB as “SQLite for OLAP,” catering to analytical use cases. It is gaining popularity among data scientists and analysts. As we dig deeper to find what’s making DuckDB popular over other solutions, it has a few quick wins from developer experience and ease of setup and usage.
Some of the quick wins:
You can now do local analysis on large datasets. This is preferred over setting up libraries with other solutions due to SQL familiarity and query engine optimizations.
Analytics in application servers to serve dashboards, by virtue of it being an embedded OLAP system.
With DuckDB being portable, you can work with up to a few GBs (like 100GB range) of data locally, mash up data from different sources and run SQL queries for last mile analytics.
Familiar SQL syntax and a potential data API* (We are exploring a data API with Hasura).
Cost savings by running locally. Running OLAP queries on cloud warehouses are relatively expensive in the long run.
Data imports and mashup from different sources like CSV, Parquet, and PostgreSQL
DuckDB has the ability to mash up data from different sources like CSV / Parquet along with support for dialects like PostgreSQL among other sources. Depending on the use case and workload, you can leverage DuckDB’s memory usage and spilling to disk.
If most of your workload is aggregations, you can run it without even creating the tables because DuckDB is capable of directly querying CSV and Parquet files. For example, you can issue a query like SELECT val1, sum(val2) FROM read_csv_auto('my_csv_file.csv') GROUP BY val 1 on 100GB+ CSV files without having a lot of memory.
If your workload is join-heavy, then the more memory you have, the better, but joins can also spill on disk.
The case for an API layer on top of DuckDB
DuckDB has client libraries in popular languages to connect and execute the SQL queries directly.
Question: Why not directly execute SQL against DuckDB?
Answer: Well, you totally should execute directly for use cases where it is set up locally and fun to experiment, learn, and play with SQL for OLAP queries, as well as quick data mashup workloads.
As fun as it sounded so far to play around and experiment with, it becomes trickier for a few real-world use cases. Let’s delve into some of them.
Team collaboration on OLAP data
If you have multiple users collaborating on a huge data set, then of course it makes sense to use a cloud warehouse. Not everyone can or wants to download data sets into their local machine to set up DuckDB with associated network / bandwidth costs and the time taken.
Also, in the many-terabyte range, DuckDB may not be practical anymore as it's limited to the resources of a single machine. This is where a cloud serverless platform like MotherDuck helps. It is in early preview right now, but once this is launched for everyone to try out, this will be a no-brainer for collaboration workflows. With DuckDB on the cloud, an API layer makes more sense for collaboration.
Building real-world dashboards requires an API
DuckDB has been optimized for running OLAP queries and data analysis. There are use cases where you need dashboards / data visualizations against this data store. Building dashboards on the frontend and sharing the dashboard with a team of analysts wanting to look at the data requires building an API layer.
There is support for WASM which enables executing inside browsers via JavaScript. Calling SQL directly from the frontend totally works fine as long as the use case is local and there is a single user. Beyond that, you would be exposing connection string/credentials and it makes sense to work on an API.
Once you have an API layer, there are general concerns about an API that needs to be addressed. Some of them are authentication and authorization, which we will talk about next, and production-ready features like controlling execution time, query whitelisting, avoiding the obvious security concerns like SQL injection, etc. We will go over these concerns later.
Authentication and authorization
DuckDB doesn’t have a native authorization system and they shouldn’t worry about it either apart from the authentication stuff that solutions like MotherDuck integrates with.
Going back to the dashboards use case, different users collaborating require authorization depending on their access control rules configured for it. In the API layer, you would have to account for role-based access control (RBAC) rules. This is where a tool like Hasura can quickly bridge the gap with built-in declarative authorization.
Data mesh and federation
As we know, DuckDB helps mash up data from multiple sources like files and databases. There are data sources for which you need pipelines to import into DuckDB.
For example, if you are looking to import Snowflake or BigQuery data into DuckDB, you need to set up ingestion pipelines. But what if you need to just combine some metadata from another data source for reporting dashboards? And what if some of the data is not for OLAP but more metadata from an OLTP source or an external API?
A federated API layer would help in this use case. Imagine if you can combine the API capabilities of Hasura over DuckDB + MySQL + Oracle + Stripe API + <name whatever source you have>.
Running DuckDB workloads on edge environments
Lurking around in the MotherDuck Slack, we found a use case where DuckDB queries were required to be run on edge environments like Vercel Edge Functions and Cloudflare Workers with TypeScript. Currently, there are workarounds to set it up using the CLI but not all edge environments support this workflow.
Hasura’s Data Delivery Network connected to a cloud instance of DuckDB (hosted on MotherDuck) would expose an API (http interface) that can be used to execute queries on any edge environments natively.
Of course, it is clear why a HTTP interface is required, so what is stopping you from building one yourself? Well, the idea is to not spend time writing boilerplate CRUD APIs and working on authorization rules when they are all available declaratively through platforms like Hasura.
Going back to the dashboard use case, I would rather spend some time setting up the frontend to create tables/visualizations than spending time writing an API which solves filtering, permissions, pagination, and handle date times.
API Architecture: Hasura + DuckDB on MotherDuck
Conclusion
As we write this, we are working on a data connector for DuckDB so that you can instantly query using GraphQL / REST without writing the boilerplate code, worrying about authorization, and running this whole system for a production use case.
Do reach out to us for collaborating on building the connector or to get an early access to the connector by creating a GitHub issue here.