Why Use DuckDB?
#2: Data Pipeline Summer: How to get started with DuckDB and when to use MotherDuck
This week, we are focused on databases, data warehouses, and their role in the data pipeline. Where your data is stored is where all of the magic happens. Without storage and compute, you can’t ingest data, and you definitely can’t transform it.
Throughout this challenge, DuckDB will be acting as our data warehouse. It is the analytics engine and database we will be using to execute and store our data models. DuckDB has become more and more popular due to its performance and cost-saving abilities.
If you’ve ever started a data pipeline project in Snowflake and forgotten that your pipeline was running without any sort of automatic resource monitor… You probably feel my pain. That’s $2,000 I will never get back 🙃
We won’t need to worry about that with DuckDB because it is completely free!
This week’s Data Pipeline Summer challenge is brought to you by MotherDuck! Not only do they throw the best conference parties (sorry Snowflake!), but they also offer a cloud data warehouse that allows you to scale your data without running up costly bills.
They harness the power of DuckDB (which you will read about below) while adding proper production-ready analytics capabilities. MotherDuck uses DuckDB to create separate, perfectly-sized instances for each user on your team. Where DuckDB falls short, MotherDuck makes up for it.
Why DuckDB?
DuckDB is an open-source database, originally created as a research project in the Netherlands, with over 10 million monthly downloads on PyPI. Traditional databases run your application and your database as separate processes. DuckDB runs in-process, directly within your application, making it super fast for local development.
This gives it four major advantages:
speed
simplicity
efficiency
easy deployment
DuckDB uses columnar storage and is optimized for READ operations, allowing it to process large-volume analytical queries, all from a single machine.
Many companies are finding unique uses for it, such as:
offloading non-critical queries to DuckDB to offset costs
increasing speed of querying large datasets
powering interactive data apps
processing security logs
running SQLMesh models and tests locally before running them in production
But, most importantly for you, it makes it easy to develop local analytics projects! No enterprise data warehouse needed.
What DuckDB is NOT Good For
Scaling DuckDB is absolutely possible, but it requires engineering effort and architectural forethought. You can’t treat it as an automatic scaling solution; instead, you need to design with that specific use case in mind.
DuckDB also does not support multiple processes writing data at the same time. This makes it a poor fit for large teams and unreliable for shared business use cases. This is where MotherDuck comes in.
MotherDuck is a data warehouse powered by DuckDB that addresses DuckDB’s limitations when it comes to analytics for data teams. It allows for multi-user data access, cloud scalability without the headache of managing the infrastructure, and team collaboration.
MotherDuck offers all of the performance and cost-saving benefits of DuckDB while still allowing you to work as a team and push data to a production environment.
Getting started with DuckDB
DuckDB has clients for many different languages, but the Python client is the most popular.
You can follow installation instructions for using DuckDB here.
To install DuckDB using the Python client, you can run the following:
pip install duckdb
To launch DuckDB in the CLI, simply call the duckdb command.
From here, you can use SQL to copy data, create tables, and read from them.
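For example, any SQL runs straight from the prompt. Here’s a throwaway query just to confirm the CLI is working (not part of the challenge data):

-- any SQL works directly at the DuckDB prompt
SELECT 'DuckDB is running' AS status, 40 + 2 AS answer;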
Persisting your data
If you don’t persist your data, you will lose the tables you’ve created in DuckDB each time you start a new session. To persist it, attach a database file; this works even when DuckDB is already running:
ATTACH DATABASE '/Users/madison/sqlmesh/trail_trekker.db' AS trail_trekker_db;
Now you’ve created a persistent database called trail_trekker_db! Make sure the file you specify has an extension of .db, .duckdb, or .ddb.
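If you want new tables to land in that file by default, you can switch to the attached database with USE. A minimal sketch (the table name below is just an illustration, not part of the challenge):

-- make the attached database the default target for new objects
USE trail_trekker_db;
-- illustrative table; it is written to trail_trekker.db and survives restarts
CREATE TABLE persistence_check AS SELECT 1 AS ok;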
Creating data tables
DuckDB reads data based on file extensions. Even though it offers a specific function for reading CSVs, you don’t need it if the default parameters work for you. You simply need to specify the path to the file you want to populate the table with.
This function infers the delimiter and datatypes for you, but you can also specify those as needed. To do this, you can use the read_csv() function, as shown below.
-- read from CSV with default params
CREATE TABLE features AS SELECT * FROM '/Users/madison/Downloads/data_pipeline_summer/features.csv';
-- read from CSV and specify params
-- not needed for this example, but showing you how to use this
SELECT *
FROM read_csv('/Users/madison/Downloads/data_pipeline_summer/features.csv',
    delim = ',',
    header = true,
    columns = {
        'feature_id': 'VARCHAR',
        'feature_name': 'VARCHAR',
        'feature_description': 'VARCHAR',
        'feature_category': 'VARCHAR'
    });
While this CSV file exists locally, you can also use the same logic for files that are hosted on the internet. This command creates a table by reading a publicly hosted CSV file from GitHub. This is what you will want to use for the challenge to create this data locally using DuckDB.
CREATE TABLE plans AS SELECT * FROM 'https://raw.githubusercontent.com/schottma/trail_trekker/refs/heads/main/plans.csv';
You can see that DuckDB automatically handles the CSV parsing.
Now, when we run a basic SELECT * command from the CLI, we should see the data that populates this table.
select * from features;
DuckDB also supports reading data from different file formats (JSON, Parquet, Iceberg, etc.), from an HTTPS endpoint, or from blob storage (S3, Azure Blob Storage, Google Cloud Storage, etc.). You can do this using something called extensions.
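For example, reading over HTTP(S) or from object storage goes through the httpfs extension. Here’s a minimal sketch (the Parquet URL is a placeholder, not part of the challenge data):

-- install the extension once, then load it in your session
INSTALL httpfs;
LOAD httpfs;
-- hypothetical example: query a remote Parquet file directly over https
SELECT *
FROM read_parquet('https://example.com/some/remote/file.parquet')
LIMIT 10;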
Debugging DuckDB
To make sure our table has been created as expected, let’s run a simple SQL query counting the feature IDs in the table.
select count(feature_id) from features;
If you run this basic COUNT function on the table, you should get 8 as your output. This looks great!
If you’re having any issues, start by running SHOW TABLES. This shows you all of the tables in your database. I also recommend checking the datatypes of the fields in the table using DESCRIBE.
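In practice, those two checks look like this (using the features table created earlier):

-- list every table in the current database
SHOW TABLES;
-- inspect column names and the datatypes DuckDB inferred
DESCRIBE features;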
This shows me that the CSV parsing worked just as expected. DuckDB has invested heavily in having the best automatic CSV parsing functionality, so you shouldn’t run into too many issues. If you’d like to learn more about this, I highly recommend watching this video.
If you find an issue with your table at any point and need to recreate it, you can do this with the CREATE OR REPLACE TABLE command.
CREATE OR REPLACE TABLE features AS SELECT *
FROM '/Users/madison/Downloads/data_pipeline_summer/features.csv';
🏆 Challenge: Set up DuckDB with Trail Trekker’s raw data
Now that you understand the powers of using DuckDB and know how to use it with raw CSV files, let’s create some data tables for Trail Trekker! You can download the CSV files you will need for this project here.
I want you to create a table for each of the data sources, validating that each of them is set up as expected. This will help you get familiar with the data you are working with and how to interact with DuckDB.
Next week we will be setting up our SQLMesh project using the DuckDB database we created this week!
As always, let me know if you have any questions or run into any roadblocks.
Have fun!
Madison