Data Warehouse Scanning & Optimization 101

#1 Data Warehouse December- setting the foundation

Dec 01, 2024

∙ Paid

4 years ago, I began my first official analytics engineering role. As the only data hire besides my manager, I owned the entire setup of our data stack. While I decided on a majority of the tools we used, my manager already set up Snowflake as our data warehouse.

Working in purely engineering roles prior, this was my first experience using a data warehouse for analytics. I learned everything I knew about data warehousing at the time from my experience with Snowflake. I didn’t know anything else!

To learn Snowflake, I depended on the docs and Snowflake-specific articles to set it up and save costs. Instead of learning the foundations of data warehousing, I learned how to use a tool.

Then, when I joined Kit, where we use Redshift, I realized that a lot of my knowledge no longer applied. So many things about this new data warehouse differed from Snowflake. Because I didn’t have the foundational knowledge, it felt like I was starting from square one again.

A few weeks ago, I shared on LinkedIn that you need to master the foundations before you can master a specific tool. I still stand by this, which is why I want to focus on the concepts in this series rather than the tools.

When you master the underlying concepts, you can apply them to any tool. Your knowledge isn’t dependent on just one thing but can be applied to whatever data warehouse you decide to use.

Don’t make the same mistake I did. Take the time to understand the foundational concepts that can be applied to Snowflake, Redshift, Firebolt, BigQuery, you name it.

In today’s newsletter, we’ll start by introducing some key data warehouse concepts that will help you better understand query optimization. In the following newsletters, we will dive deeper into how to use these in your data warehouse and how different data warehouses handle them.

How does a data warehouse work?

Let’s compare our data warehouse to a grocery store. The data is organized into column-based fields, just as the food in a grocery store is organized into aisles. Each item has a specific place in an aisle comprising certain categories.

Grocery stores are organized into aisles to help customers easily find what they need. This way, if you need to run in quickly to grab something, you can find the aisle, narrow down your search for the item, and buy that item as fast as possible. After all, most of us don’t go to the grocery store to spend all day in there.

A data warehouse works similarly. Different techniques help our queries run faster, ensuring our data arrives as soon as possible. Data pruning represents the organization of the store and the experience of looking for a specific item. Indexes act as the shelf locations of the items. Partitioning acts as the aisles.

Views represent a special order that needs to be placed for a Thanksgiving turkey or a Christmas ham while tables represent physical items you can buy right away.

Let’s explore deeper into data pruning, indexes, and partitioning.

Keep reading with a 7-day free trial

Subscribe to Learn Analytics Engineering to keep reading this post and get 7 days of free access to the full post archives.