Yesterday, I made some changes to a staging model in dbt and an hour later noticed a failure in the data pipeline.
Lo and behold, there was a legacy model that depended on a column I had just renamed. However, because this model was written incorrectly, referencing sources directly rather than their staging models, it didn’t appear in the DAG generated by dbt.
When I started digging through this model, I could barely read it due to poor formatting and conflicting style conventions. I’m talking about inconsistent naming, leading commas, and subqueries inside of CTEs…
Code written like this is nearly impossible to debug. Mainly because I couldn’t even read it to understand what it was doing.
Every time I see code like this, I get a little sad. Someone spent a lot of work writing this and it can’t even be reused. In cases like this, it’s better to spend time writing a new model from scratch than it is to understand the code already written.
While the technicalities of your SQL code are important, a lot of times it’s the seemingly simple practices that separate poorly written code from well-written code.
This legacy model reminded me of the SQL mistakes I frequently see in dbt models, but also in analyst queries. In this article, I’m going to break down each of these 4 mistakes and how to fix them.
This way, you will be writing not only performant code but readable code that can last the test of time.
Mistake #1: Using nested subqueries
Subqueries are queries that reference another query rather than a permanent or temporary table. They look something like this:
SELECT
customer_name
FROM (SELECT account_id, customer_id, customer_name
FROM customers
WHERE YEAR(created_at)=2023)
While this is a very simple example of what a subquery looks like, you can probably see how these can quickly get hard to manage.
Imagine using multiple subqueries in one query, or even worse, a subquery within a subquery. These are nearly impossible to understand because you can’t tell how the code is transforming from one subquery to the next.
To fully understand the purpose of each subquery, you need to refactor the code into CTEs or run each query individually within your warehouse.
So, if you’re going to have to change the code to understand it, why not start with writing it in an easy-to-read way?
SOLUTION: Use CTEs instead of subqueries, even if they seem quite simple. Simpler is better than overly complex.
WITH
new_customers_2023 AS (
SELECT
account_id,
customer_id,
customer_name
FROM customers
WHERE YEAR(created_at)=2023
)
SELECT customer_name FROM new_customers_2023
Again, this is a simple example, but you can already see how much more readable this is. We can now identify that we are selecting customer_name
from a list of our new customers in 2023.
The CTE allows us to add a description of what the query does (in the form of its name) and format the query so we can see it as its own chunk of code.
Mistake #2: Assigning undescriptive naming
Now that we’ve reviewed why CTEs are so much better than subqueries, I want you to look at the above example and imagine that the CTE’s name wasn’t so descriptive.
Part of the reason the above code is so easy to understand is due to the name given to the CTE. You know exactly what data it contains without even reading the code.
This is why I recommend avoiding using abbreviations to name your CTEs, or even aliasing your tables.
When joining two tables, it is much easier to see which fields come from which table when they are aliased using the table name. For example, in dbt, all of your SQL code will read from a {{ ref(‘‘) }}
. I recommend aliasing this reference with the model name minus any prefixes.
SOLUTION: Name your CTEs with descriptive names explaining the action that is being taken. Use table aliases that clearly distinguish which table a field is coming from.
SELECT
order_lines.order_line_id,
order_lines.product_id,
orders.order_id,
orders.order_created_at
FROM {{ ref('stg_shopify__order_lines') }} order_lines
LEFT JOIN {{ ref('stg_shopify__orders') }} orders
ON order_lines.order_id = orders.order_id
Now if I were to have used abbreviations with these references, it would have been much harder to connect the dots of which order-related field belongs to which model.
SELECT
ol.order_line_id,
ol.product_id,
o.order_id,
o.order_created_at
FROM {{ ref('stg_shopify__order_lines') }} ol
LEFT JOIN {{ ref('stg_shopify__orders') }} o
ON order_lines.order_id = orders.order_id
If you insist on using abbreviations (because sometimes model names can get very long), at least choose abbreviations that clearly represent each table.
Please can we stop aliasing our tables with a
and b
!?
Mistake #3: Not considering the end user
It’s easy to start writing SQL, dead set on deriving the logic that solves your problem. However, when we start coding without thinking about our end user in mind, we are led further from the end goal.
I’ll be the first to admit my guilt here. With reverse ETL models, I will often start writing code without realizing how the business will be using it.
I forget to ask the important questions like:
Can there be NULL values in the destination field?
Does the stakeholder filter values on TRUE/FALSE or Y/N?
If one field is FALSE, should another related field have any value at all?
For example, I recently was working on a model that sends data to Hubspot. For one particular field (let’s call it is_verified_account
), my upstream logic pulled only a list of accounts that were verified.
Therefore, when joined to the downstream model, only TRUE values were populated for is_verified_account
. Stakeholders automatically thought something was wrong with the data because they weren’t seeing any FALSE values.
Luckily, I was able to add a simple NVL2 function to change this, ensuring NULLs populated as FALSE.
NVL2(is_verified_account, TRUE, FALSE)
SOLUTION: Take the time to understand how the code is being used by the end user. Take that into account before writing out the logic.
Keep reading with a 7-day free trial
Subscribe to Learn Analytics Engineering to keep reading this post and get 7 days of free access to the full post archives.