I always recommend writing a style guide before writing any code in your first dbt project.
Without a set of standards in place, code quickly goes off the rails. Nobody knows the conventions they should be following while writing.
Leading commas or trailing commas? CTEs or subqueries? Prefixes and plurals used in model names?
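A style guide can settle these questions with a short, concrete example. Here is a hypothetical convention block (the model and column names are made up) showing one possible set of choices: trailing commas, CTEs over subqueries, and double-underscore source prefixes on staging models:

```sql
-- Hypothetical style-guide example: trailing commas, CTEs instead of subqueries
with orders as (

    select
        order_id,
        customer_id,
        ordered_at
    from {{ ref('stg_shop__orders') }}

),

customers as (

    select
        customer_id,
        customer_name
    from {{ ref('stg_shop__customers') }}

)

select
    orders.order_id,
    customers.customer_name,
    orders.ordered_at
from orders
left join customers
    on orders.customer_id = customers.customer_id
```

Whatever choices you make matter less than making them explicit, so everyone writes the same way.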
This becomes especially important when you have more than one analytics engineer on your team. To understand one another's work, and pick up where someone else left off, you all need to follow the same conventions.
However, there may come a time when you’ve already begun working in dbt, documenting and building out models, and you notice some of the conventions you defined no longer work for you.
When this happens, don’t ignore it. Address it before your problem becomes bigger.
Lately, in my team, we noticed things becoming more and more disorganized. While we had a style guide in place, it didn’t address the problems we were facing.
Core models weren’t true core models. They were referencing staging models because no intermediate models existed to solve our use cases. The default directory for any model we needed became models/intermediate.
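As a sketch of the anti-pattern (all model names here are hypothetical), a core model ends up reaching directly into staging because no intermediate layer exists for the logic it needs:

```sql
-- Anti-pattern: a core model referencing staging directly
-- models/core/fct_orders.sql (hypothetical)
select
    order_id,
    customer_id,
    order_total
from {{ ref('stg_shop__orders') }}  -- staging ref inside a core model

-- Better: push the shared logic into an intermediate model,
-- then have the core model reference that layer instead:
-- from {{ ref('int_orders_enriched') }}
```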
We needed to make some changes so we could scale in a way where our dbt project makes our jobs easier, not harder.
Here are some project signals that indicate it’s time to reorganize things:
You don’t know which model to use to solve your problem.
There are two ways this problem can go: either you don’t have the data model you need to solve the problem, or you have too many models that could potentially solve it with no clear difference between them.
When discussing reorganizing our dbt project, we are most likely thinking about the second problem. The data transformations we need are available somewhere; it’s just unclear where that place is.
This typically means a few things:
your models aren’t reusable/modular
you don’t understand how your data is being used downstream
you are lacking clear documentation
Your model should be named and defined in one place so users know exactly where to go when they have a related question.
On my team, we weren’t clear about the purpose of an intermediate model. We turned all of our code into intermediate models rather than thinking about reusable logic.
We had four related intermediate models, which fed into one overarching core model. However, these could have been written as one intermediate model that was then aggregated and joined into a more general core model.
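In rough terms (the model names below are made up for illustration), the restructuring looked something like this:

```
Before: four narrow intermediate models feeding one core model
  models/intermediate/int_signups.sql
  models/intermediate/int_activations.sql
  models/intermediate/int_cancellations.sql
  models/intermediate/int_reactivations.sql
  models/core/fct_subscription_events.sql

After: one reusable intermediate model, aggregated and joined in core
  models/intermediate/int_subscription_events.sql
  models/core/fct_subscriptions.sql
```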
Some helpful questions to ask yourself when deciding how to define intermediate and core models:
What upstream layers (staging, intermediate) should be referenced in these models? What layers should never be referenced?
Will this model be used downstream in other models?
How often will this model be used to answer business questions?
The rules will be different for every team. Think about your answers and how those can be translated into a structure that works for everyone.
You are spending too much time searching for what you need.
How much time do you spend looking for the documentation for a specific source? How much time do you spend looking for the model you want to use?
If the answer is too much time, then you may want to change the names of your subdirectories.
dbt recommends naming subdirectories by source for staging models and by business function for core models. For intermediate models, there is little guidance on naming conventions.
I tend to agree with organizing staging models by source, but not necessarily with organizing core models by business function. A lot of core models should be used across multiple functions within the business.
For example, an accounts model will be used by product, finance, and growth teams!
I’m finding it helpful to organize by objects that matter across the whole business, and to keep model names as simple as possible.
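Concretely, organizing core models by business object rather than business function might look something like this (directory and model names are illustrative):

```
By function (harder to navigate when objects span teams):
  models/core/finance/accounts.sql
  models/core/product/accounts_usage.sql

By object (one obvious home per concept):
  models/core/accounts/accounts.sql
  models/core/orders/orders.sql
  models/core/subscriptions/subscriptions.sql
```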
Now, the organization of intermediate models tends to be the wild wild west. Do I organize by data source? By business function? By object?
It depends.
We used to organize our models by data source, similar to staging, but we find that no longer makes sense as we begin to scale out our data models. After all, most intermediate models will reference multiple data sources.
Organizing these models by object has allowed us to map the models back to the final core model, named after this same object. This makes it clear exactly what is going into each of our core models (or what can go into them).
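Under this convention, each intermediate subdirectory maps to the core model named after the same object (again, all names here are hypothetical):

```
models/intermediate/accounts/int_accounts_joined_to_plans.sql
models/intermediate/accounts/int_accounts_enriched.sql
models/core/accounts/accounts.sql
```

A glance at the tree tells you everything that feeds, or could feed, the accounts core model.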