How to Solve the Top Data Team Challenges
Addressing the State of Analytics Engineering 2024 report
dbt recently released the State of Analytics Engineering 2024 report, based on data collected over the last few months from dbt users. The report covers data points like salaries, success metrics, and areas of investment.
However, the data that interested me the most was on top data team challenges. According to the report, data teams struggle the most with:
Poor data quality
Ambiguous data ownership
Poor stakeholder data literacy
Integrating data from various sources and documenting data products trail close behind these challenges.
None of these challenges surprised me because they are the same ones we face at ConvertKit.
Honestly, I’d be surprised if any data team doesn’t face these challenges, as they are multi-layered and quite complex.
In this article, I’ll discuss how I’ve struggled with these challenges and the different ways I’ve worked to improve on them.
My hope for this post is that we all share in the comments how we’ve struggled with and overcome our hurdles around data quality, ownership, and stakeholder data literacy. This way, we can each get a little bit closer to tackling the obstacles in our way.
Poor Data Quality
I’m not sure this challenge will ever truly go away, but there are ways we can get better at spotting data quality issues. One of those is testing.
When adding any data source to your environment, or writing any data model, you always need to consider the points at which things can go wrong. When you think about testing while you’re building, the data is fresh in your mind, and you guarantee that nothing goes live before it has been tested.
Again: never push something to production without proper testing in place, even if a stakeholder is begging you for the data. It’s always worth taking the extra time to understand the challenges you’re up against.
When testing your data, consider the following (a sketch follows the list):
Freshness, for entire tables but also for specific field values
Data volume anomalies by time period
Schema changes
Presence of NULL values under certain conditions
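To make this concrete, here’s a minimal sketch of how these checks might look in a dbt schema.yml, assuming you’ve installed the dbt_utils and elementary packages. The model, column names, and thresholds are illustrative, not from any particular project:

```yaml
# models/schema.yml -- model, column, and threshold values are illustrative
version: 2

models:
  - name: orders
    tests:
      # Freshness: fail if no new order row in the last day (dbt_utils)
      - dbt_utils.recency:
          datepart: day
          field: created_at
          interval: 1
      # Row-count anomalies by time period (elementary)
      - elementary.volume_anomalies
      # Alert on added, removed, or retyped columns (elementary)
      - elementary.schema_changes
    columns:
      - name: discount_code
        tests:
          # NULL check applied only under a condition
          - not_null:
              config:
                where: "order_type = 'promo'"
```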
If you want to learn more about generic testing and free testing packages in dbt, I cover them extensively in my live Transform Your Data Stack with dbt course starting in May, where we build complex tests to check for the potential issues I mentioned.
You can also check out these past newsletters I’ve written on data quality:
⭐️ 7 Pillars for a Data Quality Framework
⭐️ Tutorial: Write a Custom Generic dbt Test
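And for the custom route, the sketch below shows the general shape of a generic test you could write yourself for the conditional NULL case. The test name and arguments are my own for illustration; the full walkthrough lives in the tutorial linked above:

```sql
-- tests/generic/not_null_where_condition.sql
-- Minimal custom generic test: returns (and therefore fails on) rows
-- where column_name is NULL, but only for rows matching the condition.
{% test not_null_where_condition(model, column_name, condition) %}

select *
from {{ model }}
where {{ condition }}
  and {{ column_name }} is null

{% endtest %}
```

Once it lives in your tests/generic folder, you apply it in schema.yml like any built-in test:

```yaml
columns:
  - name: discount_code
    tests:
      - not_null_where_condition:
          condition: "order_type = 'promo'"
```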
Reverse ETL-Enabled Platforms
Oh, Hubspot… if you work with reverse ETL and Hubspot, you probably understand the pain. Hubspot is a tool with a lot of cooks in the kitchen, which often makes it hard to pinpoint data quality issues.
Is this a reverse ETL problem?
Is this caused by some other tool that’s integrated into Hubspot?
Is this caused by a user setting something up incorrectly?
There are so many possible sources of data quality issues in a complex system like this that tracking down the root cause becomes nearly impossible. Not to mention how out of hand all the different fields, workflows, and objects can get.
How do you test for data quality issues in the platforms you send data to? How do you know when it’s an issue with the data in your warehouse versus another tool connected to the platform?
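One pattern that can help, sketched below with hypothetical model, source, and column names, is a reconciliation test: ingest the platform’s data back into the warehouse with your ELT tool, then diff it against the model that feeds your reverse ETL sync. If the diff is empty, the issue most likely lives downstream of your warehouse.

```sql
-- tests/reconcile_hubspot_contacts.sql
-- Hypothetical singular dbt test: fails if any contact pushed via
-- reverse ETL is missing from the Hubspot data ingested back into
-- the warehouse. All names here are illustrative.
with pushed as (
    select email
    from {{ ref('retl_hubspot_contacts') }}
),
ingested as (
    select property_email as email
    from {{ source('hubspot', 'contact') }}
)
select pushed.email
from pushed
left join ingested
    on pushed.email = ingested.email
where ingested.email is null
```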