Some may argue that data ingestion is the most important step in your data pipeline. I’d also argue that it’s the step where the most can go wrong.
Without a reliable data ingestion tool, you will always question your data.
You may wonder…
Are there missing records in my data?
Is my data arriving on time?
Are schema and data type changes being properly handled?
When you question your data, you can’t depend on it for the right answers. If you can’t rely on your data, what can you rely on?
Not to mention, data ingestion tools are EXPENSIVE. The best, most reliable tools aren’t affordable for many companies, so you’re forced to choose between cost and reliability.
That said, an enterprise price tag doesn’t necessarily mean a better tool. In theory, paying customers should get better support, but I’ve found that isn’t always the case. While exploring new data ingestion tools, I was ghosted by sales reps and left guessing about sync errors because there were no logs to consult.
With all this being said, in today’s rendition of Data Pipeline Summer, we will first discuss important concepts when configuring a data ingestion tool. These include:
sync methods
propagating schema changes
raw vs normalized data
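To make the sync-method distinction concrete before we get to Airbyte, here’s a minimal Python sketch contrasting a full refresh with an incremental, cursor-based sync. The table, column names, and cursor value are all made up for illustration; real tools like Airbyte manage the cursor state for you.

```python
# Toy "source" table: each record carries an updated_at timestamp
# that can serve as an incremental cursor.
source = [
    {"id": 1, "name": "alpha", "updated_at": "2024-06-01"},
    {"id": 2, "name": "beta",  "updated_at": "2024-06-05"},
    {"id": 3, "name": "gamma", "updated_at": "2024-06-10"},
]

def full_refresh(source_rows):
    """Full refresh: re-copy every record on every sync."""
    return list(source_rows)

def incremental(source_rows, cursor):
    """Incremental: copy only records updated after the saved cursor."""
    return [r for r in source_rows if r["updated_at"] > cursor]

# First sync: there is no cursor yet, so pull everything.
destination = full_refresh(source)

# Later syncs: only pull rows that changed since the last cursor value.
last_cursor = "2024-06-05"
new_rows = incremental(source, last_cursor)
print([r["id"] for r in new_rows])  # only id 3 changed after the cursor
```

Full refresh is simple and self-healing but wasteful on large tables; incremental syncs move far less data but depend on a trustworthy cursor column.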
We will then learn how to send data from Google Sheets to Snowflake using Airbyte, an open-source data ingestion tool. Lastly, I’ll send you off with this week’s challenge of setting up your own Airbyte sync!
Shoutout to Andrew for answering the bonus question in the last email correctly and winning himself a copy of The ABCs of Analytics Engineering! Stay tuned for more bonus questions for a chance to win your own copy.