Why ETL tools throw up more questions than answers
First-generation ETL tools promise to aggregate disparate data sources easily and produce analytics-ready data.
But because they are tools rather than solutions, they raise new questions even as they provide early answers. How do you monitor the process? Who creates the business rules? Is there ongoing verification and testing? How are required changes implemented? The list is endless. In the end, most tools don’t qualify as a complete solution because the resulting data still requires more engineering.
The answer lies in comprehensive data engineering automation. What do we mean by “comprehensive”? ETL has to integrate with a workflow that the business can own and that delivers foundational data. Otherwise, you’re not solving the problem; you’re just moving it downstream. ETL is important, but it’s part of a much bigger solution.
For the purposes of this blog post, however, let’s focus on the mechanics of ETL.
While extracting, transforming, and loading data for one pipeline may be just about manageable, doing this at scale across an enterprise quickly becomes unsustainable.
So let’s look at some of the considerations for E, T and L that show how complex it is to manage a data pipeline.
When extracting data…
How large is your dataset and do you have the resources to extract that volume?
Do you know when to extract your data and in which timezone?
How do you account for missing or faulty data? Do you re-pull data from previous periods? How often do you check for errors?
If extraction involves API calls to a data source, is there a quota on the number of calls?
What happens when an API suddenly goes down or changes unexpectedly?
Are you extracting in a secure manner and what credentials are required?
How much will the extraction phase cost your company and have you budgeted enough?
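To make a few of these extraction questions concrete, here is a minimal Python sketch of a paged extract that respects a call quota and retries with exponential backoff when the source is temporarily unavailable. Everything named here (extract_pages, fetch_page, QuotaExceeded, the default retry counts and delays) is hypothetical and chosen for illustration; a real source will have its own client, error types and limits.

```python
import time
from typing import Callable, Iterator, List, Optional


class QuotaExceeded(Exception):
    """Raised when the call budget for this run is used up (hypothetical)."""


def extract_pages(
    fetch_page: Callable[[int], Optional[List[dict]]],
    max_calls: int,
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[dict]:
    """Yield rows page by page until the source returns an empty page.

    fetch_page is an assumed callable that returns a list of rows for a
    page number, or None/[] when there is no more data. Every attempt,
    including retries, counts against max_calls.
    """
    calls = 0
    page = 0
    while True:
        attempt = 0
        while True:
            if calls >= max_calls:
                raise QuotaExceeded(f"used all {max_calls} calls")
            calls += 1
            try:
                rows = fetch_page(page)
                break
            except ConnectionError:
                attempt += 1
                if attempt > max_retries:
                    raise  # give up and let the caller alert someone
                # Exponential backoff: base_delay, 2x, 4x, ...
                sleep(base_delay * 2 ** (attempt - 1))
        if not rows:
            return
        yield from rows
        page += 1
```

Passing sleep in as a parameter keeps the sketch testable without real waiting. It deliberately leaves out scheduling, credentials and cost tracking, which the other questions in this list still have to answer.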
When transforming data…
Can you verify that what you are transforming is useful data, i.e. can you effectively flag and remove anomalies before the data is transformed?
How will you format, sort and configure your data to match the schema of the intended target location?
Are you confident you’re labeling your files correctly and consistently to ensure the data can be loaded and used properly?
Are you sufficiently encrypting or removing any data that is subject to regulation?
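As a rough illustration of the transformation questions above, the following Python sketch validates each row, casts it to a target schema, normalizes timestamps to UTC and replaces an email address with a hash. The field names (order_id, amount, email, ordered_at) and the function names are assumptions made up for this example. Hashing is only one way to pseudonymize a field, and whether it satisfies a given regulation is a separate question.

```python
import hashlib
import math
from datetime import datetime, timezone
from typing import List, Tuple


def transform_row(row: dict) -> dict:
    """Cast one raw row to the (hypothetical) target schema or raise."""
    amount = float(row["amount"])
    if not (math.isfinite(amount) and amount >= 0):
        raise ValueError(f"anomalous amount: {row['amount']!r}")

    ordered_at = datetime.fromisoformat(row["ordered_at"])
    if ordered_at.tzinfo is None:
        # Refuse to guess the time zone of a naive timestamp.
        raise ValueError("timestamp has no UTC offset")

    email = row["email"].strip().lower()
    return {
        "order_id": int(row["order_id"]),
        "amount": amount,
        # The raw email never reaches the output, only its SHA-256 digest.
        "email_hash": hashlib.sha256(email.encode("utf-8")).hexdigest(),
        "ordered_at": ordered_at.astimezone(timezone.utc).isoformat(),
    }


def transform_batch(rows: List[dict]) -> Tuple[List[dict], List[tuple]]:
    """Return (clean rows, rejected rows paired with the reason)."""
    clean, rejected = [], []
    for row in rows:
        try:
            clean.append(transform_row(row))
        except (KeyError, ValueError, TypeError, AttributeError) as exc:
            rejected.append((row, str(exc)))
    return clean, rejected
```

Keeping rejected rows alongside the reason, rather than silently dropping them, gives someone a queue to review when anomalies show up.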
When loading data…
Do you know where you are putting the data, and who is responsible for cataloging, archiving and maintenance?
Do you know who is monitoring your data and how alerts and outages are communicated to the rest of the team?
Does the business team that relies on the dataset know how to use it, and who to contact if some of the data fails to load or is malformed, or if a new data source needs to be integrated?
Is the dataset under any form of governance that requires certain rules to be followed, and how are you handling any PII (Personally Identifiable Information)?
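One of the loading questions, what happens when part of a batch fails, can be sketched in a few lines of Python. This example uses an in-memory SQLite database from the standard library as a stand-in for the real target. The orders table, the load_rows function and the alert callback are all hypothetical. The idea is that a batch either loads completely or not at all, and that a failure is reported to someone rather than swallowed.

```python
import sqlite3
from typing import Callable, Iterable


def open_target() -> sqlite3.Connection:
    """Create a throwaway in-memory target with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
    )
    return conn


def load_rows(
    conn: sqlite3.Connection,
    rows: Iterable[dict],
    alert: Callable[[str], None],
) -> int:
    """Load a batch atomically; return the number of rows loaded."""
    batch = list(rows)
    try:
        # As a context manager, the connection commits on success and
        # rolls back the whole batch if any statement raises.
        with conn:
            conn.executemany(
                "INSERT INTO orders (order_id, amount) "
                "VALUES (:order_id, :amount)",
                batch,
            )
    except sqlite3.Error as exc:
        alert(f"load failed, {len(batch)} rows rolled back: {exc}")
        return 0
    return len(batch)
```

In practice the alert callback might post to a chat channel or page an on-call engineer; here it is just any function that accepts a message.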
Moving beyond first-generation ETL tools
As you can see, once we dig into the minefield that is ETL, there is a lot to think about before you even begin each process. And if, like most data-driven enterprises, you’re managing millions or even billions of rows of data every day, you’ll agree that applying scalable, comprehensive automation to the process is the only sustainable way to build and maintain a strategic, trustworthy data asset.
A comprehensive automated ETL platform aggregates disparate data sources reliably and in real time, delivering the foundational data needed to uncover insights, so you can focus on making timely, strategic business decisions to drive revenue.