What is ‘transform’ in ETL?
In our last post, we looked at just some of the factors that need to be considered to extract your data successfully. For example, without determining when, how and what you are extracting, you risk ending up with an inconsistent and therefore untrustworthy dataset. The same is true when it comes to transforming your data.
T for Transform
Once your raw data has been extracted successfully, it’s ready to be transformed into foundational data. However, the transformation phase is complex and must be considered carefully before any process is applied to the data.
During the transformation phase, a series of rules is applied to the extracted dataset to enrich it, aggregate it or normalize it for its intended use. This involves several overarching processes, each with its own set of sub-processes.
Factors to consider in data transformation
Data cleansing: Inconsistencies and missing values in the raw data must be resolved. This involves filtering, deduplication, validation, and authentication.
Verification: Unusable data and anomalies may need to be flagged and removed as part of the transformation process. This involves auditing the dataset to ensure both high quality and compliance with its use case.
Formatting and sorting for intended use: The data must be configured to match the schema of the intended target location.
Labeling: If files are being produced for downstream use, they should be labeled correctly and consistently to ensure the data can be loaded and used properly.
Governance and protection: Any dataset which is subject to regulation may need to be encrypted or removed.
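To make these factors concrete, here is a minimal sketch of a transformation step in Python. It assumes extracted records arrive as a list of dicts; the field names ("email", "signup_date") and the date formats are hypothetical, and a real pipeline would typically use a dedicated framework rather than hand-rolled loops.

```python
# Minimal transformation sketch: cleansing, deduplication, verification,
# and formatting for a target schema. Field names are illustrative only.
from datetime import datetime

def transform(rows):
    seen = set()
    out = []
    for row in rows:
        # Data cleansing: drop rows with missing required values
        if not row.get("email"):
            continue
        # Deduplication: keep only the first occurrence of each email
        key = row["email"].strip().lower()
        if key in seen:
            continue
        seen.add(key)
        # Formatting for intended use: normalize dates to the target
        # schema's expected format (here, ISO 8601)
        raw = row.get("signup_date", "")
        try:
            date = datetime.strptime(raw, "%d/%m/%Y").date().isoformat()
        except ValueError:
            # Verification: flag anomalies rather than guessing a value
            date = None
        out.append({"email": key, "signup_date": date})
    return out

rows = [
    {"email": "Ada@example.com", "signup_date": "01/02/2024"},
    {"email": "ada@example.com", "signup_date": "01/02/2024"},  # duplicate
    {"email": "", "signup_date": "03/04/2024"},                 # missing value
]
print(transform(rows))
# [{'email': 'ada@example.com', 'signup_date': '2024-02-01'}]
```

Each rule here maps to one of the factors above; in practice, each would be versioned and audited so you can answer "how do you know the rules you are applying are correct?"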
Questions to ask in data transformation
As with extraction, each step of the transformation phase will throw up a number of questions and complications. What are you transforming? Where are you transforming? How do you know the rules you are applying are correct? How can these rules be audited? Do you have the processing power to apply all these rules? What’s the processing cost?
And it doesn’t stop there. You also need to consider the following: When the extraction process changes in production, such as Facebook adding a new data column, how do you manage that change? How will this affect scheduling? Do you return to the extraction phase to backfill and fix it? How do you track the changes made to your processes?
The scale of the dataset will also affect a number of factors: How do you process a large dataset, and how fast will it be? Who oversees validation? For example, this could be an engineer or a data analyst.
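On the question of scale, one common answer is to process records lazily rather than loading the full dataset into memory. This is a sketch under that assumption, using a generator over a CSV file; the file layout and the "amount" column are hypothetical.

```python
# Streaming transformation sketch: memory use stays flat regardless of
# dataset size because rows are read and transformed one at a time.
import csv

def stream_transform(path):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Cleansing: skip rows with a missing required field
            if not row.get("amount"):
                continue
            # Formatting: cast to the type the target schema expects
            row["amount"] = float(row["amount"])
            yield row
```

Because the function yields rows instead of returning a list, downstream code can consume, validate, or load records as they are produced, which is what makes large datasets tractable.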
Why you need automation in data transformation
As you can see, this stage isn’t simply a case of hitting a button and hoping for the best. Instead, it’s an iterative process that needs continual monitoring to ensure data consistency and quality throughout, and one that quickly becomes unsustainable if your teams are still doing it manually. This is why automation is crucial during the transformation stage of ETL.
In our next post, we’ll break down the third and final phase of ETL - L for Load.
For all you need to know about extracting, transforming and loading your data effectively and at scale, check out ETL: The Ultimate Guide.