What is ‘load’ in ETL?
You’ve extracted your raw data, and you’ve transformed it into actionable data, or what we call foundational data. The final phase of ETL involves loading this transformed data into the target destination, which is usually a data warehouse - such as Snowflake, Google BigQuery or Amazon Redshift - or a data lake.
L is for Load
Typically, a ‘full load’ is run first, which brings in all existing data, followed by periodic ‘incremental loads’ of any new or updated data. During an incremental load, the incoming dataset is compared with existing records to determine whether it contains anything new or changed. If so, the affected records are overwritten or new records are created.
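To make the difference between the two kinds of load concrete, here is a minimal sketch in Python. It uses an in-memory SQLite database as a stand-in for the warehouse; the `customers` table, its `id` and `email` columns, and the `full_load` and `incremental_load` functions are all hypothetical names chosen for illustration. A real warehouse would more likely use its own bulk-load and merge features, so treat this as an outline of the compare-then-write logic rather than a recommended implementation.

```python
import sqlite3


def full_load(conn, rows):
    """Initial load: rebuild the (hypothetical) target table from scratch."""
    conn.execute("DROP TABLE IF EXISTS customers")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany("INSERT INTO customers (id, email) VALUES (?, ?)", rows)
    conn.commit()


def incremental_load(conn, rows):
    """Compare each incoming row with the stored one, then insert or update."""
    counts = {"inserted": 0, "updated": 0, "unchanged": 0}
    for row_id, email in rows:
        existing = conn.execute(
            "SELECT email FROM customers WHERE id = ?", (row_id,)
        ).fetchone()
        if existing is None:
            # No record with this key yet: create a new one.
            conn.execute(
                "INSERT INTO customers (id, email) VALUES (?, ?)", (row_id, email)
            )
            counts["inserted"] += 1
        elif existing[0] != email:
            # Record exists but the value changed: overwrite it.
            conn.execute(
                "UPDATE customers SET email = ? WHERE id = ?", (email, row_id)
            )
            counts["updated"] += 1
        else:
            # Nothing new in this row, so leave the stored record alone.
            counts["unchanged"] += 1
    conn.commit()
    return counts


# In-memory database used here as a stand-in for the target destination.
conn = sqlite3.connect(":memory:")
full_load(conn, [(1, "a@example.com"), (2, "b@example.com")])

result = incremental_load(
    conn,
    [(2, "b.new@example.com"), (3, "c@example.com"), (1, "a@example.com")],
)
print(result)
```

Checking rows one at a time keeps the logic easy to follow, but it would be slow at warehouse scale, where a single set-based merge statement is the more usual choice.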
Factors to consider in data loading
While ‘load’ may seem like the most straightforward phase, it still raises plenty of questions.
The properties of the target destination: Where are you putting the data, and who is responsible for maintenance? Is the dataset correctly cataloged? What about archiving historical data? Do you have well-thought-out naming conventions?
Monitoring: How do you know if these processes have stopped running, and who’s monitoring them? How are alerts and outages communicated to the rest of the team?
Support for the business team: Does the business team, who rely on the dataset, know how to use it? What happens if some of the data fails to load, or is malformed? If a new source is needed, who does the business team need to contact? For example, if they find they want to integrate Snapchat data, how do they request this? Do they need to go back to the team taking care of the extraction phase?
Data governance: Is the dataset under any form of governance which requires certain rules to be followed? Is there any PII (Personally Identifiable Information)? How do you audit and regulate the data?
All of these factors, together with the scale of the data involved, mean data warehouse management can quickly become expensive and time-consuming. That’s when automation can be a game-changer for companies with growing data sets to load, access and manage.