What is an ETL process example?
To understand the complexities of ETL, it’s often easier to first examine an ETL process example. Here, we’ll take a detailed look at some aspects of an ETL process in a data warehouse environment.
What does an ETL process require?
Before considering an ETL process in a data warehouse, we first need to consider its requirements. Here are some of the most common.
Business needs: The business needs are the information requirements of the end users, which means identifying the data sources which must be introduced to the data warehouse. Gathering this information should involve interviewing the end users, and investigating possible data sources.
Compliance: Legal regulations require you to demonstrate that the information reported by the data pipeline is accurate, complete, and unaltered. This may include keeping copies of data sources and stages of data, proof of the transformation flow which changed data, documentation of the algorithms used, and proof of the security of data copies.
Data profiling: This is a necessary precursor to the five steps of the ETL process. Data profiling uses analytic methods to examine large volumes of data and understand the content, structure, and quality, then uncover any issues that need to be addressed. For example, a stocktake system may not contain the fields needed for the data warehouse.
What is ETL architecture?
An ETL architecture is the plan or blueprint used to build the ETL pipeline, and describes how data will flow from the sources, undergo its transformations, and load into the target locations.
Constructing an ETL architecture usually follows these steps:
Identification of business needs – What is the objective of the ETL pipeline? For example, increase financial processing efficiency, or improve marketing.
Selection of data sources – Which data will your pipeline need, and where is that data located?
Selection of target destination – It’s important for your completed data to end up in the most suitable location for it to be used. For example, you may need to choose between a cloud-based or on-premises data warehouse.
Choose between ETL vs. ELT – Sometimes it’s more suitable to load the extracted data directly into the destination before applying transformations. The suitability of this method will depend on the particular data pipeline being constructed.
Choose between batch vs. streaming – Batch processing is where the data is stored and processed in batches, usually periodically. Streaming processing is where data is extracted, transformed, and loaded continuously. Again, the suitability of each method will depend on the particular data pipeline.
Check data quality – Identify any data integrity issues, and the tools which can be used to correct them.
Plan maintenance & improvements – There will always be opportunities to correct issues and increase optimization, so it’s important to schedule ETL testing and review the process periodically.
What must be completed during the extraction step in the ETL process?
The extract step of the ETL process is more complex than it may initially appear. There are many factors to consider during the extraction phase, so here are some of the questions which need to be answered:
Timing is important – When do you extract the data? What timezone are the timestamps in the dataset?
Sometimes a source fails or is missing – Do you then re-pull data from previous days and backfill? What do you backfill and how? How often do you pull the data?
Sometimes extraction involves API calls to a data source – Do these have a rate limit? Does the API have a quota? What happens if the API is down? Do you know how the API works? Who maintains the code used to connect to the API? Do you have engineers who know how to manage sudden or unexpected API changes?
Security and regulatory compliance – What credentials are required to extract the data? Data extraction needs to be carried out securely - how is this security maintained? Who is responsible?
The scale of the data – How large are the datasets? Do you have the resources to extract that much, in terms of compute, storage, and scale, but also time and engineering resources? What happens when they grow in variety or volume?
All of these factors affect cost – Have you budgeted enough for the engineering resources involved in extracting all your data? Are you extracting too much data, leading to unnecessary warehousing costs?
The extraction step of the ETL process describes the relationship between source and target data, so cleaning and preprocessing are heavily involved. All these questions – and many more – must be answered comprehensively for extraction to be successful.
What are the different types of data loads?
There are two main types of load in ETL: ‘Full’ loading and ‘incremental’ loading. A full load moves all of the completed dataset to its target destination in the data warehouse, whereas an incremental load moves only the differences between the source and the target. Typically, a full load will be completed initially, and then incremental loads will be periodically performed to keep the finished dataset up-to-date.
As you can see, a full ETL process example is highly complex, and we have only scratched the surface here. You can learn more about the entire ETL process in our ultimate guide. Switchboard provides a ready-made data unification platform that takes care of ETL for you, so speak to us to find out how we can meet your business needs.