What is ‘extract’ in ETL?
In this blog series, we're focusing on ETL (extract, transform, and load), a process for unifying data. For some DataOps teams, the order is instead ELT (extract, then load, then transform).
However, ETL is not as simple as E, T and L. DataOps involves far too many moving parts to be described as a three-step process. The fact is, for each of the steps represented by these three letters, regardless of the number of data pipelines involved, there is a plethora of questions that need to be answered before you can get started.
Let’s start by looking at the letter E for Extract. What exactly does this part of the process involve and what questions do we need to ask before we can start extracting the data?
E for Extract
When people hear the term ‘extract’, they assume it simply refers to the process by which raw data is exported or copied from its original source, ready to be transformed, before it is loaded into its final destination. In reality, successful extraction of useful, actionable data refers to a whole lot more than this.
Factors to consider in data extraction
Timing is important: When do you extract the data? What timezone are the timestamps in the dataset?
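One common way to keep timezones from becoming a downstream problem is to normalise every timestamp to UTC at extraction time. Here is a minimal sketch of that idea; the `to_utc` helper and the assumption that the source emits naive ISO-8601 strings in a known timezone are illustrative, not a prescription:

```python
from datetime import datetime, timezone

def to_utc(ts: str, source_tz: timezone) -> datetime:
    """Parse a naive ISO-8601 timestamp from the source and return it in UTC.

    Assumes the source's timezone is known and fixed; if the source already
    emits timezone-aware timestamps, the replace() step should be dropped.
    """
    naive = datetime.fromisoformat(ts)          # e.g. "2024-01-01T00:00:00"
    aware = naive.replace(tzinfo=source_tz)     # pin it to the source's zone
    return aware.astimezone(timezone.utc)       # convert once, at the edge
```

Doing this once at the edge means transforms and loads never have to guess which zone a row came from.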
Sometimes a data source fails or is missing: Do you then re-pull data from previous periods and backfill? How often do you check for errors? Backfilling is a critical, yet in our experience often overlooked, requirement for data integrity. If glitches occur frequently, you need to consider how to backfill a) with the correct data, b) with data at scale, and c) immediately upon the connection being restored, not hours, days or weeks after the event. These considerations all fall under the extraction process, but when handled manually, they soon become repetitive, inefficient and error-prone, consuming person-hours to repair ever-growing volumes of data.
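The backfill considerations above can be sketched as a gap check over daily partitions: compare what should exist against what was actually extracted, then re-pull only the missing days. The `fetch_day` callback and the in-memory ledger here are hypothetical stand-ins for a real pipeline's state store:

```python
from datetime import date, timedelta

def find_gaps(extracted_days: set, start: date, end: date) -> list:
    """Return every day in [start, end] that has no extracted partition."""
    gaps = []
    day = start
    while day <= end:
        if day not in extracted_days:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps

def backfill(extracted_days: set, start: date, end: date, fetch_day) -> None:
    """Re-pull each missing day, oldest first, and record it once extracted."""
    for day in find_gaps(extracted_days, start, end):
        fetch_day(day)           # re-extract that day's data from the source
        extracted_days.add(day)  # mark the gap as closed
```

Running the gap check on every extraction cycle, rather than waiting for someone to notice missing rows, is what lets backfilling happen immediately after a connection is restored.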
Sometimes extraction involves API calls to a data source: Do these have a rate limit? Does the API have a quota? What happens if the API is down? Do you know how the API works? Who maintains the code used to connect to the API? Do you have engineers who know how to manage sudden or unexpected API changes?
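When an API imposes rate limits or goes down mid-extraction, the usual defence is retry with exponential backoff. A minimal sketch, assuming a `call_api` callable that raises `ConnectionError` on transient failures (real clients will surface rate limiting differently, often as an HTTP 429):

```python
import time

def extract_with_backoff(call_api, max_retries: int = 5, base_delay: float = 1.0):
    """Call the API, backing off exponentially on transient failures.

    Waits base_delay, then 2x, 4x, ... between attempts, so a briefly
    unavailable or rate-limited API is given room to recover.
    """
    for attempt in range(max_retries):
        try:
            return call_api()
        except ConnectionError:
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("API still unavailable after %d retries" % max_retries)
```

This only covers transient outages; quota exhaustion and breaking API changes still need the human ownership the questions above point to.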
Security and regulatory compliance: What credentials are required to extract the data? Data extraction needs to be carried out securely - how is this security maintained? Who is responsible?
The scale of the data: How large are the datasets? Do you have the resources to extract that much, not only in terms of compute, storage, and scale, but also time and engineering resources? What happens when they grow in variety or volume?
How will these factors affect the overall cost of your DataOps?: Have you budgeted enough for the engineering work involved in extracting all your data? Are you extracting too much data, leading to unnecessary warehousing costs?
All these questions – and many more – must be answered comprehensively for the extraction of your data to be successful. Is your team equipped to answer all of them effectively? If you're still attempting to manage this process manually, then possibly not, and it may be time to consider automating your ETL.
In our next post, we'll explore the ins and outs of the 'T' or 'transform' phase in ETL.