ETL

What is ‘extract’ in ETL?

Switchboard Apr 29

    In this latest blog series, we’re focusing on ETL (“extract, transform, and load”), a process for unifying data. For some DataOps teams, the order is instead ELT: extract, then load, then transform.

    However, ETL is not as simple as E, T, and L. There are simply too many factors involved to describe your DataOps as a three-step process. For each of the steps represented by these three letters – regardless of the number of data pipelines involved – there is a plethora of questions to answer before you can get started.

    Let’s start by looking at the letter E for Extract. What exactly does this part of the process involve, and what questions do we need to ask before we can start extracting the data?

    E for Extract

    When people hear the term ‘extract’, they assume it simply refers to the process by which raw data is exported or copied from its original source, ready to be transformed before it is loaded into its final destination. In reality, successful extraction of useful, actionable data involves a whole lot more than this.

    Factors to consider in data extraction

    Timing is important: When do you extract the data? What timezone do the timestamps in the dataset use?
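    For example, a common safeguard is to normalize every timestamp to UTC at extraction time. Here is a minimal Python sketch, assuming the source reports naive timestamps in a known local zone (the zone name below is a placeholder – check your source’s documentation):

    ```python
    from datetime import datetime
    from zoneinfo import ZoneInfo

    # Assumption: the source emits naive timestamps in its own local zone.
    SOURCE_TZ = ZoneInfo("America/New_York")  # placeholder; not a real source setting

    def normalize_to_utc(raw_ts: str) -> datetime:
        """Parse a naive source timestamp and convert it to UTC."""
        naive = datetime.fromisoformat(raw_ts)        # e.g. "2023-04-29T09:30:00"
        localized = naive.replace(tzinfo=SOURCE_TZ)   # attach the source's zone
        return localized.astimezone(ZoneInfo("UTC"))  # store everything in UTC

    print(normalize_to_utc("2023-04-29T09:30:00"))
    # 2023-04-29 13:30:00+00:00
    ```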

    Sometimes a data source fails or is missing: Do you then re-pull data from previous periods and backfill? How often do you check for errors? Backfilling is a critical yet, in our experience, often overlooked requirement for data integrity. If glitches occur frequently, you need to consider how to backfill a) with the correct data, b) with data at scale, and c) immediately upon the connection being restored – not hours, days, or weeks after the event. These considerations all fall under the extraction process, but when handled manually they soon become repetitive, inefficient, and error-prone, requiring many person-hours to repair an ever-growing number of rows.
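    To make the idea concrete, here is a hypothetical backfill loop in Python. It compares the dates already loaded against the dates expected and re-extracts any gaps; fetch_day and load stand in for your own extraction and loading code:

    ```python
    from datetime import date, timedelta

    def missing_dates(start: date, end: date, loaded: set[date]) -> list[date]:
        """Return every date in the window that has no loaded data."""
        days = (end - start).days + 1
        window = (start + timedelta(days=i) for i in range(days))
        return [d for d in window if d not in loaded]

    def backfill(start: date, end: date, loaded: set[date], fetch_day, load) -> None:
        """Re-pull and load each missing day as soon as the source is reachable."""
        for day in missing_dates(start, end, loaded):
            rows = fetch_day(day)   # re-extract the missing period from the source
            load(rows)              # write the recovered rows to the warehouse
            loaded.add(day)
    ```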

    Sometimes extraction involves API calls to a data source: Do these have a rate limit? Does the API have a quota? What happens if the API is down? Do you know how the API works? Who maintains the code used to connect to the API? Do you have engineers who know how to manage sudden or unexpected API changes?
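    Rate limits in particular are usually handled in code rather than by hand. As an illustration, here is a minimal sketch of a rate-limit-aware extraction call in Python, assuming a REST source that answers HTTP 429 when the quota is exceeded (the retry policy shown is illustrative, not a recommendation for any specific API):

    ```python
    import time
    import requests

    def fetch_with_backoff(url: str, params: dict, max_retries: int = 5) -> dict:
        """GET `url`, retrying with exponential backoff on 429/5xx responses."""
        for attempt in range(max_retries):
            resp = requests.get(url, params=params, timeout=30)
            if resp.status_code == 429 or resp.status_code >= 500:
                time.sleep(2 ** attempt)   # back off: 1s, 2s, 4s, 8s, 16s
                continue
            resp.raise_for_status()        # surface any other HTTP error
            return resp.json()
        raise RuntimeError(f"{url} still failing after {max_retries} retries")
    ```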

    Security and regulatory compliance: What credentials are required to extract the data? Data extraction needs to be carried out securely – how is this security maintained? Who is responsible?
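    One widely used pattern is to keep credentials out of the pipeline code altogether and read them from the environment (or a secrets manager) at runtime. A small sketch, with a placeholder variable name:

    ```python
    import os

    def source_auth_headers() -> dict:
        """Load the extraction credential from the environment; fail fast if absent."""
        token = os.environ.get("SOURCE_API_TOKEN")  # hypothetical variable name
        if not token:
            raise RuntimeError("SOURCE_API_TOKEN is not set; refusing to extract")
        return {"Authorization": f"Bearer {token}"}
    ```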

    The scale of the data: How large are the datasets? Do you have the capacity to extract that much – not only compute and storage, but also time and engineering effort? What happens when the data grows in variety or volume?
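    At scale, one answer is to extract in pages rather than all at once, so a large dataset never has to fit in memory. A sketch, where fetch_page stands in for your source’s paging API:

    ```python
    from typing import Iterator

    def extract_in_pages(fetch_page, page_size: int = 1000) -> Iterator[list[dict]]:
        """Yield one page of rows at a time instead of materializing everything."""
        page = 0
        while True:
            rows = fetch_page(offset=page * page_size, limit=page_size)
            if not rows:
                return              # no more rows: extraction is complete
            yield rows              # downstream transform/load consumes page by page
            page += 1
    ```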

    The overall cost of your DataOps: How will these factors affect it? Have you budgeted enough for the engineering work involved in extracting all your data? Are you extracting too much data, leading to unnecessary warehousing costs?

    All these questions – and many more – must be answered comprehensively for the extraction of your data to be successful. Is your team equipped to respond to all of them effectively? If you’re still attempting to manage this process manually, then possibly not, and it may be time to consider automating your ETL.

    In our next post, we’ll explore the ins and outs of the ‘T’ or ‘transform’ phase in ETL.

    In the meantime, check out our blog page for everything you need to know about ETL and data automation, or check out ETL: The Ultimate Guide.

    If you need help unifying your first- or second-party data, we can help. Contact us to learn how.
