• Navid Nassiri

Which ETL tool is best for Google BigQuery?


To unify data from multiple sources into a single location successfully, you need both an ETL (‘Extract, Transform, Load’) pipeline and a data storage solution.


Today, this is typically a cloud-based EDW (Enterprise Data Warehouse), one of the most popular being Google’s BigQuery.

Launched in 2011, BigQuery is an analytical data warehouse hosted entirely in the Google cloud, which avoids the need for local servers. If you’re already planning to use BigQuery for your data storage, the next step is to choose an ETL tool for building your pipeline.


Which tool is best for ETL?


BigQuery is a managed data warehouse that also provides inbuilt data analysis features, such as ML and a BI engine. You can then deploy your ETL tools in the data warehouse. But how does BigQuery compare to competitors, such as Snowflake and Amazon Redshift?


One of the main benefits of cloud ETL tools is storage flexibility, where the size of the data repository can be increased or decreased easily and quickly. Like BigQuery, both Snowflake and Redshift decouple storage and compute nodes so that storage and processing power can be changed independently. However, one of BigQuery‘s core strengths is speed, since it can query terabyte-scale datasets in seconds using SQL.


All three services offer on-demand pricing that’s based on the volume of data you query, so you only pay for the nodes you use. They also provide the ability to reserve nodes at a discount. With BigQuery, you can use Google accounts for permissions to define access for your users, which streamlines use as well as security.


While BigQuery is an excellent tool for consolidating and querying data, it won’t automatically deal with data that changes over time. So, once your data is in BigQuery, you’ll need a team of developers to manage it - or sophisticated automation tools. Also, since pricing is based on the amount of data you query, you ought to structure your data intelligently to keep costs to a minimum. For example, using daily tables will shorten your datasets by reducing the total rows per query for more cost-effective reporting.


What technology is used in BigQuery?


BigQuery is STaaS (Storage as a Service) so there are no databases or software to install or manage. Therefore, you don’t need to have an intricate knowledge of the underlying technology to use it. However, understanding some of the architecture and operation can help you to import your data and use the technology effectively. In addition, adopting best practices can maximize performance and minimize cost.


BigQuery is built on Dremel, an interactive query system for analyzing read-only nested data, used by Google internally since 2006. BigQuery incorporates the columnar storage and tree architecture of Dremel, but goes much further.


It also uses other Google technologies. For instance, Borg, a large-scale cluster management system, allocates compute capacity for jobs. Dremel reads data from the Colossus file system using the Jupiter network, before executing SQL operations.


You can import data into BigQuery using either batch or streaming loads. If you’re using an external data source, BigQuery will load on-the-fly into the Dremel engine. However, queries against external data sources are typically slower than native calls. Performance is also dependent on storage type. For example, queries against files in Google Cloud Storage will be faster than against those in Google Drive.


Does BigQuery use ETL or ELT?


BigQuery supports transformation both before and after loading, so you can use either ETL or ELT. However, the different approaches require different implementations. For example, when using ELT, transformations can be applied more efficiently using SQL once the data has been loaded into BigQuery. Many prefer to use ELT over ETL for BigQuery and other cloud data warehouses, since storage is cheap and you can easily transfer all of your raw data.


If you’d like to learn more about ETL, check out our ultimate guide. Switchboard provides a modern data unification platform in which you can outsource your ETL needs for better and faster results - contact us to learn more.

Recent Posts

See All