- Navid Nassiri
How to deploy DataOps: Step 4 - Automate for real-time reporting
For data-driven business teams, the value of being able to create distinct KPIs from foundational data should be unassailable, which raises the question: “Why aren’t we doing this now?”
The most common reason is operational capability. KPIs and foundational data can only be useful if the integrity and dependability of the underlying data are unquestionable. Achieving that is a complex technology problem, requiring real-time integration of heterogeneous data streams on top of a rock-solid, highly scalable operations platform.
To address these issues, a DataOps approach takes advantage of scalable technology, including cloud data warehouses, as well as software that monitors performance and validity at every step of the data pipeline.
Harnessing raw data with efficiency
Now that you have a prioritized list of data sources and an understanding of how each data source and KPI will be handled, the next step is to ensure they are managed in a cost-efficient and scalable manner. To do this, you’ll need to consider the following:
Monitoring: How will third-party API uptime be monitored to ensure reliable delivery?
Problem triage: Once we’re aware of a problem, how will we pinpoint whether it’s coming from the APIs, the data warehouse, or somewhere in between?
Data quality: If a segment of data fails to load, or some portion of it is malformed, how will we know? And how will we recover?
Data synchronization: How will we ensure that KPIs that depend on multiple sources have up-to-date components?
Change management: What happens when an API or data format changes?
Data scale: How will we scale up our processing capacity to handle event-level data that grows to billions of rows per month?
Operational re-use: Can the capabilities developed for one set of data sources be applied to all of our APIs and file-based sources, so that teams can collaborate using a single approach?
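To make the checklist above concrete, here is a minimal Python sketch of the kind of automated health check a DataOps pipeline might run against each load. The source names, row-count threshold, and freshness window are hypothetical, chosen purely for illustration; a real pipeline would draw these from configuration and alerting infrastructure.

```python
from datetime import datetime, timedelta, timezone

# Illustrative thresholds; real values would come from per-source config.
FRESHNESS_WINDOW = timedelta(hours=2)  # KPI inputs must be this recent
MIN_ROWS = 1000                        # flag suspiciously small loads

def check_load(source, row_count, last_loaded_at, now=None):
    """Return a list of problems found for one data source's latest load."""
    now = now or datetime.now(timezone.utc)
    problems = []
    if row_count < MIN_ROWS:
        problems.append(
            f"{source}: only {row_count} rows loaded (expected >= {MIN_ROWS})"
        )
    if now - last_loaded_at > FRESHNESS_WINDOW:
        problems.append(
            f"{source}: stale data, last load at {last_loaded_at.isoformat()}"
        )
    return problems

def kpi_ready(sources):
    """A KPI built from several sources is only ready when every component is healthy."""
    return all(not check_load(s, rows, loaded) for s, rows, loaded in sources)
```

A check like `kpi_ready` addresses the data-synchronization question directly: a multi-source KPI is published only when all of its inputs pass freshness and quality checks.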
All of these challenges require automation. And while you can try to build an expensive, highly specialized team to write and maintain custom infrastructure, or hire high-priced consultants to do the same thing, neither approach will deliver a long-term solution that takes advantage of DataOps best practices in a cost-effective way.
Using cloud data warehouses to consolidate data
Consolidating data streams into one place requires tools that scale affordably to handle growing amounts of data. In the past few years, commercial cloud-hosted solutions such as Google’s BigQuery, Snowflake, and Amazon RedShift have proven themselves best-in-class for this task.
However, some companies still invest in a “Do-It-Yourself” approach, using traditional IT tools, developing custom software, and staffing ops engineers to maintain on-premises systems.
Comparing cloud approaches: BigQuery, Snowflake, RedShift, DIY
BigQuery: Launched in 2011, Google’s BigQuery is an analytical data warehouse that can hold an effectively unlimited amount of data, completely hosted in the Google cloud, so there is no on-premises IT to manage. Strengths include lightning-fast speed: terabyte-scale datasets can be queried in seconds using SQL, a query language well understood by many data analysts. Pricing is based on how much data you query, so you only pay for the resources you use. Furthermore, you can use permissions based on Google accounts to define access for your users, which streamlines both use and security. While BigQuery is great for consolidating and querying data, it is not designed to handle data that must change over time, and once data is loaded, some developer expertise is required to manage it. Because pricing is based on how much data you query, the importance of understanding the nuances of your data structure can’t be overstated. For example, splitting event data into daily tables reduces the total rows scanned per query, making reporting more cost-effective.
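To illustrate the daily-tables point, here is a small Python sketch that builds a BigQuery standard SQL query over a set of daily tables using the `_TABLE_SUFFIX` pseudo-column, so only the days inside the reporting window are scanned and billed. The dataset name, table-naming scheme (`events_YYYYMMDD`), and schema are hypothetical.

```python
from datetime import date

def daily_table_query(dataset, start, end):
    """Build a query that scans only the daily tables in [start, end].

    Uses BigQuery wildcard tables plus _TABLE_SUFFIX pruning so that
    days outside the window are never read (and never billed).
    Table naming and columns here are illustrative.
    """
    return (
        "SELECT user_id, COUNT(*) AS events\n"
        f"FROM `{dataset}.events_*`\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'\n"
        "GROUP BY user_id"
    )

# One week of event data, rather than scanning the full history.
sql = daily_table_query("analytics", date(2021, 1, 1), date(2021, 1, 7))
```

Run against a month of billions of rows, a query pruned to seven daily tables touches only a fraction of the bytes, and under BigQuery’s pay-per-query pricing, that difference shows up directly on the bill.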
Snowflake: Officially launched in 2014, with an IPO in 2020, Snowflake has grown to become one of the leading enterprise data warehouses today. Snowflake offers a variety of cloud platform and pricing options, so you can choose the package that best suits your business needs. It can also scale instantly and provides more automated maintenance than some of the other data warehouses on the market.
RedShift: Launched in 2013, Amazon’s RedShift product is completely hosted on AWS. As an extension of the AWS ecosystem, it offers easy integration with other AWS products, but it will require some IT/admin expertise to manage. RedShift looks and acts more like a traditional database than BigQuery does, and you pay by instance, not by the query. This means you’ll most likely need an administrator to manage your instances, and administering large numbers of instances can be complicated and time-consuming.
Do-It-Yourself (DIY): One of the great things about working with enterprise technology is the can-do ethos that permeates our industry. Explaining the DIY approach in full is outside the scope of this blog post, but it’s important to address because a number of companies do elect to go in this direction. The benefit of a DIY solution is that you can tailor it to the unique demands of your business, to your existing IT infrastructure, and to your in-house developer expertise.
If you’re unsure which approach to take at this juncture, check out our next post, in which we’ll explore the ‘build’ vs ‘buy’ methods of data automation. In the meantime, you can revisit the previous posts in our DataOps series.