The Data Lakehouse: The Data Lake drops ACID

Analytics has long been highly siloed, dating from the days when dashboards from desktop BI tools, monthly reports, and SAS data mining addressed different stakeholders on different platforms. Those silos deepened when the ability to analyze “Big Data” became real in the early 2010s, as business analysts wouldn’t dare step into the world of Hadoop’s mysterious zoo animals, while data scientists decided that the traditional walled-garden data warehouse environment was too limiting.

Back then, while we were at Ovum, we penned a series of deep dives dissecting what this meant for enterprises. Our take was that the world of big data was for discovering “signals” in the data: performing exploratory analytics to identify what questions should be asked, with the results following one of two paths depending on the impact of the decisions.

If the decisions had to be auditable (e.g., granting financial credit), the data underwent all the essential cleansing, deduplication, and transformation processes to become curated, and was poured into a data warehouse. Here, it was essential to have confidence that the data was trustworthy, current, and consistent. The picture had to be precise. When analytics leads to business-critical decisions, ACID is table stakes.

If, on the other hand, the decisions were more about building leaderboards, identifying target audiences, or optimizing website or mobile experiences, the key was having a “good enough” picture from all of the data. It wasn’t essential that every record be consistent, on the assumption that with sufficiently large and diverse data volumes, the outliers would come out in the wash.

In fact, we came up with a visual metaphor for it. So-called “Big Data” was meant for getting the big picture, while traditional data warehouses and data marts were set up for getting the precise picture, as this 2014 chart from one of our Ovum reports showed.

Figure 1. Big Data view circa 2014

Source: Omdia (formerly Ovum)

What’s changed is that the cloud has blown through all of the barriers to scalability that limited traditional data warehouses, while data lakes, built on de facto standard file formats such as Parquet that structure the data and powered by open source engines like Spark, Drill, and Trino, are delivering query performance nearly on par with that of data warehouses. And if that’s the case, the question arose: why can’t we gain the same confidence in data sitting inside our data lakes that we have with data warehouses? The missing ingredient was ACID. Since data lakes already had formats for structuring data, the key was devising a software-defined table format that could sit atop file formats like Parquet, CSV, or JSON, with the data itself sitting in economical, highly durable cloud storage.
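To make that concrete, here is a minimal sketch of what a table format atop Parquet looks like in practice, using Delta Lake with PySpark. The bucket path and column names are illustrative, and the sketch assumes the delta-spark package is installed:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Assumes a Spark session with the Delta Lake extensions configured
# (e.g., via the delta-spark pip package).
spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
    .getOrCreate()
)

# Under the hood this is still just Parquet files in object storage,
# plus a transaction log that supplies the ACID guarantees.
events = spark.createDataFrame(
    [(1, "signup"), (2, "click")], ["user_id", "event"]
)
events.write.format("delta").save("s3://my-bucket/events")  # hypothetical path

# A transactional, concurrency-safe update -- the piece plain Parquet lacks.
table = DeltaTable.forPath(spark, "s3://my-bucket/events")
table.update(condition="user_id = 2", set={"event": "'purchase'"})
```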

Enter the data lakehouse: the hybrid that enables a data lake to behave and perform like a data warehouse. While Databricks and Snowflake may debate who coined the term, there is little doubt that Databricks popularized it when it began open sourcing Delta Lake (a process that took several years). In the interim, several open source data lakehouse table formats emerged, alongside proprietary formats introduced by several legacy players.
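As a quick illustration of “behaving like a data warehouse,” the same table can be queried with plain SQL, and an earlier snapshot can be read back via Delta Lake’s time travel. This continues the sketch above, using the same hypothetical path:

```python
# Register the Delta table from the sketch above for SQL access.
spark.sql(
    "CREATE TABLE IF NOT EXISTS events USING DELTA "
    "LOCATION 's3://my-bucket/events'"
)

# Warehouse-style SQL over files sitting in cloud object storage.
spark.sql("SELECT event, COUNT(*) AS n FROM events GROUP BY event").show()

# ACID also buys reproducibility: read the table as of an earlier version.
v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("s3://my-bucket/events")
)
v0.show()
```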

But more importantly, over the past year, the commercial ecosystem around open source data lakehouse formats, for now principally Delta Lake and Apache Iceberg, began crystallizing. It has become what will likely be the prime battleground between Databricks and Snowflake. We’re also starting to see some, as yet unpublished, benchmarks showing lakehouses reaching 80 – 90% of the performance of data warehouses. And we’re seeing unmistakable signs that open source will be where the action is.

Figure 2. Data Lakehouse Google searches

Source: Google Trends

This is all ahead of market awareness. Yes, as the chart shows, over the past year searches for data lakehouse have been building to a crescendo. But for now, the topic of data mesh is far more top of mind, as we’ve found from our LinkedIn posts over the past year. Nonetheless, with the technology pillars and the market ecosystem starting to fall into place, it’s time to ask how data lakehouses will reshape the analytics market. Because, in a few years, they will likely co-opt many enterprise data warehouses.

We’ve just concluded about half a year of research on the emerging data lakehouse market landscape, and are now making the results publicly available for free download. We have published the results in two versions: a marketplace overview that pieces together the landscape, and a deep dive edition, a superset that goes down into the weeds with technical analysis of the different open source alternatives. Click here to download your copy of the report.

Tony Baer