Blog

Are data meshes ready for prime time?

In the data world, few topics have taken over the conversation during the past year like data mesh. Just look at Google Trends data for the past 90 days: searches for data mesh far outnumber those for data lakehouse; about the only topic that comes close in search activity is data fabric.

The idea of data mesh originated with Zhamak Dehghani, director of next tech incubation at Thoughtworks North America. She introduced the idea back in 2019, and then followed with a drill-down into its principles in late 2020. In the coming year, it will culminate in a book (if you're interested, Starburst Data is offering a sneak peek).

And, not surprisingly, we’re already seeing data management vendors, from catalog to data pipeline and analytic platform providers, cast their products in a data mesh glow. Despite the technology hype, data mesh is not a technology stack. Rather, it is a process and architectural approach that delegates responsibility for specific data sets to domains, or areas of the business that have the requisite subject matter expertise to know what the data is supposed to represent and how it is to be used. In fact, keep the term domains fresh in your mind – it is central to what is supposed to drive the data mesh.

Over the years, we’ve seen conventional wisdom in the data management space swing like a pendulum between centralized and distributed data management. The era of minicomputers, PCs, and later, “open systems” elevated the role of the departmental database. But with the proliferation of departmental systems came a consolidation wave, as exemplified by Oracle Exadata. The going notion was, even if the database instances were separate, at least bring them under a common management umbrella.

Then came a series of Cambrian explosions sparked by the Internet that first scattered and then centralized data: the emergence of the LAMP stack, built around MySQL databases for modest web apps that didn’t need the armor of an Oracle (or even SQL Server) database, followed by the harnessing of Big Data, where conventional wisdom was to funnel all that miscellaneous data into Hadoop clusters where it could be brute-force processed through MapReduce. The subsequent emergence of smartphone and IoT data further proliferated data, and data sources, to the point where cloud storage became the dumping ground... err, de facto data lake. With cheap storage, serverless compute, and multimodel data warehouses, suddenly all that data became fair game for analytics and ML models. Being awash in data is nothing new, but the magnitude by which torrents of data are being generated and becoming accessible is spurring a crisis of plenty: if you’re a business analyst or data scientist, where do you start?

The data mesh emerged to restore context to data. At some point, somebody made a decision to collect that data, and according to the principles of data meshes, that's the person, group, or “domain” that should take ownership of it. And given that data is meant to be used, it should be treated as a “product.” As such, a data product is a lot more than just a dataset. A data product also encompasses the code for the data pipelines that generate and transform the data; the associated metadata; infrastructure (how and where the data is stored and processed); security (which classes of users are authorized to use the data and whether they can see actual or masked data); and lifecycle concerns. Product developers from the domains responsible for the data should have self-service platforms to manage the lifecycle, and governance should be federated: local ownership, subject to enterprise mandates.
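To make the idea concrete, the parts of a data product listed above might be sketched as a simple data structure. This is purely illustrative – the field names and values are our assumptions, not part of any data mesh standard or product:

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """Illustrative sketch of a data product; all fields are assumptions."""
    name: str                   # e.g. "customer-orders"
    owning_domain: str          # the business domain accountable for the data
    pipeline_code: str          # reference to the code that generates/transforms the data
    metadata: dict = field(default_factory=dict)        # schema, lineage, quality metrics
    infrastructure: dict = field(default_factory=dict)  # how and where the data is stored/processed
    access_policy: dict = field(default_factory=dict)   # which user classes see raw vs. masked data
    lifecycle_stage: str = "draft"                      # e.g. draft, published, deprecated

# A hypothetical product owned by a "sales" domain:
orders = DataProduct(
    name="customer-orders",
    owning_domain="sales",
    pipeline_code="pipelines/orders_etl.py",
    access_policy={"analyst": "masked", "data_engineer": "raw"},
)
print(orders.owning_domain)  # sales
```

The point of the sketch is that the dataset itself is only one field among many – code, metadata, infrastructure, security, and lifecycle travel with it.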

As recovering SOA addicts from the early 2000s, we were naturally cynical about the feasibility of deconstructing data management, just as SOA sought to deconstruct applications from monolithic architectures – abstracting the code (or data) from its physical implementation. Of course, it could be argued that with cloud-native architecture, SOA eventually won – although we now term it containers and microservices running on loosely coupled infrastructure. But, excluding companies that were born in the cloud, you will be hard-pressed to find an enterprise that has transformed and rearchitected all of its systems – both front and back end – to microservices. Instead, the apps being deployed cloud-native tend to be those that are most dynamic in their consumption of resources and their connections with different data and other microservices.

That is a broad hint to us that data meshes may work well for subsets of related scenarios. We’re a bit leery of how well the devolution of data management to a bottom-up process and architecture pattern will scale. The obvious potential issue is recreating, or formalizing, more silos – the last thing any organization needs. And related to silos is the likelihood that the boundaries of data ownership will be blurred, especially since organizations need a common source of truth – datasets are more likely than not to be shared, and with that comes questions of ownership.


Then there's the matter of governance: the data mesh calls for local domains, which know what the data is and how it’s supposed to be used, to take the initiative on governance. But of course, that can’t come at the cost of violating corporate mandates or external regulations. Dehghani calls for federated governance – a necessity that will take significant trial and error for organizations to get right.
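One way to picture federated governance is as a merge of local and enterprise policy, where the domain decides everything the enterprise is silent on but a corporate mandate always overrides a more permissive local choice. The sketch below is our own illustration under those assumptions – the policy vocabulary ("raw"/"masked") and mandate table are hypothetical:

```python
# Hypothetical enterprise-wide mandates: column name -> required handling.
# In a real organization these would come from corporate or regulatory policy.
ENTERPRISE_MANDATES = {
    "ssn": "masked",
    "email": "masked",
}

def effective_policy(domain_policy: dict) -> dict:
    """Merge a domain's local access policy with enterprise mandates.

    The domain's choices stand wherever the mandates are silent, but a
    mandate overrides the local setting for any column it covers.
    """
    merged = dict(domain_policy)
    for column in merged:
        if column in ENTERPRISE_MANDATES:
            merged[column] = ENTERPRISE_MANDATES[column]  # federated: mandate wins
    return merged

# A sales domain tries to expose SSNs raw; the mandate forces masking.
sales_policy = {"ssn": "raw", "order_total": "raw"}
print(effective_policy(sales_policy))  # {'ssn': 'masked', 'order_total': 'raw'}
```

The trial and error comes in deciding where that dividing line sits – which decisions are genuinely local and which must be mandated from the center.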

Data meshes have often been compared to data fabrics: data meshes are distributed views of the data estate, whereas data fabrics apply more centralized approaches to building a common metadata backplane. In all fairness, this is a false dichotomy; you can have distributed governance fed by a common metadata backplane. And just as we wonder whether the data mesh can scale, we can apply the same critique to data fabrics: can they really cover everything?

There's a good reason that we’re having this debate. Centralized architecture, such as an enterprise data warehouse, data lake, or data lakehouse, can't do justice in a polyglot world. The idea of a monolithic enterprise data warehouse or data lake just falls apart when you look at the multiplicity of data sources inside and outside the organization. It’s all data, and it comes from many places. The data lakehouse is an idea designed to transcend the limitations of data warehouses and data lakes, but in reality, it’s just one more node in the enterprise data estate.

Our take is that the data mesh raises important issues. Let’s not get caught up in the notion of total transformation. As noted above, most organizations that have embraced cloud-native microservices-based architectures have done so only with a portion of the stack – the part that is the most dynamic. And we’d suggest the same for data meshes; we don’t see monolithic blocks of data for ERP systems suddenly getting refactored into the federated governance of the data mesh.

Furthermore, as most of the pieces for the data mesh have yet to fall into place, the answer to the question of whether it is ready for prime time is that, at this point, it should be explored through proofs of concept – and at that, with a few interrelated data domains.

Note: For a more thorough airing of data meshes, check out our ZDNet post.


Tony Baer