Blog

What we’ve learned from Hadoop

The spring cleaning of dormant Hadoop projects touched a nerve. It’s been fashionable to say that “Hadoop is dead” for some time – at least since Gartner published studies showing declining use as of 2015. ZDNet colleague Andrew Brust’s post on the project purge went positively viral.

We spent almost the entire decade of Hadoop at Ovum, and wound up putting together an extensive research portfolio tracing Hadoop’s rise and descent. So, we thought this would be a good opportunity to mine the history from those reports and construct a narrative. Based on that narrative, last week we put out a post on Hadoop’s legacy. Credit Hadoop for taking down our fears of what we used to term big data and triggering a virtuous cycle of innovation that has resulted in a rich landscape of the data-oriented analytic services we have today.

Now comes the next step: looking at the history of Hadoop, what lessons can we learn as we dive into the coming era where data, compute, storage, services, and bandwidth are plentiful, and where the default for analytics compute is becoming cloud-native? More specifically, what lessons have we learned about the nature of innovation in the data space? How can we judge innovation, and be open to failing fast? And what will be the role of open source?

Keep in mind that, in the grand scheme of things, taking advantage of innovation is small potatoes compared to the meta challenges of fairness – ensuring representation beyond the narrow world of white and Asian males – that will define our missions. In this post, however, we’re sticking to the challenges of digesting open source innovation.

Looking over the decade, it was a routine of two steps forward and one step back. That’s to be expected. Innovation won’t roll out based only on safe bets; there must be challenges that extend the envelope, and a willingness to fail fast when some paths dead-end.

Rewinding the tape to the early 2000s, Google shared the insights, but not the technology, of its innovations when it published research papers on the Google File System and MapReduce. Those innovations were pivotal in addressing two critical limitations to analytics. With a file system designed for distributed storage on commodity hardware, you could turn storage into a cheap utility – a far more cost-effective alternative to the EMCs and NetApps of the day (which initially tried emulation and have since moved to more tiered approaches). And with MapReduce, you could solve the compute bottleneck with a scale-out architecture that delivered the almost perfect linearity elusive from the scale-up architectures prevalent at the time. Analyzing multiple terabytes and petabytes of data was now within the realm of the possible.
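The programming model that Google’s MapReduce paper described can be boiled down to a toy word count. This is an illustrative single-process sketch, not Hadoop’s actual API – in a real cluster, the map, shuffle, and reduce phases run distributed across many machines, which is precisely what makes the model scale out:

```python
# Toy sketch of the MapReduce model: independent map tasks emit
# key/value pairs, the framework shuffles them by key, and reduce
# tasks aggregate each group. Single-process for illustration only.
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) for every word. Each document is mapped
    # independently, so map tasks can run in parallel anywhere.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(mapped_pairs):
    # Group intermediate pairs by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["the elephant in the room", "the yellow elephant"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
# counts["the"] == 3, counts["elephant"] == 2
```

The near-linear scaling comes from the fact that nothing in the map phase depends on any other map task; the only global coordination is the shuffle.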

But how did we do it? The answer reveals a lot about how the state of best practice constantly changes. At the time, the data grew so massive that it caused bottlenecks in the existing n-tier architectures of the day, so instead we brought compute to the data. Fast forward a decade, and cloud infrastructure had matured to the point where fast connectivity, compute instances optimized for specific workloads, and microservices made elasticity possible – why not pay for only as much or as little compute as you need? It became time to once more separate compute from storage. We weren’t wrong the first time; it’s just that the changing state of the art changed the right answer. Lesson learned? Don’t assume that any new best practice is set in stone. Hold that thought until we touch on schema on read later in this post.

What about the role of open source? More specifically, the model applied to Hadoop was community- rather than vendor-based open source. We’ve written extensively on the differences between the models. Community-based open source helped Hadoop go viral with the same early adopter organizations that, a few years earlier, had embraced grid computing. Maybe Google didn’t share GFS or MapReduce, but its research allowed the folks at Facebook, Yahoo, LinkedIn, Twitter, and other digital natives to engage in clean room development and contribute to open source. And it proved auspicious for Doug Cutting and Mike Cafarella, who at the time were developing a hyperscale search engine, to organize the community under a new project that Cutting named after his young son's elephant doll.

Community-based open source was key to Hadoop's emergence, but it was also key to Hadoop’s obstacles. The upside was that no vendor controlled the technology, but that was also the downside. With community open source, you had a meritocracy based on crowdsourced innovation; that’s the way the process is supposed to work. Projects get accepted based on their technical merits and their ability to draw a sufficient cadre of contributors.

But given the competitive dynamics that also emerged in the community, the project often got distracted from the most important challenges facing it. Hadoop, as a collection of multiple disparate projects – often termed “zoo animals” – proved stubbornly difficult for IT organizations to implement, even when they got commercially supported distros.

Instead, we wound up with a competing array of overlapping projects in areas ranging from streaming to security, data governance, and data access. For instance, at one point, there were well over a dozen separate SQL-on-Hadoop frameworks, both open source and otherwise. Competition should be a good thing – it provides incentive for project teams to design better mousetraps. But in the Hadoop ecosystem it also led to confusion and distraction from facing what was the platform’s biggest drawback: its sheer complexity. To developers, it’s typically more exciting to pursue new frontiers than to do the blocking and tackling of getting things to work together or making them easier to use.

The peaking of Hadoop – and related technologies – also has something to teach us. In 2014, the year that the Strata conference debuted in New York's Javits Center, Mike Olson in his keynote called for Hadoop to ultimately “disappear.” It was a prophetic comment. He didn’t mean for Hadoop to disappear from the market, but instead, to hide in plain sight as the engine under the hood that is supporting your data science and delivering your analytics. It eventually happened, but not before competing pathways eclipsed it.

That year, we saw an explosion of third parties on the expo floor as new ecosystems of tooling emerged, designed for the diverse, undisciplined data sets that populated Hadoop. This is the classic case of two steps forward, one step back. We saw the emergence of tools addressing data preparation and data cataloging, and while many of the vendors pioneering these technologies have since been acquired, the capabilities have become table stakes because they addressed real needs: the need to bring self-service to making data consumable and to map out the data assets sitting in the data lake. Others, such as the first wave of BI-style analytics tools specially designed for Hadoop, eventually landed with a thud because, as it turned out, business analysts were more interested in having their existing self-service tools add the capability to get inside the data lake.

As we noted in our ZDNet post, the innovation unleashed by Hadoop literally fed on itself. This is the theme that would take over the dialog. As Hadoop proved that big data was possible, it opened the floodgates to new approaches tackling the speed bumps that typically crop up with first-generation technologies. This is the virtuous cycle we were talking about at the top of the post.

Spark was the first shot – it addressed one of Hadoop’s biggest vulnerabilities: the lack of real-time or interactive processing. MapReduce solved the riddle of linear scalability, but Spark took it a step further by finding new approaches to marshalling data and exploiting memory, which was becoming more viable at scale thanks to ongoing price/performance trends.
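The memory advantage is easiest to see with iterative workloads. A chain of MapReduce jobs must write each stage’s output to disk and re-read it on the next pass, while Spark can keep the working set cached in memory between iterations. A toy simulation (this is a plain-Python illustration of the I/O pattern, not Spark code; `load_from_disk` is a hypothetical stand-in for reading from storage):

```python
# Contrast the I/O patterns: chained MapReduce-style jobs reload their
# input from storage on every iteration, while a Spark-style cached
# dataset is loaded once and iterated over in memory.
disk_reads = 0

def load_from_disk():
    # Hypothetical stand-in for a read from distributed storage;
    # we just count how often it happens.
    global disk_reads
    disk_reads += 1
    return list(range(1000))

# MapReduce-style: each of 3 iterations re-reads its input from disk.
for _ in range(3):
    data = load_from_disk()
    total = sum(data)
mapreduce_reads = disk_reads  # 3 reads

# Spark-style: load once, cache in memory, iterate over the cached copy
# (roughly what persisting an RDD or DataFrame buys you).
disk_reads = 0
cached = load_from_disk()
for _ in range(3):
    total = sum(cached)
spark_reads = disk_reads  # 1 read
```

For machine learning loops that pass over the same data dozens of times, that difference in disk round trips is where the interactive speeds came from.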

But Spark, too, was initially overhyped as the souped-up MapReduce replacement that could do just about everything: predictive analytics, model building, sensor data processing, real-time fraud detection, and data engineering. As it turned out, data engineering proved the sweet spot: Python developers had other ideas (Python itself has a rich portfolio of compute libraries), and many AI workloads were too compute-heavy to be efficiently handled by Spark (which was more IOPS-focused). Streaming alternatives also emerged, whether from Flink, an open source project designed expressly for streaming, or from cloud vendors. But even with Spark’s narrowing of focus, don’t weep too many tears for Databricks, which currently has over a billion dollars in the bank. Spark’s success proves that, even when the hype dies down, new technologies can certainly thrive if they provide the right solution to the right set of problems.

Since then, we’ve seen a multitude of other options emerge. In a world where multimodel databases are now a checkbox item and cloud-native (with its separation of compute and storage, refactoring into containers and microservices, and the emergence of Kubernetes) is becoming the de facto standard for next-generation cloud services, the paths to what we used to term “big data” analytics have multiplied – allowing enterprises to choose the right tool for the problem. If your workloads are varied, you can choose a multi-purpose analytic service; if instead you have different groups solving different problems with different sets of skills, fit-for-purpose services (e.g., Spark, streaming, AutoML) may fit the bill.

The good news is that, with compute and storage separated, you can have it both ways. The world won't be fixated on the trials and tribulations of a single community of projects.

But instead of a 16-ton elephant, there’s now a 16-ton gorilla in the room: the question of governance. This is where we’re going to see a lot of back and forth between centralized and decentralized models. There will be new lessons to learn here, as the state of governance over what are loosely called data lakes (which can be scoped as narrowly as what sits in cloud storage, or as broadly as what sits anywhere, including the edge) is still immature. We’re seeing some interesting first steps with tagging standards (tagging could provide the trail of breadcrumbs that allows organizations to apply governance) such as Apache Atlas: IBM, Microsoft, and Cloudera are supporting it, while Snowflake is going off in its own direction on tagging.

Meanwhile, we’re hearing a lot of noise about a new way of managing the lifecycle of data with data meshes. This is more about practice than technology: individual groups take ownership of data and manage its full lifecycle as data products. But there will be clear technology ramifications when we deconstruct the imperfect and immature model of managing (or not managing) the data lake. And by the way, remember when we mentioned schema on read above? Depending on how you interpret the requirement that data be treated as a product, taking data meshes seriously might imply that data be transformed when it goes into the lake. Or maybe not. But just as with storage, that illustrates that no technology design pattern or best practice is set in stone.
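The schema-on-read idea itself is simple enough to sketch: raw, heterogeneous records land in the lake untransformed, and a schema is imposed only at the moment a consumer reads them. This toy (the record shapes and the `read_with_schema` helper are made up for illustration) shows the pattern that a data-product discipline would instead push upstream to ingest time:

```python
# Schema-on-read sketch: raw records land in the "lake" as-is, with
# inconsistent types and stray fields. The consumer applies its own
# schema at read time: coercing types and dropping what it doesn't need.
import json

raw_lake = [
    '{"user": "alice", "amount": "42.50"}',            # amount as string
    '{"user": "bob", "amount": 17, "note": "refund"}', # extra field
]

def read_with_schema(record):
    # The consumer's schema, applied at read time: coerce types,
    # keep only the fields this consumer cares about.
    data = json.loads(record)
    return {"user": str(data["user"]), "amount": float(data["amount"])}

rows = [read_with_schema(r) for r in raw_lake]
# rows[0]["amount"] == 42.5 despite being stored as a string
```

Under a strict data-product reading of the mesh, that coercion logic would run once, on write, so every downstream consumer sees the same clean shape – which is exactly the schema-on-write posture the lake era moved away from.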

So, what have we learned from Hadoop? As we look toward a dispersed data future in the cloud, perhaps the best lesson we can draw from the trials and tribulations of Hadoop’s evolution and devolution is to keep focused on the problem without getting hung up on which way the technology popularity contests of the moment are headed. And, while open source will continue to matter, as we’ve pointed out in a recent post, the experience that the product or service provider delivers will matter even more.

Tony Baer