Data 2024 Outlook: Data meets Generative AI

At the beginning of last year, who knew that Generative AI (Gen AI) and ChatGPT would seize the moment? A year ago, we forecast that data, analytics, and AI providers would finally get around to simplifying and rethinking the Modern Data Stack, a topic that's been near and dear to us for a while. There was also much discussion and angst over data mesh as the answer to data governance in a distributed enterprise. We also forecast the rise of data lakehouses. For the record, last year’s predictions are here and here. Turns out, many of them came true, but one thing we didn’t predict was the emergence of Gen AI.

So how will all this play out in 2024 for data?

On the database side, we’re seeing a flight to safety. There is scant appetite for new database startups in a landscape that still counts hundreds of engines, yet whose top 10 most popular ones remain largely stable. The usual suspects in the relational world, including Oracle, SQL Server, and various dialects of MySQL and PostgreSQL, continue to dominate, along with giants in the non-relational world from MongoDB, Elastic, Redis and the hyperscalers.

We’ll go out on a limb and state that the longer tail has limited prospects for growth. Couchbase is a good example of a second-tier player that, having recovered from a lost decade, has managed to eke out respectable growth, but will never catch up in market share with MongoDB, with which it once vied. Beyond this group, we see scant prospects for 2010s-vintage startups like CockroachDB, Yugabyte, or Aerospike displacing the established order.

Gen AI will drive database innovation in several ways over the coming year. While specialized vector databases like Pinecone and Milvus have emerged, we believe that the bulk of the action will occur in the operational databases that enterprises already use. We believe that vector storage is a feature. Growing reliance on Retrieval-Augmented Generation (RAG) requires a way to persist the vector embeddings that foundation models require, and to search them.
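To make the "vector storage is a feature" idea concrete, here is a minimal sketch that keeps embeddings in the same table as ordinary columns and filters on both. It uses SQLite purely as a stand-in for an operational database; the table layout, the toy three-dimensional embeddings, and the helper names (build_store, search) are all assumptions made for illustration, and a real engine would use a native vector type and index rather than JSON text and a brute-force scan.

```python
import json
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_store(rows):
    """Create an in-memory table holding tabular columns plus an embedding."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE docs ("
        "id INTEGER PRIMARY KEY, region TEXT, body TEXT, embedding TEXT)"
    )
    conn.executemany(
        "INSERT INTO docs (region, body, embedding) VALUES (?, ?, ?)",
        [(region, body, json.dumps(vec)) for region, body, vec in rows],
    )
    return conn


def search(conn, query_vec, region, k=2):
    """Filter on a tabular column in SQL, then rank the survivors by similarity."""
    cur = conn.execute(
        "SELECT body, embedding FROM docs WHERE region = ?", (region,)
    )
    scored = [(cosine(query_vec, json.loads(emb)), body) for body, emb in cur]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [body for _, body in scored[:k]]


# Hypothetical rows: (region, text, toy embedding).
ROWS = [
    ("EU", "refund policy", [0.9, 0.1, 0.0]),
    ("EU", "shipping times", [0.1, 0.9, 0.0]),
    ("US", "refund policy (US)", [0.9, 0.1, 0.1]),
    ("EU", "warranty terms", [0.7, 0.2, 0.1]),
]

conn = build_store(ROWS)
top = search(conn, [1.0, 0.0, 0.0], region="EU", k=2)
print(top)
```

In a RAG pipeline, the rows returned by a search like this would be passed to the foundation model as context for the prompt.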

But the types of vector indexes will start to vary, and this is where databases will differentiate for Gen AI. Most databases adding vector storage are starting out with basic vector indexing that is not yet optimized for stringent SLAs. That is about to change. But there are different ways to optimize the similarity searches that Gen AI queries perform. You can go for low-cost “low recall” searches that provide quick and dirty answers, or “high recall” where you need more comprehensive results, and then there are variants that deal with storage footprint (in-memory vs. scale out, distributed), and so on. We also expect innovation for in-database orchestration of generative queries that also pull in tabular data.
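The recall trade-off can be sketched in a few lines. The example below compares an exact scan with a toy bucketed index that only probes the buckets nearest the query; probing fewer buckets is cheaper but can miss true neighbors. This is a simplified illustration in the spirit of inverted-file indexes, not any vendor's implementation, and the random data, bucket count, and function names are assumptions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_top_k(vectors, query, k):
    """Brute-force scan: always returns the true k nearest neighbors."""
    order = sorted(range(len(vectors)), key=lambda i: sq_dist(vectors[i], query))
    return order[:k]


def build_buckets(vectors, centroids):
    """Assign every vector to the bucket of its nearest centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        buckets[nearest].append(i)
    return buckets


def approx_top_k(vectors, centroids, buckets, query, k, n_probe):
    """Search only the n_probe buckets whose centroids are closest to the query."""
    probe = sorted(
        range(len(centroids)), key=lambda c: sq_dist(query, centroids[c])
    )[:n_probe]
    candidates = [i for c in probe for i in buckets[c]]
    candidates.sort(key=lambda i: sq_dist(vectors[i], query))
    return candidates[:k]


def recall(approx, exact):
    """Fraction of the true neighbors that the approximate search found."""
    return len(set(approx) & set(exact)) / len(exact)


rng = random.Random(0)  # fixed seed so runs are repeatable
vectors = [[rng.random() for _ in range(8)] for _ in range(500)]
centroids = rng.sample(vectors, 10)  # crude choice: random vectors as centroids
buckets = build_buckets(vectors, centroids)
query = [rng.random() for _ in range(8)]
truth = exact_top_k(vectors, query, k=10)

for n_probe in (1, 3, 10):
    hits = approx_top_k(vectors, centroids, buckets, query, 10, n_probe)
    print(f"n_probe={n_probe:2d}  recall={recall(hits, truth):.2f}")
```

Probing all 10 buckets degenerates into the exact scan, so recall reaches 1.0 at that setting; smaller n_probe values trade some of that recall for less work per query.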

Data and AI governance will start converging with data and model lineage. Today, data governance and AI governance are separate toolchains, run by different practitioners. The challenge is messy, especially since data governance in most organizations is disjointed and overlapping, where it occurs at all. In fact, the data governance muddle is what gave rise to all the hand-wringing over data mesh and data products over a year ago, but that's another can of worms.

Meanwhile, AI governance has emerged in spurts, focusing on tracking model lineage, auditing, risk management, compliance, and in some cases, explainability. Gen AI has compounded the challenge, requiring more attention to citation of data sources and introducing new issues such as detecting (and enabling deletion of) toxic or libelous language, to provide a couple of examples.

The challenge, of course, is that with AI, models and data are intertwined. The performance, safety, and compliance of a model are directly linked to the data it has been trained on, and the data on which it is generating answers. There is the question of drift; data and models can drift independently or interdependently. Data sources may change, and the trends in what the data is revealing may require the model in turn to adapt. You don't want to be solving yesterday’s problem with today’s data or vice versa. In the coming year, we expect AI governance tools to start paying attention to data lineage because that is the logical connection point. It is the point where the audit trails can begin, assessing which version of which model was trained on what version of what data, and who the responsible parties are that own and vouch for those changes.
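As a rough illustration of what such a connection point might record, the sketch below ties a model version to the dataset version it was trained on and to an accountable owner, then answers the two audit questions raised above. The record fields, class names, and sample entries are hypothetical and do not reflect any particular governance tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageRecord:
    """One audit-trail entry: which model version used which data version."""
    model: str
    model_version: str
    dataset: str
    dataset_version: str
    owner: str  # party that owns and vouches for this training run


class LineageLog:
    """Append-only log of lineage records with two simple audit queries."""

    def __init__(self):
        self._records = []

    def record(self, entry):
        self._records.append(entry)

    def trained_on(self, model, model_version):
        """Which data (and which owner) stands behind a given model version?"""
        return [
            r for r in self._records
            if r.model == model and r.model_version == model_version
        ]

    def models_affected_by(self, dataset, dataset_version):
        """Which model versions must be reviewed if this data version is suspect?"""
        return sorted(
            {
                (r.model, r.model_version)
                for r in self._records
                if r.dataset == dataset and r.dataset_version == dataset_version
            }
        )


# Hypothetical entries for illustration only.
log = LineageLog()
log.record(LineageRecord("churn-model", "1.0", "customers", "2023-10", "alice"))
log.record(LineageRecord("churn-model", "1.1", "customers", "2023-12", "alice"))
log.record(LineageRecord("upsell-model", "0.3", "customers", "2023-12", "bob"))

affected = log.models_affected_by("customers", "2023-12")
print(affected)
```

If a data source drifts or is found to be flawed, a lookup like models_affected_by gives the list of model versions, and through trained_on the owners, that need attention.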

Gen AI will continue enriching data discovery and governance. Transforming natural language query from keywords to conversational prompts is the low-hanging fruit, and we are already seeing early examples such as ThoughtSpot Sage, Databricks LakehouseIQ, and Amazon Q in QuickSight that pick up where keyword-oriented predecessors like Tableau Ask Data left off.

We expect that natural language will come to a variety of functions around the blocking and tackling associated with the data lifecycle, from cataloging data to discovering, managing, governing, and securing it. Atlan, a data catalog provider focusing on DataOps, provides good glimpses of what to expect. Starting with conversational search, it automatically discovers database metadata and can spit out documentation in plain English. On the horizon, we could imagine this capability being extended to pinpointing gaps or omissions in reference data, for example, or highlighting places where risks, such as exposing PII data, occur. That’s just for starters.

Gen AI will streamline database design. Following in the footsteps of automatic code generation or guidance, Gen AI could scan requirements documentation for an application and spit out a candidate schema and E-R diagrams. And by scanning the actual corpus of data, Gen AI could spin out synthetic data. And it could optimize index generation based on natural language queries. Again, these are just a few examples.

Longer term, we could see Gen AI supplementing database tasks that ML is already performing, such as with index creation, error and outlier detection, and performance tuning. But let’s not get carried away, because these benefits will be far more incremental.

All this makes us wonder. What will we get surprised by in 2024?

This post is excerpted from our year-ahead forecast published in Silicon Angle. Click here for the full report.

Tony Baer, January 3, 2024
