
How Generative AI has reshaped the data and analytics world

What a difference a year makes. At the beginning of the year, if you asked anyone outside the AI research community about Generative AI, you would have gotten a blank stare. Our first-quarter briefings with data and analytics vendors barely made mention of Large Language Models (LLMs) or vector storage.

But in the meantime, ChatGPT, unleashed by OpenAI late last year, went viral. In place of searching with keywords, people could type questions in conversational English and get an answer back. Forget about the fact that the quality of ChatGPT's responses was on the level of high school seniors typing answers on their exams without citations. Being able to communicate with your computer in English captured everyone's imagination. Heck, it wouldn't be a conceptual leap to add voice-to-text recognition and turn the whole thing into a Captain Kirk-like experience.

And not surprisingly, by April, the data and analytics community took notice. It's not the first time that enterprise vendors have had to take their cues from consumers.

Over the spring conference season, we had the chance to spend time with Databricks, DataStax, IBM, MongoDB, Oracle, SAP, SAS, Snowflake, and Teradata, and with some downtime recovering from a sports injury-related medical procedure this summer, we were able to collect our thoughts. With Generative AI suddenly popping onto the agenda, what were the common themes we heard? And once we put them together, we found a huge, glaring omission.

The headline was that the English language is becoming the most popular API. For global audiences, that will subsequently expand to the spoken language of choice. A couple of weeks back, Google announced plans to scale LLMs to hundreds of the world's languages. But we digress.

So, we saw a lot of examples of Generative AI turbocharging what we used to call "natural language query" into something more conversational. The result is a far less robotic experience: in place of the keywords or pointers to specific columns that drove earlier machine learning models to turn query and response into prose, Generative AI has the potential to make the process truly conversational. Emerging services such as ThoughtSpot Sage, Snowflake Document AI, or Databricks LakehouseIQ could pick up where Tableau Ask Data left off.
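To make that concrete, here is a minimal sketch of how such a conversational analytics layer might work under the hood: the table schema and the running conversation are packed into a prompt, and the model is asked to answer with SQL. The `complete()` function is a hypothetical stand-in for whatever LLM endpoint a vendor actually calls; none of these names are drawn from a real product API.

```python
# Minimal sketch of conversational natural language query (NLQ).
# complete() is a hypothetical stand-in for a real LLM call; the
# products named above (ThoughtSpot Sage, LakehouseIQ, etc.) differ.

SCHEMA = """
CREATE TABLE orders (order_id INT, customer_id INT,
                     order_date DATE, total_usd DECIMAL);
"""

def complete(prompt: str) -> str:
    """Stand-in for an LLM completion endpoint (hypothetical)."""
    # A real implementation would call a hosted or local model here.
    return ("SELECT customer_id, SUM(total_usd) AS revenue "
            "FROM orders WHERE order_date >= '2023-01-01' "
            "GROUP BY customer_id ORDER BY revenue DESC LIMIT 10;")

def ask(question: str, history: list[str]) -> str:
    """Build a prompt from the schema plus conversation history."""
    context = "\n".join(history)
    prompt = (f"Schema:\n{SCHEMA}\n"
              f"Conversation so far:\n{context}\n"
              f"Question: {question}\n"
              "Answer with a single SQL statement.")
    history.append(question)
    return complete(prompt)

history: list[str] = []
print(ask("Who were our top ten customers this year?", history))
# A follow-up like "now just for Q2" reuses the same history, which
# is what makes the experience conversational rather than one-shot.
```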

The other obvious conversational language use case is coding. Generative AI can pick up where traditional autocomplete leaves off. We've seen a flurry of announcements from AWS (CodeWhisperer); IBM (Watson Code Assistant); Microsoft (GitHub Copilot); Databricks (English language Spark SDK); and others bringing out services that can do everything from filling in a missing piece of code to generating all of it from a declarative, conversational request. Many of these services will also scan for bugs, security gaps, bias, and privacy issues, and, for databases, help structure them. No matter how many coders computer science programs turn out, there are never enough of them, and coding backlogs keep growing. Generative offers huge potential for a breakthrough here, but it will not replace human coders or make all their decisions. The flip side, of course, is that there are serious IP protection questions for code creators who believe that generative will rip off their work. It's much the same concern roiling creative content providers, and it is a major issue in the current Hollywood writers and actors strikes.
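For illustration only, this is the kind of fill-in these assistants perform: the developer supplies a signature and docstring, and the assistant proposes a body. The completion below is hand-written to show the pattern, not output from any of the products named above; real suggestions vary by tool and context.

```python
# Developer writes the signature and intent...
def median(values: list[float]) -> float:
    """Return the median of a non-empty list of numbers."""
    # ...and a code assistant proposes a body like this one
    # (hand-written stand-in, not actual tool output):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

assert median([3.0, 1.0, 2.0]) == 2.0
```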

Don't fool yourself; Generative AI will require heavy iron. So it's a good time to be Nvidia. Here's the technical reason, and here's what the street says. Generative AI will require a lot of data, but more specifically, relevant data. Given the hallucinations of generalized models like ChatGPT, a common refrain we've heard is "your models with your data." Let's get a bit more specific. Most mortal enterprises will lack the expertise, resources, or time necessary to build their own models, so selecting from existing Foundation Models that have already done much of the legwork will be the order of the day. And from there, you will either replace or augment those models, which is why Retrieval Augmented Generation (RAG) is going to loom large. And each time you run an LLM, you won't want to vectorize raw data from scratch; you'll want to store and search those embeddings. That's where vector databases and indexes come in. Vectors are the fact tables of LLMs. Sure, a few pure plays (e.g., Pinecone, Milvus) have emerged, but we expect that in the long run, vector storage and indexing will become a feature added to popular databases.
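Here is a minimal sketch of that retrieval loop, assuming a toy bag-of-words embedding so the example runs standalone. A production system would use a model-generated embedding and a vector database such as Pinecone or Milvus rather than a flat numpy scan, but the shape of the loop, embed once, search by similarity, augment the prompt, is the same.

```python
import numpy as np

DIM = 64

def embed(text: str) -> np.ndarray:
    """Toy embedding: hash words into a fixed-size vector.
    A real system would call an embedding model instead."""
    vec = np.zeros(DIM)
    for word in text.lower().split():
        vec[hash(word) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Index the corpus once: store each chunk's vector, not raw text scans.
chunks = [
    "Q2 revenue grew 12% driven by the analytics business.",
    "The data warehouse migration completed in March.",
    "Headcount in sales rose by 40 people in EMEA.",
]
index = np.stack([embed(c) for c in chunks])

def retrieve(question: str, k: int = 2) -> list[str]:
    """Cosine-similarity search over the stored vectors."""
    scores = index @ embed(question)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

question = "How fast is revenue growing?"
context = "\n".join(retrieve(question))
prompt = f"Using only this context:\n{context}\nAnswer: {question}"
print(prompt)  # this augmented prompt is what gets sent to the LLM
```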

Amidst all this, we found a huge gap. Sure, everybody states that Generative AI, like AI in general, must be governed, but few have a clue how. Generative adds its own unique twists, such as figuring out how to keep a process that looks sentient, but is not, from heading off the rails. Answers provided by general-purpose services that scrape the Internet, like ChatGPT, all too often read like high school essays.

Making "classical" machine learning explainable has been enough of a challenge. Now, try explaining a generative model that produces results through a long chain of next-most-likely-word probability computations. The industry is still figuring out how to monitor, mitigate, and document issues unique to Generative AI that only start with hallucinations and inconsistent answers. Inevitably, LLMs used by enterprises will have to document their sources to generate audit trails, but the models themselves will remain black boxes. Certifying those models will likely be an empirical process, one that might benefit from Generative Adversarial Networks (GANs) that purposely test with a "bad" model. We're still at the beginning of this journey.
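To see why that chain resists explanation, consider a toy version of the computation: at each step the model produces logits over a vocabulary, a softmax turns them into probabilities, and the overall output is the chained product of those per-step choices. Random logits stand in for a real model here; the point is the structure, not the output.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = ["the", "forecast", "shows", "growth", "decline", "."]

def softmax(logits: np.ndarray) -> np.ndarray:
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Toy stand-in for a model: random logits, conditioned on nothing real.
# Each step is one "next-most-likely-word" probability computation.
sequence, log_prob = [], 0.0
for step in range(4):
    probs = softmax(rng.normal(size=len(vocab)))
    i = int(np.argmax(probs))          # greedy decoding
    sequence.append(vocab[i])
    log_prob += np.log(probs[i])       # chain the per-step probabilities

print(" ".join(sequence), f"(log prob {log_prob:.2f})")
# Explaining *why* the answer came out this way means explaining every
# per-step distribution, which is why the model stays a black box and
# audit trails focus on documenting sources instead.
```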

Note: This post is excerpted from an in-depth study on the impact of Generative AI on the data and analytics landscape. You can download the full report here.


Tony Baer