Uncovering the Financial Costs Behind the Generative AI Revolution

By Kunal Agarwal, CEO and Co-Founder of Unravel Data 

A recent survey of over 2,800 IT professionals by technology publisher and training company O’Reilly revealed that nearly two-thirds of respondents said they are already using generative AI in their business. “We’ve never seen a technology adopted as fast as generative AI—it’s hard to believe that ChatGPT is barely a year old,” the company said in its report.

Further, even among those enterprises still holding out, the push for generative AI adoption is growing. In a survey of 1,000 companies, each with over $1B in revenue, conducted last year by the AI Infrastructure Alliance, two-thirds named adopting LLMs and generative AI by the end of the year as their top priority. What we are witnessing today is an accelerated ‘AI arms race’ among major enterprises, and the armaments of choice are modern cloud data platforms like Databricks, Snowflake, BigQuery, and Amazon EMR.

These data platforms have made it possible for companies to put practical, production-grade AI/ML, including LLMs, into the hands of more users across a wider variety of business departments. Department heads are understandably gung-ho about generative AI, but that enthusiasm drives up cloud utilisation and puts financial pressure on the business from the C-suite downwards. The reason? Generative AI workloads are complex, depend on massive data workloads, and are therefore hugely expensive, as company after company is finding out.

McKinsey estimates that the total cost of ownership (TCO) for three different LLM archetypes can run between $2 million and $200 million, depending on whether the company is a ‘Taker’ (using an off-the-shelf LLM with little or no customisation), a ‘Shaper’ (customising an off-the-shelf model to integrate with internal data and systems), or a ‘Maker’ (building and training its own foundation model from scratch).

Setting aside millions of dollars for these AI projects, even in the Taker or Shaper form, is clearly a huge bet for the majority of companies. Most are already struggling with escalating cloud data costs, and their leaders have to walk a tightrope between rising demand for petabyte-scale data workloads and their financial and human resource capabilities.

This explains why cloud data platform owners and data teams are becoming more focused on “ROI consciousness”: ensuring that the organisation realises the greatest value from its cloud data investments. That approach, however, requires knowing the true costs of AI initiatives, and many of those costs are hidden. We’ve seen that companies that root out these hidden costs find they already have enough resources, human and financial, to run as much as 50% more AI workloads.

To do this, companies first need visibility into costs at a granular level. The lion’s share of the cost of an AI model goes into building and running the data pipeline(s) that train it. For a highly complex AI model pipeline, that can mean tens of thousands of individual jobs, and each job carries a price tag. Without visibility into how much each individual job costs, companies cannot uncover their biggest hidden cost of AI: data pipeline inefficiency.
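
As a rough illustration of what job-level cost attribution looks like, here is a minimal Python sketch. The job records, instance prices, and pipeline names are all hypothetical; in practice this telemetry comes from the platform’s own billing and monitoring feeds.

```python
# Minimal sketch of per-job cost attribution (all figures hypothetical).
from collections import defaultdict

# Hypothetical hourly prices for the instance types a pipeline might use.
INSTANCE_PRICE_PER_HOUR = {"m5.2xlarge": 0.384, "r5.4xlarge": 1.008}

# Hypothetical job telemetry: pipeline, instance type, node count, runtime.
jobs = [
    {"pipeline": "feature_prep", "instance": "m5.2xlarge", "nodes": 10, "hours": 2.5},
    {"pipeline": "feature_prep", "instance": "r5.4xlarge", "nodes": 4,  "hours": 1.0},
    {"pipeline": "model_train",  "instance": "r5.4xlarge", "nodes": 16, "hours": 6.0},
]

def job_cost(job):
    """Cost of one job: nodes x hours x hourly instance price."""
    return job["nodes"] * job["hours"] * INSTANCE_PRICE_PER_HOUR[job["instance"]]

# Roll individual job costs up to the pipeline level.
pipeline_costs = defaultdict(float)
for job in jobs:
    pipeline_costs[job["pipeline"]] += job_cost(job)

for pipeline, cost in sorted(pipeline_costs.items(), key=lambda kv: -kv[1]):
    print(f"{pipeline}: ${cost:,.2f}")
```

At tens of thousands of jobs, the same roll-up has to come from automatically collected telemetry rather than a hand-maintained list, which is where the difficulty described below begins.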

Any company’s monthly cloud data bill includes a significant amount of self-inflicted overspending that adds up to millions of dollars of “waste” over the course of a year. Eliminating this overspending goes a long way to funding additional AI projects without increasing the budget.

But identifying exactly where the overspending and waste are happening can be a backbreaking, time-consuming effort.

It comes down to tackling inefficiencies in two areas, infrastructure and code, where they are first introduced: at the individual job or user level. With hundreds, if not thousands, of individual users running data jobs (from PhDs to interns and everyone in between), the sheer number of inefficiencies that can inadvertently creep into AI applications and pipelines is enormous. And the more people running data jobs in the cloud, the greater the waste and overspending.

Infrastructure costs are usually what leap to mind when talking about waste and overspending. To be sure, we see most organisations spend at least 30% more on cloud data infrastructure than necessary (and the figure is probably even higher in generative AI projects). Individual users simply request more resources, in number, size, or type, than are needed to run their jobs on time, every time.
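
One hedged way to picture this kind of overprovisioning is to compare what each job requested with what it actually used at peak. The sketch below uses made-up numbers and the simplifying assumption that a job’s cost scales roughly with the memory it requests.

```python
# Sketch: estimate overprovisioning waste per job (all numbers hypothetical).
# Simplifying assumption: job cost scales roughly with the memory requested.

jobs = [
    # (job name, memory requested in GB, peak memory used in GB, job cost in $)
    ("daily_embeddings",   512, 180, 310.0),
    ("tokenise_corpus",    256, 240,  95.0),
    ("train_reranker",    1024, 400, 820.0),
]

total_cost = 0.0
total_waste = 0.0
for name, requested_gb, used_gb, cost in jobs:
    unused_fraction = max(0.0, (requested_gb - used_gb) / requested_gb)
    waste = cost * unused_fraction  # rough $ attributable to unused headroom
    total_cost += cost
    total_waste += waste
    print(f"{name}: ~{unused_fraction:.0%} of requested memory unused, ~${waste:.0f} waste")

print(f"Estimated waste: ${total_waste:.0f} of ${total_cost:.0f} ({total_waste / total_cost:.0%})")
```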

But overspending and waste (in other words, cost inefficiency) also occur deeper under the hood, in the code. How the thousands of data jobs are configured and coded determines how long the AI applications and pipelines take to run. The meter is always ticking in the cloud: inefficient code leads to higher monthly cloud data platform bills, and code issues are a leading reason these expensive data pipelines fail in production (and need to be run all over again).
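
To make that concrete, here is a small, hedged PySpark sketch (the data and column names are invented) showing two common, easily missed wins: broadcasting a small dimension table to avoid an expensive shuffle, and caching a result that several downstream steps reuse.

```python
# Illustrative PySpark sketch; the data is invented and trivially small.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("pipeline_tuning_sketch").getOrCreate()

# Stand-ins for a large fact table and a small dimension table.
events = spark.createDataFrame(
    [(1, "click"), (2, "view"), (1, "purchase")], ["user_id", "event"]
)
users = spark.createDataFrame([(1, "emea"), (2, "apac")], ["user_id", "region"])

# Broadcasting the small table avoids shuffling the large one across the cluster,
# which on real data can cut both runtime and cost substantially.
enriched = events.join(broadcast(users), "user_id")

# If several downstream feature jobs reuse this result, caching it avoids
# recomputing the join (and re-reading the source data) every time.
enriched.cache()

enriched.groupBy("region").count().show()
spark.stop()
```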

Eliminating infrastructure and code inefficiencies involves capturing millions of telemetry data points, correlating them into a meaningful context for the task at hand, and then analysing it all to find a better way to do things. Trying to do this “by hand” takes a lot of time, effort, and expertise. Cobbling together all the necessary details is hard enough; figuring out what it all means (and what to do about it) is an even steeper hill to climb. It usually means diverting your top talent to firefighting.
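
For a sense of what “correlating everything into a meaningful context” involves, even in miniature, the sketch below joins three hypothetical telemetry exports (scheduler runs, cluster metrics, billing line items) on a job ID and flags expensive, under-utilised jobs. Real environments have far more sources and far messier keys.

```python
# Sketch: correlating telemetry sources at the job level (data is hypothetical).
import pandas as pd

scheduler = pd.DataFrame({
    "job_id": ["j1", "j2", "j3"],
    "pipeline": ["feature_prep", "feature_prep", "model_train"],
    "runtime_hours": [2.5, 0.7, 6.0],
})
cluster_metrics = pd.DataFrame({
    "job_id": ["j1", "j2", "j3"],
    "avg_cpu_util": [0.22, 0.81, 0.35],  # fraction of requested CPU actually used
})
billing = pd.DataFrame({
    "job_id": ["j1", "j2", "j3"],
    "cost_usd": [310.0, 40.0, 820.0],
})

# One job-level view across all three sources.
jobs = scheduler.merge(cluster_metrics, on="job_id").merge(billing, on="job_id")

# Flag the likely culprits: costly jobs running at low utilisation.
suspects = jobs[(jobs["cost_usd"] > 100) & (jobs["avg_cpu_util"] < 0.5)]
print(suspects.sort_values("cost_usd", ascending=False))
```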

This is a task perfectly suited to automation and AI: auto-discover everything running in generative AI pipelines, end to end, learn how it all works together, and identify where and when something could be done more efficiently. AI can absorb huge amounts of information and make sense of it at a speed, scale, and accuracy that not even the most experienced engineers can match.

Generative AI may be the biggest gold rush at the moment, but it requires a significant investment of resources, and its hidden costs can have serious repercussions on a company’s ability to compete. Uncovering and correcting these hidden costs is a key differentiator that will ultimately separate the winners from the also-rans.
