Data is the key ingredient for AI – but is it plain or self-raising?

2023 has been dominated to date by the emergence of publicly available Large Language Models (LLMs) like ChatGPT. The popularity of the OpenAI chatbot set new benchmarks for consumer adoption, with 100 million active users within eight weeks of launch, making it the fastest-growing consumer application in history. The same milestone in consumer engagement took TikTok nine months and Instagram over two and a half years.

LLMs have become a gateway to broader public recognition of AI. Listed companies have responded to escalating investor expectations by rapidly incorporating AI-related nomenclature into company reports and updates, sometimes without a compelling commercial rationale.

One less recognised component of AI’s glow-up is the critical role of data. AI has a ferocious appetite for data: the processing algorithms that drive AI models cannot function without high-quality data. Data centres are booming as investment surges to meet demand.

Many AI model developers cannot easily access the data required to feed their algorithms. As AI matures and becomes ubiquitous in contemporary business operations, the current shortfall in data supply is likely to become acute.

To meet this burgeoning demand, a growing number of AI models rely on digitally created datasets. This on-demand approach to synthetic data generation provides the scale and tailored characteristics required to keep AI models sated.

While there is obvious utility in synthetic data, there are also challenges. Synthetic data for real-world applications is typically anchored to baseline sets of real-world data, but the distinction between synthetic and real is increasingly blurred. Data purchasers report growing instances of real-world datasets inadvertently containing synthetic data. In many cases the synthetic elements are difficult to identify, potentially compromising the efficacy of the AI models trained on them.
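For readers curious what "anchored to a real-world baseline" means in practice, the following is a minimal sketch in Python, using entirely hypothetical column names and figures: simple statistics are fitted to a small real dataset, and synthetic rows are sampled from the fitted distribution. Production generators are far more sophisticated, but the anchoring principle is the same.

```python
import numpy as np

# A minimal sketch, assuming a purely hypothetical baseline of real customer
# records with two numeric columns: age and annual spend.
rng = np.random.default_rng(seed=42)
real_baseline = np.column_stack([
    rng.normal(45, 12, size=500),     # age (hypothetical figures)
    rng.normal(3800, 900, size=500),  # annual spend (hypothetical figures)
])

# Anchor the generator to the baseline: estimate its mean and covariance.
mean = real_baseline.mean(axis=0)
cov = np.cov(real_baseline, rowvar=False)

# Sample synthetic rows from the fitted distribution. Because they inherit
# the baseline's statistical profile, synthetic rows mixed back into a real
# dataset carry no obvious marker separating them from genuine records.
synthetic = rng.multivariate_normal(mean, cov, size=500)

print("real means:     ", real_baseline.mean(axis=0))
print("synthetic means:", synthetic.mean(axis=0))
```

Because the sampled rows inherit the statistical profile of the baseline they were fitted to, there is no simple test that separates them from genuine records, which is precisely why inadvertent mixing is hard to detect.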

Clearer commercial signals and better pathways to commercialisation would expand the supply of real-world data and create greater scope for more efficacious synthetic data development.

Research suggests that fewer than 5% of data holders regularly sell datasets or have a commercialisation strategy for their data assets. To unleash the untapped potential of existing real-world data, data holders need clarity around the current and potential value of their digital assets, along with the capacity to engage with a data purchasing market that is currently opaque and poorly defined.

Aurum Data’s valuation and commercialisation solutions give data holders clear, actionable insights that realise value and spur innovation.