Unstructured data takes varying forms. It’s typically text-heavy, but may also contain dates, numbers, and dictionaries. Data engineers commonly encounter it as deeply nested JSON. However, the term “unstructured” really refers to anything non-tabular; by commonly cited estimates, over 80% of the world’s data is unstructured.
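To make the "deeply nested JSON" case concrete, here is a minimal Python sketch of the kind of flattening a data engineer might do before loading such a record into a table. The `flatten` helper, the dotted-path key convention, and the sample record are all illustrative assumptions, not something taken from a particular tool or from the earnings call.

```python
import json
from typing import Any, Dict


def flatten(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts and lists into one level, keyed by dotted paths.

    Hypothetical helper for illustration. Empty dicts and lists produce
    no keys at all, which a real pipeline may want to handle explicitly.
    """
    flat: Dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten(value, path))
    elif isinstance(obj, list):
        # List positions become part of the path, e.g. "user.tags.0".
        for index, value in enumerate(obj):
            path = f"{prefix}.{index}" if prefix else str(index)
            flat.update(flatten(value, path))
    else:
        # Leaf value (string, number, bool, or None): store it under its path.
        flat[prefix] = obj
    return flat


# Made-up sample record, only to show the shape of the problem.
raw = '{"user": {"name": "Ada", "tags": ["a", "b"]}, "created": "2023-01-01"}'
row = flatten(json.loads(raw))
print(row)
```

Each leaf ends up as its own column-like key (`user.name`, `user.tags.0`, and so on), which is roughly the tabular shape a warehouse expects. It also shows why this approach stops working for documents, videos, and free text: there are no keys to flatten.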
While unstructured data may seem innocuous to us data practitioners, it’s making huge waves at a macro level. Indeed, GPT models are all trained on unstructured data. Tomasz Tunguz observed as much in a recent article on Snowflake’s earnings call:
It might seem odd to view unstructured data in this financial and macroeconomic context. My first job was in investment banking, so I’m nostalgic when it comes to reading stuff like this. “Unstructured data is the growth engine” could make sense to me; it sounds like a really big market tailwind!
But it’s been a while since I aligned PowerPoint boxes. These days, unstructured data to me is conceptually a deeply nested JSON waiting to be processed. But it’s clear from the earnings call that unstructured data isn’t just JSON (was it ever?) but also text, documents, videos, and the like.
What’s emerged is that this data powers some of the use cases that will soon be most critical, and where it’s processed is of paramount importance to the two heavy hitters of the data world: Databricks and Snowflake. Let’s dive into why.
GPT models feed on data. Specifically, they feed on unstructured data: things like text documents, HTML files, and code snippets. As companies increasingly look to put LLMs into production, demand for processing this data rises, and so does its value to vendors like Snowflake and Databricks.