Whitepapers

Models Alone Are Not Enough: A Modern Data Architecture for LLMs

As the cost of operating Large Language Models (LLMs) decreases, so does the competitive advantage they can provide organizations on their own. This reality has been accelerated by companies like Hugging Face, Databricks Mosaic Research, and Meta releasing smaller open-source LLMs to the public. While these foundational models are trained on less data than something like GPT-4, they can be highly effective when paired with the right architecture and data.

Speaking to this notion at the 2023 Data + AI Summit, Databricks Co-founder Patrick Wendell outlined how the cost of SaaS-based LLMs has dropped 10x year over year. While this reduction significantly lowers the barriers to entry, it can be a double-edged sword. Wendell points out that as LLMs become more available and affordable to a given organization, they become equally available to its competition.

This paradigm underscores that relying on models alone isn’t enough. To excel in a world awash with inexpensive foundational models, organizations must pair the distinctiveness of their data with the right architectural approach to create AI services that are truly differentiated in their ability to provide accurate and insightful information.

The Importance of Vector Data

Before diving into the various architectural components that can help organizations capitalize on their data, it is first crucial to understand how deep learning applications, like LLMs, interpret information. Central to this process is the transformation of data, which involves encoding information into vector representations called embeddings. In an article titled “The Rise of Vector Data”, Pinecone Founder & CEO Edo Liberty explores this concept by posing a thought-provoking question:

What happens in your brain when you see someone you recognize?

In answering this question, Liberty draws a compelling analogy. Just as our eyes relay light signals to our visual cortex, which translates neural activations into the recognition of someone we know, deep learning models process the world in a similar way. Liberty outlines that information is converted into vectors, which can then be used for prediction, interpretation, comparison, and other cognitive functions. With the right transformer models, vectors can be created from multiple mediums of data, as outlined in the visualization in Figure 1.

Figure 1. Sourced from “The Rise of Vector Data” by Edo Liberty.
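To make this concrete, comparing vectors is typically done with a distance measure such as cosine similarity: embeddings of related things point in similar directions. A minimal sketch with invented four-dimensional toy vectors (real embeddings have hundreds to thousands of dimensions, and the values below are purely illustrative):

```python
import math

def cosine_similarity(a, b):
    # Measures how closely two vectors point in the same direction:
    # 1.0 means identical direction, values near 0 mean unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional embeddings; real models emit hundreds to
# thousands of dimensions.
king = [0.9, 0.8, 0.1, 0.2]
queen = [0.88, 0.82, 0.12, 0.21]
apple = [0.1, 0.2, 0.9, 0.7]

print(cosine_similarity(king, queen))  # close to 1.0 (similar concepts)
print(cosine_similarity(king, apple))  # much lower (unrelated concepts)
```

This is exactly the operation a vector database performs at scale: given a query embedding, find the stored embeddings closest to it.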


Challenges at Scale

Vector embeddings lie at the heart of AI functionality, powering applications like Generative Chat, Image Search, and Semantic Search. In today’s environment, where leaders face mounting pressure to deliver differentiated AI applications, vector embeddings are vital for mobilizing an organization’s unique data assets. However, implementing vector data poses new architectural requirements. Unlike conventional structured data, embeddings consist of mathematical arrays – often encompassing hundreds to thousands of numerical elements – that must be stored and accessed within a continuous vector space. Most databases aren’t inherently equipped to handle this data type efficiently or at scale. This obstacle becomes even more daunting at the scale of a Fortune 500 company, which might require the rapid generation, storage, and retrieval of hundreds of millions to even billions of embeddings to provide LLMs with the right information.
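A quick back-of-envelope calculation illustrates why scale is the hard part. Assuming float32 storage (4 bytes per element) and ignoring index structures and metadata overhead:

```python
def embedding_storage_gb(num_vectors, dimensions, bytes_per_element=4):
    # Raw vector payload only: float32 elements at 4 bytes each.
    # Ignores index overhead, metadata, and replication.
    return num_vectors * dimensions * bytes_per_element / 1e9

# 500 million embeddings at 1,536 dimensions (a common text-embedding size)
print(embedding_storage_gb(500_000_000, 1536))  # roughly 3,072 GB of raw vectors
```

Several terabytes of raw vectors – before any index is built on top – is why purpose-built infrastructure, rather than a conventional database, is usually required.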

A strong example of this can be seen in McKinsey & Company’s recent development of Lilli, which empowers internal team members to access the firm’s 100-year knowledge base through a Generative Chat interface. According to a recent McKinsey article, this knowledge base comprises more than 100,000 documents and interview transcripts drawn from a network of experts across 70 countries. At such a scale, the inherent challenges become evident. Centralizing this wealth of historical data, ensuring rigorous data sanitization, transforming vast amounts of information into vector embeddings, maintaining a link between these embeddings and their original metadata for enhanced transparency, and constructing an infrastructure robust enough to deliver real-time responses are just a few aspects that need to be considered in the architectural design.

A Modern Data Architecture for AI

During the 2023 Data + AI Summit, Pinecone Developer Advocate Roie Schwaber-Cohen gave an excellent presentation on a modern AI architecture incorporating services from Databricks, Pinecone, and Hugging Face. We’ve taken inspiration from Schwaber-Cohen’s comprehensive architecture, shown in Figure 2, because it solves many of the aforementioned challenges and enables organizations to differentiate AI services through their infrastructure and data.

Figure 2. Based on Schwaber-Cohen’s “Scaling AI Applications with Databricks, HuggingFace and Pinecone”.


Let’s break down the different components of this proposed architecture:

Databricks Lakehouse: The Databricks Lakehouse Platform addresses a wide range of use cases and challenges. It provides cloud hosting versatility, scales seamlessly with accelerated GPU clusters, promotes inter-departmental collaboration via its multi-language notebooks, and enhances ML Operations (MLOps) and governance. Its inherent scalability makes it ideal for compute-intensive tasks like generating embeddings. At its foundation, Databricks is built upon three groundbreaking open-source projects: MLflow, Apache Spark, and Delta Lake.

Databricks Delta Lake: At the core of this architecture is Delta Lake, enabling the processing, preparation, and management of large data pipelines at scale. In keeping with the age-old rule of “garbage in, garbage out,” Delta Lake helps ensure a foundation of high-quality data before vector embeddings are generated. With the integration of Delta Live Tables, raw data can be continuously processed as a stream that undergoes meticulous sanitization and segmentation. Instead of processing large documents in their entirety, Delta Live Tables enables them to be segmented sentence by sentence. Additionally, Delta Lake enhances governance with auto-validation and lineage features, tracing each data point from start to finish, which is critical for compliance. Using models from companies like Hugging Face, OpenAI, or MosaicML, embeddings can be created directly in Delta Live Tables, where they are saved downstream, ready for consumption by the vector database.
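To illustrate the segmentation step, here is a minimal plain-Python sketch of sentence-level chunking that attaches source metadata to each chunk so downstream embeddings remain traceable. A production Delta Live Tables pipeline would express this logic in Spark, and the naive regex-based sentence splitter below is a stand-in for a proper NLP tokenizer:

```python
import re

def segment_document(doc_id, text, max_chars=200):
    """Split a document into sentence-level chunks, carrying source
    metadata so each future embedding can be traced back to its origin."""
    # Naive sentence splitter on ., !, ? boundaries; production
    # pipelines would use a proper NLP tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return [{"source_id": doc_id, "chunk_index": i, "text": c}
            for i, c in enumerate(chunks)]
```

Each resulting record is ready to be embedded, with its `source_id` and `chunk_index` preserved as metadata alongside the vector.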

Embeddings Model: Many embedding models are available from providers such as Hugging Face, Databricks Mosaic Research, OpenAI, and Cohere. Since vector databases, like Pinecone, are agnostic to the original medium of data – be it text, audio, image, video, etc. – it’s crucial to choose an embeddings model tailored to the specific type of source data. For instance, if the objective is to transform text into embeddings, the MosaicML Instructor XL model might be a suitable choice, especially considering MosaicML’s recent acquisition by Databricks. For applications requiring seamless integration between visual content and natural language, a multimodal model like OpenAI’s CLIP is a particularly effective option. In industries like Retail & Consumer Packaged Goods, for example, this functionality can significantly enhance a company’s ability to incorporate product images to improve discovery and search for its consumers.
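One simple way to encode this model-per-modality decision in a pipeline is a small registry that maps each source data type to a suitable model. The sketch below is purely illustrative: the registry shape and function are hypothetical, and the model identifiers are examples rather than recommendations:

```python
# Hypothetical registry mapping a data modality to an embedding model.
# The identifiers are illustrative examples, not endorsements.
EMBEDDING_MODELS = {
    "text": "hkunlp/instructor-xl",           # instruction-tuned text embeddings
    "image": "openai/clip-vit-base-patch32",  # multimodal image/text model
}

def choose_embedding_model(modality):
    """Return the configured model identifier for a given data modality."""
    try:
        return EMBEDDING_MODELS[modality]
    except KeyError:
        raise ValueError(f"No embedding model configured for {modality!r}")
```

Centralizing the choice this way keeps the pipeline code unchanged when a better model for a given modality becomes available.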

Pinecone: Pinecone is a pioneer in the vector database domain, engineered specifically to help organizations manage the vector embeddings that fuel AI applications. From a scale perspective, Pinecone can handle anywhere from hundreds of millions to billions of embeddings while maintaining high performance with low-latency queries. It is also much more than a storage solution. Its support for processes like Retrieval Augmented Generation is invaluable for enterprises – especially in regulated industries – as it reduces model hallucinations and provides LLMs with transparent, citable information grounded in an organization’s historical knowledge base. Moreover, features like metadata storage and filtering are integral, as they enable users to attach associated metadata, ensuring that every embedding can be reliably traced back to its source data and referenced in the future.
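To show the retrieval pattern a vector database provides, here is a toy in-memory index supporting upsert, nearest-neighbour queries, and metadata filtering. This is a teaching sketch, not Pinecone’s actual API, and it uses exact brute-force search; real systems rely on approximate-search indexes to stay fast across billions of vectors:

```python
import math

class ToyVectorIndex:
    """Minimal in-memory stand-in for a vector database: stores embeddings
    with metadata and answers nearest-neighbour queries by cosine similarity."""

    def __init__(self):
        self._records = []  # list of (id, vector, metadata) tuples

    def upsert(self, record_id, vector, metadata):
        self._records.append((record_id, vector, metadata))

    def query(self, vector, top_k=3, metadata_filter=None):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) *
                          math.sqrt(sum(x * x for x in b)))
        # Keep only records whose metadata matches the filter, then rank.
        candidates = [
            (cosine(vector, v), rid, meta)
            for rid, v, meta in self._records
            if metadata_filter is None
            or all(meta.get(k) == val for k, val in metadata_filter.items())
        ]
        candidates.sort(key=lambda t: t[0], reverse=True)
        return candidates[:top_k]

index = ToyVectorIndex()
index.upsert("doc-1", [0.9, 0.1], {"source": "annual-report", "year": 2022})
index.upsert("doc-2", [0.1, 0.9], {"source": "interview", "year": 2023})
matches = index.query([0.85, 0.15], top_k=1)
print(matches[0][1])  # "doc-1" is the closest match
```

In a Retrieval Augmented Generation flow, the text behind the top matches – recovered via the stored metadata – is what gets injected into the LLM’s prompt as grounded, citable context.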

In today’s rapidly evolving technology landscape, it is evident that not all AI services are built alike. The mere adoption of LLMs does not equate to gaining a competitive edge. As LLMs become more accessible, the differentiator lies not just in the models, but in how organizations architect their systems and operationalize their data. Central to this is the integration and management of vector embeddings, which are the lifeblood of the Generative AI revolution. The transformation of raw data into meaningful vector embeddings and their subsequent storage and retrieval demands a carefully designed architecture that can operate at scale and ensure data quality. By bridging historical data assets with the right architectural approach, enterprises can deliver highly differentiated services that are supercharged by their unique intellectual capital.

Rab Bruce-Lockhart

Chief Revenue Officer

11:25 – 24th January 2025

Ready to discover more?

Contact us and we’ll set up a video call to discuss your requirements in detail.