Tapping into the power of unstructured data through NLP

In 1970, IBM researcher Edgar Codd wrote one of the most important papers in the history of computing, A Relational Model of Data for Large Data Banks, which became the foundation for the Relational Database and, later, Structured Query Language (SQL). Despite the significance of Codd’s findings, IBM largely ignored his research at first because it was commercializing a different type of database, leaving the company trapped in a true innovator’s dilemma. His paper did, however, catch the attention of Larry Ellison and inspired him to start Software Development Laboratories, the company that later became Oracle.

The Relational Database is so important because it fundamentally changed how software is developed and how organizations sort, search, and analyze data. Nearly 50 years later, in 2017, another landmark paper came out of Google Brain’s research department and the University of Toronto: Attention Is All You Need. This research introduced the world to the Transformer Architecture, which has had a profound impact on Machine Learning (ML) and Natural Language Processing (NLP).

While recent ML applications like ChatGPT, Bard, and DALL-E have generated headlines (literally and figuratively), the potential to use large language models (LLMs) to tap into the power of unstructured data could quietly revolutionize how enterprises manage information and develop applications. We will explore three NLP applications that show promise in supporting this transformation: improving search, extracting information, and classifying data.

Improving Search

As consumers, one of the ML applications we have become most accustomed to is search. Search engines like Google let us discover information at a rapid pace. However, these search engines rely on incredibly advanced computing infrastructure, networks, and huge quantities of training data that would be near-impossible for most organizations to replicate. As a result, it is difficult for companies to make searching for internal information as seamless and efficient for employees as using Google or Bing is for consumers. Because so much organizational data is unstructured, this creates a knowledge discovery problem: valuable insights are often locked in documents, emails, support chats, meeting notes, internal wiki pages, and similar sources.

In a sense, it is the modern equivalent of hunting for a single piece of information buried in a book at a public library versus using Semantic Scholar to discover relevant sources. However, the Transformer Architecture and advances in NLP have shifted the paradigm, making it possible for organizations to search internal information much as they would search the web with Google or Bing. This capability can have a material impact across industries, enabling early adopters to gain a competitive advantage.

From how legal professionals search through huge quantities of information during case discovery to how scientists research clinical studies and papers in pharmaceutical development cycles, semantic search, which matches on meaning rather than exact keywords, can help organizations discover and innovate faster. These advancements have the potential to improve not only how information is discovered within an organization, but also how developers enhance the search functionality within their own products to improve the overall customer experience.
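As a rough illustration of the ranking step behind this kind of search, the sketch below turns a query and a handful of internal documents into vectors and orders the documents by cosine similarity. It rests on loud assumptions: the `embed` function is a toy bag-of-words counter standing in for a Transformer embedding model, so it only rewards shared words rather than shared meaning, and the document IDs and texts are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector.

    A real semantic search system would call a Transformer-based
    embedding model here; this placeholder only captures word overlap.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, documents: dict, top_k: int = 2) -> list:
    """Return the top_k (doc_id, score) pairs, best match first."""
    query_vector = embed(query)
    scored = [
        (doc_id, cosine(query_vector, embed(text)))
        for doc_id, text in documents.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical internal documents, invented for illustration.
DOCUMENTS = {
    "wiki/expenses": "How to submit travel expenses and get reimbursed",
    "email/vpn": "Steps to reset your VPN password from a laptop",
    "notes/q3-planning": "Meeting notes on Q3 hiring and budget planning",
}

results = search("reset vpn password", DOCUMENTS)
for doc_id, score in results:
    print(f"{doc_id}: {score:.2f}")
```

Swapping `embed` for a real embedding model would leave the ranking logic unchanged, which is one reason this vector-plus-similarity design is a common starting point.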

The strategic importance of search was underlined in early February 2023, when Microsoft CEO Satya Nadella announced the reinvention of Microsoft Bing and Edge with AI-powered search through the company’s OpenAI alliance, and Google CEO Sundar Pichai announced the upcoming release of Bard, which would bring new AI-powered features to Google Search.

Extracting Information

Another organizational challenge in using unstructured text is extracting the right information in a scalable way. This is difficult because unstructured text spans multiple languages and sources rather than sitting in tables with a specific schema. However, advances in LLMs have made it easier for organizations to identify and extract critical information, such as entity names, addresses, keywords, and topics, from unstructured text. Once the relevant data is extracted, it can be used to improve various processes and to enrich other language models. From a business application perspective, improving supply chain visibility is a direct example of where text extraction can create value.

As supply chains become increasingly complex, it is common for organizations to lose sight of their distributed network beyond their direct suppliers. This poses a problem as organizations’ direct suppliers often depend on an extended network of other suppliers to provide critical materials, making it difficult to predict disruptions happening further down the value chain. Research coming out of the University of Cambridge explores how NLP could help combat this issue by extracting entity names from public records, such as news articles, and using these data to classify buyer-to-supplier relationships to enhance supply chain mapping solutions.
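To make the idea concrete, here is a minimal sketch of pulling buyer-to-supplier pairs out of news-style sentences. It is not the method from the Cambridge research: it assumes a crude rule-based approach in which company names are runs of capitalized words and a few hand-picked cue phrases signal the relationship, where a production system would use trained entity recognition and classification models. The company names, sentences, and cue phrases are all invented for illustration.

```python
import re

# Runs of capitalized words such as "Acme Motors"; a crude stand-in
# for a trained named-entity recognition model.
ENTITY = r"[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*"

# Hypothetical cue phrases suggesting a buyer-to-supplier relationship.
SUPPLIES = re.compile(
    rf"({ENTITY}) (?:supplies|will supply|delivers) .*?\bto ({ENTITY})"
)
BUYS = re.compile(
    rf"({ENTITY}) (?:buys|sources|purchases) .*?\bfrom ({ENTITY})"
)


def extract_relationships(sentence: str) -> list:
    """Return a list of {'buyer': ..., 'supplier': ...} dicts found in a sentence."""
    relations = []
    # "<Supplier> supplies ... to <Buyer>"
    for match in SUPPLIES.finditer(sentence):
        relations.append({"buyer": match.group(2), "supplier": match.group(1)})
    # "<Buyer> sources ... from <Supplier>"
    for match in BUYS.finditer(sentence):
        relations.append({"buyer": match.group(1), "supplier": match.group(2)})
    return relations


# Invented example sentences; the companies are fictitious.
NEWS = [
    "Volta Cells will supply battery packs to Acme Motors from next year.",
    "Acme Motors sources aluminium sheets from Northwind Metals.",
    "Shares in Acme Motors rose after the announcement.",
]

relations = [rel for sentence in NEWS for rel in extract_relationships(sentence)]
for rel in relations:
    print(f"{rel['supplier']} -> {rel['buyer']}")
```

Collected over many articles, pairs like these could feed a supplier graph, although hand-written patterns will miss most real-world phrasing and are shown only to illustrate the shape of the output.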

This example is particularly relevant given the disruptions resulting from the coronavirus pandemic, which exposed many weaknesses in global supply chain networks. A recent survey from McKinsey & Company underscores this need: supply chain executives listed improved visibility as their top priority for digital investments.

Classifying Data

At its core, classification assigns text to predefined categories, which lets applications act on the context and meaning of what was written. Advances in the commercialization of LLMs are enabling organizations to use powerful classification capabilities to discover new insights and patterns in text, and to embed language understanding into digital products. The supply chain visibility example above uses classification to better understand the relationship between buyers and suppliers, but classification models can also be applied to numerous other use cases, such as recognizing customer intent, performing large-scale sentiment analysis, understanding customer reviews, and providing product recommendations.

For instance, if we think about customer support through the lens of an airline, classification models can be combined with other ML capabilities and integrated into support processes to analyze emails, chatbot inquiries, and even incoming customer calls through real-time analysis of voice-to-text transcriptions.

This can enable airlines to resolve more cases through self-service interfaces and to route incoming customers to the right support specialist dynamically; for instance, you may need to speak to someone about a lost bag rather than about rebooking a canceled flight. This approach can create value by making support functions more scalable and resilient, optimizing hold queues, and improving experiences by resolving customer issues more quickly.
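The routing step described above can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: the queue names and cue words are hypothetical, and the keyword-count scorer is a deliberately simple stand-in for a trained intent classifier or an LLM-based one.

```python
import re
from collections import Counter

# Hypothetical support queues and cue words. A trained classifier
# would learn these associations from labeled support messages.
QUEUE_KEYWORDS = {
    "baggage": {"bag", "bags", "baggage", "luggage", "suitcase", "lost"},
    "rebooking": {"rebook", "rebooking", "canceled", "cancelled", "reschedule", "missed"},
    "refunds": {"refund", "refunds", "charge", "charged", "payment"},
}


def classify(message: str, default: str = "general") -> str:
    """Pick the queue whose cue words appear most often in the message.

    Falls back to `default` when no cue word matches at all.
    """
    tokens = Counter(re.findall(r"[a-z]+", message.lower()))
    scores = {
        queue: sum(tokens[word] for word in words)
        for queue, words in QUEUE_KEYWORDS.items()
    }
    best_queue, best_score = max(scores.items(), key=lambda item: item[1])
    return best_queue if best_score > 0 else default


# Invented customer messages for illustration.
MESSAGES = [
    "My suitcase never arrived and I think the bag is lost",
    "My flight was canceled, can I rebook for tomorrow?",
    "What time does the lounge open?",
]

for message in MESSAGES:
    print(f"{classify(message):>10} <- {message}")
```

The same `classify` interface could sit behind email, chat, or transcribed calls, so the scorer can later be replaced by a stronger model without changing the routing code around it.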

Future Impact

There are numerous NLP-based applications that organizations can adopt to harness the power of unstructured data. With the increasing accessibility of NLP through consumable APIs and open-source models, organizations can orchestrate multiple capabilities in parallel to enhance different parts of a process. While we are still very much in the early days of the space, the Transformer Architecture has helped propel significant advances in the field of ML. As innovations in NLP continue to accelerate, the potential impact on managing and analyzing data could very well be of similar magnitude to that of the Relational Database. The question is whether organizations will hesitate to capitalize on these innovations, as IBM did with Codd’s research, or execute on the opportunity to build generational products, as Ellison did.

Rab Bruce-Lockhart

Chief Revenue Officer

15:43 – 20th January 2025