
Challenges and pitfalls of using large language models on proprietary enterprise data

Chatbot-based Q&A systems quickly became the number one large language model use case in all industries. They are usually powered by a technique called retrieval augmented generation. We show typical limitations and pitfalls of this technique and how to overcome them.


More than one year after the release of ChatGPT, we can easily identify the Large Language Model (LLM) use case that has had the most impact: chatbot-based Q&A systems on proprietary customer data. Demand for it stretches across all industries. Together with an insurance company, we built a system that can answer questions like “Is my watch covered by my current policy?”. With a bank, we developed a solution that answers questions about regulations. And with a telecom provider, we built a system that helps deal with everyday requests from its customers.

The technology behind all these use cases is called Retrieval Augmented Generation (RAG). It combines an LLM with a retrieval system so that the LLM can search for relevant information outside of its training data and then generate an informed response to a query. The approach makes it easy to update or extend a model's knowledge without fine-tuning and drastically reduces hallucinations. It has already established itself as the key ingredient of many new LLM-powered products, most notably Microsoft Copilot.

While these products can provide value for companies in some scenarios, we have often observed situations in which they fall short:

  • An off-the-shelf solution may not produce satisfactory outcomes and, for example, return incorrect responses, particularly if the application involves more complex requirements such as domain-specific data or inherent structure within the data.
  • The data may reside in a location that is inaccessible to standard, off-the-shelf tools due to technical and/or legal constraints, such as proprietary software environments or on-premises infrastructure setups.
  • The data may be of heterogeneous type, such as databases, diagrams, or forms, which might prevent it from being used by an off-the-shelf tool.
  • The licensing scheme and cost structure may not align with the requirements of the organisation.

A custom solution built using RAG can effectively address these constraints and still allow your company to profit from the advances generative AI has made in recent months and years. Depending on the situation, this can be an extension of existing solutions (for example the OpenAI Assistants API) or an even more bespoke solution, potentially using open-source large language models deployed on any infrastructure.

In the remainder of this blog post, we will first discuss the basic principles of RAG from a technical perspective. We will then identify the shortcomings of the basic approach and show how we overcame them in different projects.

Figure: Exemplary use cases for RAG systems in the insurance, industrial, banking, telecommunications, and pharma sectors.

How does RAG work?

RAG combines an LLM with an information retrieval system. For every user question, this system is first used to find information which might answer this question. Then, this information is fed together with the user question to the LLM, so it can generate an informed response.

The retrieval system is usually built using so-called embeddings. An embedding takes a snippet of text and maps it into a mathematical vector space, in a way that places texts about similar topics close to each other. This allows us to search for information by embedding all the texts as well as the user query: we can then identify the texts which are closest to the query in the vector space. These texts likely contain information which is relevant to the query.
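
As a minimal sketch of this idea, the following snippet embeds a handful of illustrative policy snippets and a user question with the sentence-transformers library and ranks the snippets by cosine similarity. The model name and the example texts are our own assumptions, not taken from any specific project:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative document snippets; in practice these are chunks of your proprietary data
documents = [
    "Watches and jewellery are covered up to CHF 5'000 under the household policy.",
    "Claims must be reported within 30 days of the incident.",
    "The premium is due annually at the beginning of January.",
]

# Any embedding model works here; this one is small and publicly available
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = model.encode(documents)

# Embed the user question and compare it to every document in the vector space
query = "Is my watch covered by my current policy?"
scores = util.cos_sim(model.encode(query), doc_vectors)[0]

# The closest snippets are the ones later passed to the LLM as context
for i in scores.argsort(descending=True)[:2].tolist():
    print(f"{float(scores[i]):.2f}  {documents[i]}")
```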

One important ingredient of this approach is so-called chunking: we take our texts and split them into smaller parts (chunks) of, for example, a few hundred words each. These chunks are then embedded separately, so that we do not just search across whole texts but across their relevant parts.
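
A very basic form of chunking can be sketched as follows; the chunk size and overlap are illustrative values that need to be tuned per use case:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split a text into overlapping chunks of roughly `chunk_size` words."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
    return chunks

# Each chunk is then embedded and stored separately in the vector store
```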

Figure: A basic RAG system. Proprietary data is stored in a database or vector store. The user question is used to search for relevant information, the most relevant results are identified and passed to the LLM as context together with the question, and the LLM responds to the user.
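
Putting the pieces together, the generation step can be sketched roughly as follows. The retrieve() helper stands in for the embedding search above, and the OpenAI chat API is just one of many possible backends; the model name and prompt wording are illustrative assumptions:

```python
from openai import OpenAI

client = OpenAI()  # any LLM backend works; the OpenAI client is only an example

def answer(question: str, retrieve) -> str:
    # 1. Find the chunks that are closest to the question in the vector space
    context = "\n\n".join(retrieve(question, top_k=5))

    # 2. Feed the question together with the retrieved context to the LLM
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "Answer the question using only the provided context. "
                        "If the context does not contain the answer, say so."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```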

While this simple setup works surprisingly well in many situations, it falls short in others. Usually, the following challenges can be identified:

  • Embedding-based search often fails: While embeddings are great at capturing the meaning of synonyms and the like, they are not perfect. For certain types of data, such as legal text or company names, some embedding models even perform quite poorly. Thus, as your information base grows, the likelihood of finding the correct chunks decreases.
  • Chunks miss context: Even if the correct chunk is found, it is still only a small part of a larger text. The missing context around it might lead the LLM to interpret the content incorrectly.
  • One-shot approach prevents proper search: The retrieval system has one chance to find the correct information. If the user phrased the query in an unusual way, it might fail. And if the retrieved information calls for a follow-up search, the system cannot perform one.

How to improve RAG performance

In the cases mentioned above, the model either provides no answer at all or, worse, provides a wrong one. There are multiple extensions and adjustments of the basic RAG setup which, depending on the problem, can help.

Figure: Potential improvements in RAG systems: better chunking, improved search, a suitable model, and agents.

Suitable model

Unsurprisingly, your model selection heavily impacts the final performance. LLMs have a limit on how much information they can process at once, so we can only provide a certain number of chunks to help answer the question. Using a model with a larger context length is an easy way to improve a RAG system, since including more chunks decreases the likelihood of missing the relevant ones. Luckily, newer models have drastically increased the context length: GPT-4, for example, was updated in November 2023 to support 128’000 tokens, which corresponds to roughly 200 pages of text. And in February 2024, Google announced a breakthrough in its Gemini series, which will soon support one million tokens.
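
Whether more chunks fit into the prompt is ultimately a question of tokens, not characters or pages. A small sketch using the tiktoken tokenizer (the encoding name matches recent OpenAI models; the budget is an illustrative figure) shows how a context budget can be filled:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-class models

def select_chunks(chunks: list[str], token_budget: int = 100_000) -> list[str]:
    """Greedily keep retrieved chunks (sorted by relevance) until the token budget is full."""
    selected, used = [], 0
    for chunk in chunks:
        n_tokens = len(encoding.encode(chunk))
        if used + n_tokens > token_budget:
            break
        selected.append(chunk)
        used += n_tokens
    return selected
```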

However, there is a caveat to this solution. The longer the context, the harder it is for a model to find the relevant information in it. Furthermore, it also matters where in the context the information appears: information in the middle is usually weighted less. Last but not least, inference with very long contexts requires a lot of resources. Thus, a trade-off must be made based on careful evaluation.

The context length is not the only selection criterion for your model. If you have highly specific data, for example medical documents, a model trained on such data might outperform a more standard one, even though it might have a smaller context length.

Better chunking

Chunking, the process of cutting the information into small, searchable parts, heavily impacts retrieval performance. If a chunk is too small, the LLM has a hard time interpreting it because the surrounding context (the original text) is missing. If a chunk is too large, the embedding vector becomes very general because the text starts to cover different topics, and so the search performance decreases.

Again, an evaluation specific to the use case must be made to find the optimal chunk size and overlap. Furthermore, adaptive chunking can be used, so that not all chunks are the same size. Adaptive chunking usually considers document structure (e.g. paragraphs) and might use embeddings again to measure the similarity in discussed topics between different passages of the text.
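
As one illustration, a simple structure-aware variant first splits on paragraphs and then merges neighbouring paragraphs until a target size is reached; measuring topic similarity between passages with embeddings would be a further refinement. The word limit below is an arbitrary example value:

```python
def chunk_by_paragraphs(text: str, max_words: int = 200) -> list[str]:
    """Merge consecutive paragraphs into chunks of at most `max_words` words."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for paragraph in paragraphs:
        n_words = len(paragraph.split())
        if current and current_len + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```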

Improved search

As already discussed, embedding-based search has its limits. A common way to improve it is to combine it with more classical, keyword-based search. This combination, often called hybrid search, usually outperforms pure embedding-based search.
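
A common way to implement hybrid search is to run both retrievers and merge their rankings, for example with reciprocal rank fusion. The sketch below assumes hypothetical bm25_rank() and embedding_rank() functions that each return document IDs sorted by relevance (they could be backed by a library such as rank_bm25 and the embedding search shown earlier):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into a single ranking (RRF)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage: combine keyword-based (BM25) and embedding-based rankings
# hybrid_ranking = reciprocal_rank_fusion([bm25_rank(query), embedding_rank(query)])
```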

Since the number of chunks which can be fed into the LLM is limited, a reranking step is often added. More chunks than fit into the LLM context are retrieved first, and then the most fitting ones among them are identified with a separate machine learning model. This model is too expensive to score every chunk in the data, but still cheap enough to analyse more chunks than would fit into the LLM context.
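
Reranking is typically done with a cross-encoder, which scores each query/chunk pair jointly and is therefore more accurate, but also slower, than embedding search. A sketch using the sentence-transformers CrossEncoder (the model name is one publicly available example):

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores each query/chunk pair jointly; this model is one public example
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Keep only the best chunks among a larger set of retrieved candidates."""
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```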

Finally, the query with which we search can also be improved. By default, the user's question is used. However, we can also ask another large language model to first reformulate this question into a more fitting search term. This can be anything from specialised keywords to multiple questions to potential answers.
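
Query reformulation can be as little as one extra LLM call before retrieval. The sketch below reuses the OpenAI client from the earlier example; the prompt and model name are again illustrative:

```python
def reformulate_query(question: str) -> str:
    """Ask an LLM to turn the user's question into a better search query."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative
        messages=[
            {"role": "system",
             "content": "Rewrite the user's question as a concise search query "
                        "with domain-specific keywords. Return only the query."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()

# The reformulated query is then used for retrieval instead of the raw question
```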

Agents

As mentioned, the retrieval system has one chance to find the right information. A more general approach to Q&A systems changes this by putting a large language model in charge of the whole information retrieval process. The model can decide to actively search for information, look at the results, and then search again with different words or for another topic if necessary. This is done with so-called agents, where a large language model not only talks to the user but can also decide to use a tool, like a search engine (our retrieval system) or even a database of structured information. This paradigm can be very powerful in certain scenarios, but usually only works satisfactorily if the large language model has been trained to work with external tools.
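
A minimal agent loop can be sketched as follows. The search and llm_decide callables are placeholders for the retrieval system and for an LLM prompted (or fine-tuned) to choose between searching again and answering; they are not part of any real API:

```python
def agent_answer(question: str, search, llm_decide, max_steps: int = 5) -> str:
    """Let the LLM drive retrieval: search, inspect the results, search again or answer."""
    observations = []
    for _ in range(max_steps):
        # The LLM looks at the question and everything retrieved so far ...
        decision = llm_decide(question, observations)
        if decision["action"] == "search":
            # ... and may decide to query a tool such as the retrieval system
            observations.append(search(decision["query"]))
        else:
            # ... or to answer once it has gathered enough information
            return decision["answer"]
    return "I could not find enough information to answer this question."
```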

Figure: Agent-based RAG system. The user interacts with the LLM agent, which in turn accesses unstructured data, structured data, and proprietary applications.

Conclusion

We are confident that the combination of LLMs with proprietary information will be a key ingredient in making them more useful. Access to tools like a search engine, but also databases with structured information or even the possibility to actively trigger an action (like sending an email or adding a row to a spreadsheet), allows for a wide range of new use cases which were not possible before. We are entering a new era of automation, augmentation, and user interaction.

Thanks to our expertise in generative AI solutions and our experience from many projects across all major industries, we can support you in your generative AI transformation. Our offerings include consultancy on use case portfolios, discovery phases for specific use cases, prototyping, as well as implementing and integrating fully fledged bespoke generative AI solutions.


Philipp Morf

Head AI & Data Practice

Dr. Philipp Morf holds a doctorate in engineering from the Swiss Federal Institute of Technology (ETH) and has headed the Artificial Intelligence (AI) and Machine Learning (ML) Solutions division at Zühlke since 2015. As Director of the AI Solutions Centre, he designs effective AI/ML applications and is a sought-after speaker on AI topics in the area of applications and application trends. With his many years of experience as a consultant in innovation management, he bridges the gap between business, technology and the people who use AI.
