With insights from Silvan Melchior, Lead Data Scientist (silvan.melchior@zuehlke.com), and Dr. Gabriel Krummenacher, Head of Data Science (gabriel.krummenacher@zuehlke.com).

More than one year after the release of ChatGPT, we can easily identify the Large Language Model (LLM) use case that has had the most impact: chatbot-based Q&A systems on proprietary customer data. The demand for it stretches across all industries. Together with an insurance company, we built a system that can answer questions like "Is my watch covered by my current policy?". With a bank, we developed a solution that answers questions about regulations. And with a telecom provider, we built a system that helps deal with everyday requests from their customers.

The technology behind all these use cases is called Retrieval Augmented Generation (RAG). It combines an LLM with a retrieval system, so the LLM can search for relevant information outside of its training data and then generate an informed response to any query. This approach makes it easy to update or extend the knowledge of a model without fine-tuning, and it drastically reduces hallucinations. It has already established itself as the key ingredient of many new LLM-powered products, most notably Microsoft Copilot.

While these products can provide value for companies in some scenarios, we have often observed situations in which they fell short:

- An off-the-shelf solution may not produce satisfactory outcomes, such as providing incorrect responses, particularly if the application involves more complex requirements such as domain-specific data or inherent structure within the data.
- The data may reside in a location that is inaccessible to standard, off-the-shelf tools due to technical and/or legal constraints, such as proprietary software environments or on-premises infrastructure.
- The data may be of a heterogeneous type, such as databases, diagrams, or forms, which might prevent it from being used by an off-the-shelf tool.
- The licensing scheme and cost structure may not align with the requirements of the organisation.

A custom solution built using RAG can effectively address these constraints and still lets your company profit from the advances in generative AI of the last months and years. Depending on the situation, this can be an extension of an existing solution (for example the OpenAI Assistants API) or a fully bespoke solution, potentially using open-source large language models deployed on any infrastructure. In the remainder of this blog post, we will first discuss the basic principles of RAG from a technical perspective. We will then identify the shortcomings of the basic approach and show how we overcame them in different projects.

Exemplary use cases for RAG systems

How does RAG work?

RAG combines an LLM with an information retrieval system. For every user question, this system is first used to find information which might answer the question. This information is then fed, together with the user question, to the LLM, so it can generate an informed response.

The retrieval system is usually built using so-called embeddings. An embedding takes a snippet of text and maps it into a mathematical vector space, in a way that places texts about similar topics close to each other in this space. This allows us to search for information by embedding both all the documents and the user query: the texts closest to the query in the vector space likely contain information relevant to it.

One important ingredient of this approach is so-called chunking: we take our texts and split them into smaller parts (chunks) of, for example, a few hundred words each. These chunks are then embedded separately, so we can search not just for a whole text but for the relevant parts of it.

A basic RAG system

While this simple setup works surprisingly well in many situations, it falls short in others.
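The basic setup described above can be sketched in a few lines of Python. All names (`chunk`, `embed`, `retrieve`) and the example documents are hypothetical illustrations; in particular, `embed` is a toy word-count stand-in for a real neural embedding model such as a sentence transformer.

```python
# A minimal sketch of the basic RAG retrieval step: split documents into
# chunks, embed each chunk, then return the chunks closest to the embedded
# user query. embed() is a toy bag-of-words stand-in for a real model.
import math
import re
from collections import Counter

def chunk(text: str, size: int = 20) -> list[str]:
    """Split a text into chunks of roughly `size` words each."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Toy 'embedding': a word-count vector. A real system calls a neural model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks closest to the query in the (toy) vector space."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

docs = [
    "Your household policy covers damage to furniture and electronics inside the home.",
    "Valuables such as watches and jewellery are covered up to a limit of 5000 francs.",
    "Claims must be reported within 30 days of the incident via the customer portal.",
]
all_chunks = [c for doc in docs for c in chunk(doc)]

# Note: the toy embedding only matches exact words ('watches', not 'watch');
# a real embedding model would also capture such near-synonyms.
context = retrieve("Are watches covered by my policy?", all_chunks)
# `context` is then inserted into the LLM prompt together with the question.
```

In a production system, the chunk embeddings would be precomputed and stored in a vector database, so that only the query needs to be embedded at request time.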
Usually, the following challenges can be identified:

- Embedding-based search often fails: While embeddings are good at capturing the meaning of synonyms and the like, they are not perfect. For certain types of data, such as legal text or company names, some embedding models even perform quite badly. Thus, as your information base grows, the likelihood of finding the correct chunks decreases.
- Chunks miss context: Even if the correct chunk is found, it is still only a small part of an overall text. The missing context around it might lead the LLM to interpret its content incorrectly.
- One-shot approach prevents proper search: The retrieval system has one chance to find the correct information. If the user phrased the query in an unusual manner, it might fail. If the found information raises a follow-up question, the system cannot pursue it.

How to improve RAG performance

In the cases mentioned above, the model either does not provide an answer at all or, worse, provides a wrong one. There are multiple extensions or adjustments of the basic RAG setup which, depending on the problem, can help.

Potential improvements in RAG systems

Suitable model

Unsurprisingly, your model selection heavily impacts the final performance. LLMs have a limit on how much information they can process at once, so we can only provide a certain number of chunks to help answer the question. Using a model with a larger context length is an easy way to improve a RAG system, since including more chunks decreases the likelihood of missing the relevant ones. Luckily, newer models have drastically increased the context length: GPT-4, for example, was updated in November 2023 to support 128'000 tokens, which corresponds to roughly 200 pages of text. And in February 2024, Google announced a breakthrough in their Gemini series, soon supporting one million tokens.

However, there is a caveat to this solution.
The longer the context, the harder it is for a model to find the relevant information in it. Furthermore, it also matters where in the context the information is provided; information in the middle, for example, is usually weighted less. Last but not least, inference with very long contexts requires a lot of resources. Thus, a trade-off must be made based on careful evaluation.

Context length is not the only selection criterion for your model. If you have highly specific data, for example medical documents, a model trained on such data might outperform a more general one, even if it has a smaller context length.

Better chunking

Chunking, the process of cutting the information into small, searchable parts, heavily impacts retrieval performance. If a chunk is too small, the LLM has a hard time interpreting it because the context (the original text) is missing. If a chunk is too large, the embedding vector becomes very general because the text starts to cover different topics, and search performance decreases. Again, a use-case-specific evaluation must be made to find the optimal chunk size and overlap. Furthermore, adaptive chunking can be used, so that not all chunks are the same size. Adaptive chunking usually considers document structure (e.g. paragraphs) and might use embeddings again to measure the similarity of the topics discussed in different passages of the text.

Improved search

As already discussed, embedding-based search has its limits. A common way to improve it is to combine it with more classical, keyword-based search. This combination, often called hybrid search, usually outperforms pure embedding-based search.

Since the number of chunks which can be fed into the LLM is limited, a reranking step often takes place: more chunks than fit into the LLM context are first retrieved, and the most fitting ones among them are then identified with a separate machine learning model.
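One common way to implement the fusion step of hybrid search is reciprocal rank fusion (RRF), which combines the two rankings purely by rank position. The sketch below is illustrative: the chunk ids and the two input rankings are hypothetical, and `k = 60` is just a conventional default.

```python
# Reciprocal rank fusion (RRF): merge an embedding-based ranking and a
# keyword-based ranking into one list. Each chunk receives the score
# sum(1 / (k + rank)) over all rankings in which it appears, so chunks
# found by both systems rise to the top.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several rankings of chunk ids into a single ranked list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for one query:
embedding_hits = ["c7", "c2", "c9", "c4"]   # nearest neighbours in vector space
keyword_hits   = ["c2", "c5", "c7"]         # e.g. BM25 matches

fused = rrf([embedding_hits, keyword_hits])
# c2 and c7 appear in both rankings and therefore lead the fused list.
```

A reranking model can then be applied to the top of this fused list, exactly as described above: RRF is cheap enough to run over many candidates, while the reranker only sees the shortlist.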
This reranking model is too expensive to consider every chunk in the data, but still cheap enough to analyse more chunks than would fit into the LLM context.

Finally, the query with which we search can also be improved. By default, the user's question is used. However, we can also ask another large language model to first reformulate this question into a more fitting search term. This can be anything from specialised keywords to multiple questions to potential answers.

Agents

As mentioned, the retrieval system has one chance to find the right information. A more general approach to Q&A systems changes this by putting a large language model in the lead of the whole information retrieval process. The model can decide to actively search for information, look at the results, and then search again with different words or for another topic if necessary. This is done with so-called agents, where a large language model not only talks to the user but can also decide to use a tool, like a search engine (our retrieval system) or even a database of structured information. This paradigm can be very powerful in certain scenarios, but usually only works satisfactorily if the large language model has been trained to work with external tools.

Agent-based RAG system

Conclusion

We are confident that the combination of LLMs with proprietary information will be a key ingredient in making them more useful. Access to tools like a search engine, but also to databases of structured information or even the possibility to actively trigger an action (like sending an email or adding a row to a spreadsheet), allows for a wide range of new use cases which were not possible before. We are entering a new era of automation, augmentation, and user interaction.

Thanks to our expertise in generative AI solutions and our experience in many projects across all major industries, we can support you in your generative AI transformation.
Our offerings include consultancy on use case portfolios, discovery phases for specific use cases, prototyping, as well as implementing and integrating fully fledged bespoke generative AI solutions.

Contact person for Switzerland

Philipp Morf, Head AI & Data Practice

Dr. Philipp Morf holds a doctorate in engineering from the Swiss Federal Institute of Technology (ETH) and has headed the Artificial Intelligence (AI) and Machine Learning (ML) Solutions division at Zühlke since 2015. As Director of the AI Solutions Centre, he designs effective AI/ML applications and is a sought-after speaker on AI topics in the area of applications and application trends. With his many years of experience as a consultant in innovation management, he bridges the gap between business, technology, and the people who use AI.

Contact: philipp.morf@zuehlke.com, +41 43 216 6588