With insights from Silvan Melchior, Lead Data Scientist (silvan.melchior@zuehlke.com), and Dr. Gabriel Krummenacher, Head of Data Science (gabriel.krummenacher@zuehlke.com).

More than one year after the release of ChatGPT, we can easily identify the Large Language Model (LLM) use case that has had the most impact: chatbot-based Q&A systems on proprietary customer data. The demand for it stretches across all industries. Together with an insurance company, we built a system that can answer questions like "Is my watch covered by my current policy?". With a bank, we developed a solution that answers questions about regulations. And with a telecom provider, we built a system that helps deal with everyday requests from their customers.

The technology behind all these use cases is called Retrieval Augmented Generation (RAG). It combines an LLM with a retrieval system, so the LLM can search for relevant information outside of its training data and then generate an informed response to any query. This approach makes it easy to update or extend the knowledge of a model without fine-tuning, and it drastically reduces hallucinations. It has already established itself as the key ingredient of many new LLM-powered products, most notably Microsoft Copilot.

While these products can provide value for companies in some scenarios, we have often observed situations in which they fell short:

- An off-the-shelf solution may not produce satisfactory outcomes, such as providing incorrect responses, particularly if the application involves more complex requirements such as domain-specific data or inherent structure within the data.
- The data may reside in a location that is inaccessible to standard, off-the-shelf tools due to technical and/or legal constraints, such as proprietary software environments or on-premises infrastructure.
- The data may be of a heterogeneous type, such as databases, diagrams, or forms, which might prevent it from being used by an off-the-shelf tool.
- The licensing scheme and cost structure may not align with the requirements of the organisation.

A custom solution built using RAG can effectively address these constraints and still lets your company profit from the advances in generative AI of the last months and years. Depending on the situation, this can be an extension of an existing solution (for example the OpenAI Assistants API) or a fully bespoke solution, potentially using open-source large language models deployed on any infrastructure. In the remainder of this blog post, we will first discuss the basic principles of RAG from a technical perspective. We will then identify the shortcomings of the basic approach and show how we overcame them in different projects.

Exemplary use cases for RAG systems

How does RAG work?

RAG combines an LLM with an information retrieval system. For every user question, this system is first used to find information which might answer the question. This information is then fed, together with the user question, to the LLM, so it can generate an informed response.

The retrieval system is usually built using so-called embeddings. An embedding takes a snippet of text and maps it into a mathematical vector space, in a way that places texts about similar topics close to each other in this space. This allows us to search for information by embedding both all the documents and the user query: the texts closest to the query in the vector space likely contain information relevant to it.

One important ingredient of this approach is so-called chunking: we take our texts and split them into smaller parts (chunks) of, for example, a few hundred words each. These chunks are then embedded separately, so we can search not just for a whole text but for the relevant parts of it.

A basic RAG system

While this simple setup works surprisingly well in many situations, it falls short in others.
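The basic setup described above can be sketched in a few lines of Python. All names (`chunk`, `embed`, `retrieve`) and the example documents are hypothetical illustrations; in particular, `embed` is a toy word-count stand-in for a real neural embedding model such as a sentence transformer.

```python
# A minimal sketch of the basic RAG retrieval step: split documents into
# chunks, embed each chunk, then return the chunks closest to the embedded
# user query. embed() is a toy bag-of-words stand-in for a real model.
import math
import re
from collections import Counter

def chunk(text: str, size: int = 20) -> list[str]:
    """Split a text into chunks of roughly `size` words each."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Toy 'embedding': a word-count vector. A real system calls a neural model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks closest to the query in the (toy) vector space."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

docs = [
    "Your household policy covers damage to furniture and electronics inside the home.",
    "Valuables such as watches and jewellery are covered up to a limit of 5000 francs.",
    "Claims must be reported within 30 days of the incident via the customer portal.",
]
all_chunks = [c for doc in docs for c in chunk(doc)]

# Note: the toy embedding only matches exact words ('watches', not 'watch');
# a real embedding model would also capture such near-synonyms.
context = retrieve("Are watches covered by my policy?", all_chunks)
# `context` is then inserted into the LLM prompt together with the question.
```

In a production system, the chunk embeddings would be precomputed and stored in a vector database, so that only the query needs to be embedded at request time.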
Usually, the following challenges can be identified:

- Embedding-based search often fails: While embeddings are good at capturing the meaning of synonyms and the like, they are not perfect. For certain types of data, such as legal text or company names, some embedding models even perform quite badly. Thus, as your information base grows, the likelihood of finding the correct chunks decreases.
- Chunks miss context: Even if the correct chunk is found, it is still only a small part of an overall text. The missing context around it might lead the LLM to interpret its content incorrectly.
- One-shot approach prevents proper search: The retrieval system has one chance to find the correct information. If the user phrased the query in an unusual manner, it might fail. If the found information raises a follow-up question, the system cannot pursue it.

How to improve RAG performance

In the cases mentioned above, the model either does not provide an answer at all or, worse, provides a wrong one. There are multiple extensions or adjustments of the basic RAG setup which, depending on the problem, can help.

Potential improvements in RAG systems

Suitable model

Unsurprisingly, your model selection heavily impacts the final performance. LLMs have a limit on how much information they can process at once, so we can only provide a certain number of chunks to help answer the question. Using a model with a larger context length is an easy way to improve a RAG system, since including more chunks decreases the likelihood of missing the relevant ones. Luckily, newer models have drastically increased the context length: GPT-4, for example, was updated in November 2023 to support 128'000 tokens, which corresponds to roughly 200 pages of text. And in February 2024, Google announced a breakthrough in their Gemini series, soon supporting one million tokens.

However, there is a caveat to this solution.
The longer the context, the harder it is for a model to find the relevant information in it. Furthermore, it also matters where in the context the information is provided; information in the middle, for example, is usually weighted less. Last but not least, inference with very long contexts requires a lot of resources. Thus, a trade-off must be made based on careful evaluation.

Context length is not the only selection criterion for your model. If you have highly specific data, for example medical documents, a model trained on such data might outperform a more general one, even if it has a smaller context length.

Better chunking

Chunking, the process of cutting the information into small, searchable parts, heavily impacts retrieval performance. If a chunk is too small, the LLM has a hard time interpreting it because the context (the original text) is missing. If a chunk is too large, the embedding vector becomes very general because the text starts to cover different topics, and search performance decreases. Again, a use-case-specific evaluation must be made to find the optimal chunk size and overlap. Furthermore, adaptive chunking can be used, so that not all chunks are the same size. Adaptive chunking usually considers document structure (e.g. paragraphs) and might use embeddings again to measure the similarity of the topics discussed in different passages of the text.

Improved search

As already discussed, embedding-based search has its limits. A common way to improve it is to combine it with more classical, keyword-based search. This combination, often called hybrid search, usually outperforms pure embedding-based search.

Since the number of chunks which can be fed into the LLM is limited, a reranking step often takes place: more chunks than fit into the LLM context are first retrieved, and the most fitting ones among them are then identified with a separate machine learning model.
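One common way to implement the fusion step of hybrid search is reciprocal rank fusion (RRF), which combines the two rankings purely by rank position. The sketch below is illustrative: the chunk ids and the two input rankings are hypothetical, and `k = 60` is just a conventional default.

```python
# Reciprocal rank fusion (RRF): merge an embedding-based ranking and a
# keyword-based ranking into one list. Each chunk receives the score
# sum(1 / (k + rank)) over all rankings in which it appears, so chunks
# found by both systems rise to the top.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several rankings of chunk ids into a single ranked list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for one query:
embedding_hits = ["c7", "c2", "c9", "c4"]   # nearest neighbours in vector space
keyword_hits   = ["c2", "c5", "c7"]         # e.g. BM25 matches

fused = rrf([embedding_hits, keyword_hits])
# c2 and c7 appear in both rankings and therefore lead the fused list.
```

A reranking model can then be applied to the top of this fused list, exactly as described above: RRF is cheap enough to run over many candidates, while the reranker only sees the shortlist.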
This reranking model is too expensive to consider every chunk in the data, but still cheap enough to analyse more chunks than would fit into the LLM context.

Finally, the query with which we search can also be improved. By default, the user's question is used. However, we can also ask another large language model to first reformulate this question into a more fitting search term. This can be anything from specialised keywords to multiple questions to potential answers.

Agents

As mentioned, the retrieval system has one chance to find the right information. A more general approach to Q&A systems changes this by putting a large language model in the lead of the whole information retrieval process. The model can decide to actively search for information, look at the results, and then search again with different words or for another topic if necessary. This is done with so-called agents, where a large language model not only talks to the user but can also decide to use a tool, like a search engine (our retrieval system) or even a database of structured information. This paradigm can be very powerful in certain scenarios, but usually only works satisfactorily if the large language model has been trained to work with external tools.

Agent-based RAG system

Conclusion

We are confident that the combination of LLMs with proprietary information will be a key ingredient in making them more useful. Access to tools like a search engine, but also to databases of structured information or even the possibility to actively trigger an action (like sending an email or adding a row to a spreadsheet), allows for a wide range of new use cases which were not possible before. We are entering a new era of automation, augmentation, and user interaction.

Thanks to our expertise in generative AI solutions and our experience in many projects across all major industries, we can support you in your generative AI transformation.
Our offerings include consultancy on use case portfolios, discovery phases for specific use cases, prototyping, as well as implementing and integrating fully fledged bespoke generative AI solutions.

Contact person for Switzerland

Philipp Morf, Head AI & Data Practice

Dr. Philipp Morf holds a doctorate in engineering from the Swiss Federal Institute of Technology (ETH) and has headed the Artificial Intelligence (AI) and Machine Learning (ML) Solutions division at Zühlke since 2015. As Director of the AI Solutions Centre, he designs effective AI/ML applications and is a sought-after speaker on AI topics in the area of applications and application trends. With his many years of experience as a consultant in innovation management, he bridges the gap between business, technology, and the people who use AI.

Contact: philipp.morf@zuehlke.com, +41 43 216 6588