
My understanding of the RAG pipeline can be summarized with the following diagram:

[diagram: RAG pipeline, steps 1-13]

I understand that steps 1-7 split an external text data source into chunks and vectorize them, and that steps 8-11 retrieve the n most relevant chunks based on some measure of vector similarity between the text chunks and the user's query.
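For reference, my mental model of steps 1-11 in code is roughly the following sketch, using sentence-transformers embeddings and a plain cosine-similarity search; the model name, chunk size, and top_k value are just illustrative choices:

    # Minimal sketch of the indexing (steps 1-7) and retrieval (steps 8-11) stages.
    # The model name, chunk size, and top_k are illustrative choices.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def split_into_chunks(text, chunk_size=500):
        # Naive fixed-size splitting; real pipelines usually split on sentences or paragraphs.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    def build_index(document_text):
        chunks = split_into_chunks(document_text)
        vectors = embedder.encode(chunks, normalize_embeddings=True)  # steps 1-7
        return chunks, np.array(vectors)

    def retrieve(query, chunks, vectors, top_k=4):
        query_vec = embedder.encode([query], normalize_embeddings=True)[0]
        scores = vectors @ query_vec              # cosine similarity (embeddings are normalized)
        best = np.argsort(scores)[::-1][:top_k]   # steps 8-11: keep the n most similar chunks
        return [chunks[i] for i in best]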

What I'm not sure of is steps 12-13.

I am currently building a RAG chatbot using Llama 2 and have tried prompting with the following format:

Given the following context: {insert text chunks} and no other information, answer the question: {user input query}.
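In code, what I have tried looks roughly like the sketch below. The [INST] ... [/INST] markers follow the Llama 2 chat format; the model id, generation settings, and helper names are placeholders, not anything authoritative:

    # Stuffing the retrieved chunks into the prompt template I described above.
    def build_prompt(query, context_chunks):
        context = "\n\n".join(context_chunks)
        return (
            f"[INST] Given the following context:\n{context}\n"
            f"and no other information, answer the question: {query} [/INST]"
        )

    # Generation with a locally hosted model (no hosted API), e.g. via transformers:
    # from transformers import pipeline
    # generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
    # answer = generator(build_prompt(user_query, retrieved_chunks),
    #                    max_new_tokens=256)[0]["generated_text"]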

The issue is that I hit max prompt length errors because of the size and number of retrieved text chunks included in the prompt. What solutions are available?

Note: I'm trying not to use APIs like replicate.com to host the model.


1 Answer


There are two generic solutions/post-processing steps to work around hitting the max_token limit, irrespective of the model/LLM you are using with LangChain.

  • Retain only the top_n matches or chunks based on similarity scores, where top_n depends on the average chunk size and the max_token limit of the LLM you are using (see the first sketch after this list).
  • Extract only the relevant part of each document returned from the vector search. You can look into this LangChain blog post on Contextual Compression (see the second sketch after this list).
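A minimal sketch of the first option, assuming a Llama 2 tokenizer from Hugging Face; the token budget and the fit_chunks_to_budget helper name are made up for illustration:

    # Keep only the highest-scoring chunks that still fit within a token budget.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

    def fit_chunks_to_budget(scored_chunks, max_context_tokens=3000):
        # scored_chunks: list of (chunk_text, similarity_score) pairs in any order
        kept, used = [], 0
        for chunk, _score in sorted(scored_chunks, key=lambda x: x[1], reverse=True):
            n_tokens = len(tokenizer.encode(chunk, add_special_tokens=False))
            if used + n_tokens > max_context_tokens:
                break
            kept.append(chunk)
            used += n_tokens
        return kept  # the top_n chunks that fit alongside the question and instructions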
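For the second option, usage looks roughly like the following, based on LangChain's contextual compression documentation; import paths change between LangChain versions, and llm, vectorstore, and user_query are assumed to already exist in your code:

    # Rough outline of contextual compression in LangChain.
    from langchain.retrievers import ContextualCompressionRetriever
    from langchain.retrievers.document_compressors import LLMChainExtractor

    compressor = LLMChainExtractor.from_llm(llm)   # uses the LLM to strip irrelevant text
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=vectorstore.as_retriever(),
    )

    # Each returned document now contains only the passages judged relevant to the
    # question, which shrinks the final prompt.
    compressed_docs = compression_retriever.get_relevant_documents(user_query)

Note that the compression step itself calls the LLM, so it trades extra inference time for a shorter final prompt.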