
My understanding of the RAG pipeline can be summarized with the following diagram:

[diagram: RAG pipeline, steps 1-13]

I understand that steps 1-7 split an external text data source into chunks and vectorize them, and that steps 8-11 retrieve the n most relevant chunks based on some measure of vector similarity between the text chunks and the user's query.
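For reference, my mental model of steps 1-11 in code is roughly the following sketch, using sentence-transformers embeddings and a plain cosine-similarity search; the model name, chunk size, and top_k value are just illustrative choices:

    # Minimal sketch of the indexing (steps 1-7) and retrieval (steps 8-11) stages.
    # The model name, chunk size, and top_k are illustrative choices.
    from sentence_transformers import SentenceTransformer
    import numpy as np

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def split_into_chunks(text, chunk_size=500):
        # Naive fixed-size splitting; real pipelines usually split on sentences or paragraphs.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    def build_index(document_text):
        chunks = split_into_chunks(document_text)
        vectors = embedder.encode(chunks, normalize_embeddings=True)  # steps 1-7
        return chunks, np.array(vectors)

    def retrieve(query, chunks, vectors, top_k=4):
        query_vec = embedder.encode([query], normalize_embeddings=True)[0]
        scores = vectors @ query_vec              # cosine similarity (embeddings are normalized)
        best = np.argsort(scores)[::-1][:top_k]   # steps 8-11: keep the n most similar chunks
        return [chunks[i] for i in best]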

What I'm not sure of is steps 12-13.

I am currently building a RAG chatbot using Llama 2 and have tried prompting with the following format:

Given the following context: {insert text chunks} and no other information, answer the question: {user input query}.
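In code, what I have tried looks roughly like the sketch below. The [INST] ... [/INST] markers follow the Llama 2 chat format; the model id, generation settings, and helper names are placeholders, not anything authoritative:

    # Stuffing the retrieved chunks into the prompt template I described above.
    def build_prompt(query, context_chunks):
        context = "\n\n".join(context_chunks)
        return (
            f"[INST] Given the following context:\n{context}\n"
            f"and no other information, answer the question: {query} [/INST]"
        )

    # Generation with a locally hosted model (no hosted API), e.g. via transformers:
    # from transformers import pipeline
    # generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
    # answer = generator(build_prompt(user_query, retrieved_chunks),
    #                    max_new_tokens=256)[0]["generated_text"]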

The issue is that I hit max prompt length errors because of the size and number of retrieved text chunks included in the prompt. What solutions are available?

Note: I'm trying not to use APIs like replicate.com to host the model.


1 Answer


There are two generic solutions/post-processing steps to work around hitting the max_token limit, irrespective of the model/LLM you are using with LangChain.

  • Retain only the top_n matches or chunks based on similarity scores, where top_n depends on the average chunk size and the max_token limit of the LLM you are using (see the first sketch after this list).
  • Extract only the relevant part of each document returned from the vector search. You can look into this LangChain blog post on Contextual Compression (see the second sketch after this list).
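A minimal sketch of the first option, assuming a Llama 2 tokenizer from Hugging Face; the token budget and the fit_chunks_to_budget helper name are made up for illustration:

    # Keep only the highest-scoring chunks that still fit within a token budget.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

    def fit_chunks_to_budget(scored_chunks, max_context_tokens=3000):
        # scored_chunks: list of (chunk_text, similarity_score) pairs in any order
        kept, used = [], 0
        for chunk, _score in sorted(scored_chunks, key=lambda x: x[1], reverse=True):
            n_tokens = len(tokenizer.encode(chunk, add_special_tokens=False))
            if used + n_tokens > max_context_tokens:
                break
            kept.append(chunk)
            used += n_tokens
        return kept  # the top_n chunks that fit alongside the question and instructions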
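For the second option, usage looks roughly like the following, based on LangChain's contextual compression documentation; import paths change between LangChain versions, and llm, vectorstore, and user_query are assumed to already exist in your code:

    # Rough outline of contextual compression in LangChain.
    from langchain.retrievers import ContextualCompressionRetriever
    from langchain.retrievers.document_compressors import LLMChainExtractor

    compressor = LLMChainExtractor.from_llm(llm)   # uses the LLM to strip irrelevant text
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=vectorstore.as_retriever(),
    )

    # Each returned document now contains only the passages judged relevant to the
    # question, which shrinks the final prompt.
    compressed_docs = compression_retriever.get_relevant_documents(user_query)

Note that the compression step itself calls the LLM, so it trades extra inference time for a shorter final prompt.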