My understanding of the RAG pipeline can be summarized with the following diagram:
I understand that steps 1-7 split an external text data source into chunks and vectorize them, and that steps 8-11 retrieve the n most relevant chunks based on some measure of vector similarity between the text chunks and the user's input query.
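To check my understanding of the retrieval step (steps 8-11), here's a toy sketch of ranking chunks by cosine similarity. The vectors are made-up placeholders; a real pipeline would get them from an embedding model:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_n(query_vec, chunk_vecs, n=2):
    # Indices of the n chunks most similar to the query
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:n]

# Toy 2-d "embeddings" just to illustrate the ranking
chunk_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query_vec = [1.0, 0.05]
print(top_n(query_vec, chunk_vecs))  # → [0, 1]
```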
What I'm not sure about is steps 12-13.
I am currently building a RAG chatbot using Llama 2 and have tried prompting with the following format:
Given the following context: {insert text chunks} and no other information, answer the question: {user input query}.
The issue is that I hit max prompt length errors because of the size and number of retrieved text chunks included in the prompt. What solutions are available?
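To make the failure concrete, here's a minimal sketch of roughly how my prompt gets assembled (the chunk data is a placeholder, and I'm using a crude 4-characters-per-token estimate rather than the real Llama tokenizer):

```python
MAX_CONTEXT = 4096  # Llama 2's context window, in tokens

def build_prompt(chunks, query):
    # Concatenate all retrieved chunks into a single context block
    context = " ".join(chunks)
    return (f"Given the following context: {context} "
            f"and no other information, answer the question: {query}")

def approx_tokens(text):
    # Rough ~4-chars-per-token heuristic, not the real tokenizer
    return len(text) // 4

# Hypothetical retrieval result: 20 chunks of ~1500 characters each
chunks = ["lorem ipsum " * 125 for _ in range(20)]
prompt = build_prompt(chunks, "What does the document say about X?")

# The estimated prompt size far exceeds the context window
print(approx_tokens(prompt), ">", MAX_CONTEXT)
```

With even modest chunk sizes and counts, the assembled prompt blows past the 4096-token window, which is where the error comes from.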
Note: I'm trying to avoid hosted APIs like replicate.com for serving the model.