I am building a Retrieval Augmented Generation (RAG) system in which User-1 and User-2 each have their own set of documents. User-1's queries must be answered only from User-1's documents, with no leakage from User-2's data, and vice versa. Data confidentiality and security are critical for this project. I am using LangChain with LLMs to structure my data, and I am looking for advice on how to implement data isolation so that one user's private documents are never exposed to another user.

So far I have tried or planned the following approaches:

1. Storing each user's documents in a separate directory in Google Cloud Storage.
2. Using a separate collection per user in the vector DB (planned, not yet implemented).
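To make the two approaches concrete, here is a minimal sketch of the idea, not a real implementation. `IsolatedStore` and `blob_prefix` are hypothetical names; the store is a plain in-memory stand-in for a vector DB with one collection per user, keyword overlap stands in for embedding similarity, and the `users/<user_id>/documents/` layout is an assumed naming scheme. None of this is LangChain's or any vector DB's actual API.

```python
from collections import defaultdict
from typing import Dict, List


class IsolatedStore:
    """Hypothetical in-memory stand-in for a vector DB with one collection per user."""

    def __init__(self) -> None:
        # Maps user_id -> that user's documents (one "collection" per user).
        self._collections: Dict[str, List[str]] = defaultdict(list)

    def add(self, user_id: str, text: str) -> None:
        """Store a document in the given user's collection only."""
        self._collections[user_id].append(text)

    def search(self, user_id: str, query: str, k: int = 3) -> List[str]:
        """Return up to k matching documents, scanning only the caller's collection."""
        terms = set(query.lower().split())
        docs = self._collections.get(user_id, [])
        # Keyword overlap is a placeholder for real embedding similarity.
        scored = [(len(terms & set(doc.lower().split())), doc) for doc in docs]
        matches = [pair for pair in scored if pair[0] > 0]
        matches.sort(key=lambda pair: -pair[0])
        return [doc for _, doc in matches[:k]]


def blob_prefix(user_id: str) -> str:
    """Assumed storage layout: each user's files live under their own prefix."""
    return f"users/{user_id}/documents/"


store = IsolatedStore()
store.add("user-1", "quarterly revenue report")
store.add("user-2", "quarterly revenue forecast secret")

# User-1's query can only ever see User-1's collection.
print(store.search("user-1", "quarterly revenue"))
print(blob_prefix("user-1"))
```

The key design point is that `user_id` is a required argument on every read and write, so there is no code path that searches across users. In a real system that identifier would need to come from the authenticated session rather than from user-supplied input.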
However, both methods have drawbacks in terms of scalability and performance. I would appreciate any suggestions from the community on how to structure data isolation in this scenario. My specific questions are:

1. Are there any best practices for per-user data isolation?
2. What are efficient ways to maintain data security while keeping good performance in a RAG-based system?
3. Should I consider other Python libraries or tools to achieve better data isolation?

Any help or advice would be greatly appreciated. Thank you all in advance!