Implementing Data Isolation in a RAG System in GCP using any of the LLM models

Question

I am developing a Retrieval Augmented Generation (RAG) system in which User-1 and User-2 each have their own set of documents. User-1's queries should be answered only from User-1's documents, with no interference from User-2's data, and vice versa. Data confidentiality and security are crucial for this project.

I am using LangChain with LLM models to structure my data, and I am looking for advice on how to implement data isolation so that one user's private documents are protected from another user. So far I have tried storing each user's documents in a separate directory in GCP Cloud Storage, and I am planning to use a separate collection in the vector DB for each user.

However, both methods have drawbacks in terms of scalability and performance, so I would appreciate suggestions on how to structure data isolation in this scenario. My questions are: Are there any best practices for data isolation? What are some efficient ways to maintain data security while keeping good performance in a RAG-based architecture? Should I consider other Python libraries or tools to achieve better data isolation?

Any help or advice would be greatly appreciated. Thank you all in advance!

score 3 · Answer 1 · answered Apr 24 '24 at 08:33

You can add metadata fields (for example, a user ID) to your documents and then filter on that metadata at query time, so all users can share one collection.
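To make the idea concrete, here is a minimal sketch in plain Python, with no vector DB involved. The `Chunk` class, the `search` function, the `user_id` key, and the toy two-dimensional embeddings are all hypothetical names and values chosen for illustration; a real vector DB would apply the same filter inside its own query engine.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One stored document chunk (hypothetical structure for illustration)."""
    text: str
    embedding: list
    metadata: dict = field(default_factory=dict)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(chunks, query_embedding, user_id, k=2):
    """Return the top-k chunks for one user, ranked by cosine similarity."""
    # Filter on metadata first, so another user's chunks are never ranked
    # or returned, however similar they are to the query.
    allowed = [c for c in chunks if c.metadata.get("user_id") == user_id]
    ranked = sorted(
        allowed,
        key=lambda c: cosine(c.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:k]


# Toy data: embeddings are made up, not produced by a real model.
chunks = [
    Chunk("User-1 contract notes", [1.0, 0.0], {"user_id": "user-1"}),
    Chunk("User-2 salary report", [0.9, 0.1], {"user_id": "user-2"}),
    Chunk("User-1 meeting minutes", [0.0, 1.0], {"user_id": "user-1"}),
]

results = search(chunks, [1.0, 0.0], user_id="user-1")
for chunk in results:
    print(chunk.metadata["user_id"], chunk.text)
```

Note that User-2's chunk is close to the query vector but is still excluded, because the filter runs before the ranking.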

The ability to add metadata to the vector store depends on the actual vector DB used. For instance, you can find how to do that for pgvector here.

Then, to incorporate metadata filters, you can follow this example.
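One thing worth considering when you wire the filter in: it only isolates data if the application sets it, not the client. Below is a rough sketch of a wrapper that always injects the user filter. It assumes a store object exposing `similarity_search(query, k=..., filter=...)`, which is how LangChain's PGVector store accepts metadata filters as far as I know; check the signature for your vector DB. `UserScopedSearch`, `InMemoryStore`, and the `user_id` key are hypothetical names, and `InMemoryStore` is only a stand-in so that the example runs without a database.

```python
from typing import Any, Optional


class UserScopedSearch:
    """Wraps a vector store so every query is restricted to one user."""

    def __init__(self, store: Any, user_id: str):
        if not user_id:
            raise ValueError("user_id is required")
        self._store = store
        self._user_id = user_id  # should come from your auth layer

    def search(self, query: str, k: int = 4, extra_filter: Optional[dict] = None):
        merged = dict(extra_filter or {})
        # Set user_id last so a caller-supplied filter cannot override it.
        merged["user_id"] = self._user_id
        return self._store.similarity_search(query, k=k, filter=merged)


class InMemoryStore:
    """Stand-in store (assumption): matches on metadata only, no embeddings."""

    def __init__(self, docs):
        self.docs = docs  # list of (text, metadata) tuples

    def similarity_search(self, query, k=4, filter=None):
        wanted = filter or {}
        hits = [
            doc for doc in self.docs
            if all(doc[1].get(key) == value for key, value in wanted.items())
        ]
        return hits[:k]


store = InMemoryStore([
    ("User-1 contract notes", {"user_id": "user-1"}),
    ("User-2 salary report", {"user_id": "user-2"}),
])

scoped = UserScopedSearch(store, user_id="user-1")
# Even a filter that asks for user-2 is overridden by the wrapper.
hits = scoped.search("salary", extra_filter={"user_id": "user-2"})
print(hits)
```

With a real store you would build one `UserScopedSearch` per authenticated request and pass only that object to the retrieval code.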
