Retrieval-Augmented Generation for Identifying Similar and Duplicate Controls in a Dataset

Question

I'm exploring the feasibility of implementing a Retrieval-Augmented Generation (RAG) system for a specific use case: identifying similar and duplicate controls in a dataset by comparing each control with every other control. The implementation should calculate a similarity score for each pair of controls, treating pairs that score between 80 and 87 as similar and pairs that score above 95 as duplicates.
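To make the pairwise comparison and the two score bands concrete, here is a minimal sketch. It is not a RAG pipeline: it uses cosine similarity over plain token counts as a stand-in for whatever embedding model would actually be chosen, and scales the result to 0-100. The function names (`tokenize`, `cosine_score`, `classify`, `find_pairs`) and the sample controls are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_score(a, b):
    """Cosine similarity of two token-count vectors, scaled to 0-100."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 100.0 * dot / norm if norm else 0.0


def classify(score):
    """Apply the bands from the question; other scores get no label."""
    if score > 95:
        return "duplicate"
    if 80 <= score <= 87:
        return "similar"
    return None  # includes the 87-95 range, which the question leaves unassigned


def find_pairs(controls):
    """controls: dict of control id -> text. Returns (id_a, id_b, score, label) tuples."""
    vectors = {cid: tokenize(text) for cid, text in controls.items()}
    results = []
    # combinations() visits each unordered pair exactly once
    for id_a, id_b in combinations(vectors, 2):
        score = cosine_score(vectors[id_a], vectors[id_b])
        label = classify(score)
        if label:
            results.append((id_a, id_b, round(score, 1), label))
    return results


# Hypothetical sample data, for illustration only
controls = {
    "C1": "Review user access quarterly",
    "C2": "review user access quarterly",
    "C3": "Encrypt backups at rest",
}
print(find_pairs(controls))
```

Comparing every control with every other one is quadratic in the number of controls, so for a large dataset a nearest-neighbour lookup over embeddings may be needed instead of the exhaustive loop shown here.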

The dataset is available in a CSV file format, and the desired output includes displaying relevant pairs of similar and duplicate controls based on the input prompt provided.
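A rough sketch of the input and output side, assuming the CSV has one control per row. The column names `control_id` and `description` are assumptions, since the question does not describe the file layout, and `placeholder_score` (based on the standard library's `difflib.SequenceMatcher`) only stands in for a real similarity measure such as embedding cosine similarity.

```python
import csv
import io
from difflib import SequenceMatcher
from itertools import combinations


def load_controls(csv_file, id_col="control_id", text_col="description"):
    """Read controls from an open CSV file into a dict of id -> text.

    The column names are assumed; adjust them to the real header row.
    """
    reader = csv.DictReader(csv_file)
    return {row[id_col]: row[text_col] for row in reader}


def placeholder_score(a, b):
    """Character-level similarity on a 0-100 scale (placeholder only)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def report_pairs(controls, score_fn=placeholder_score):
    """Return one formatted line per similar or duplicate pair."""
    lines = []
    for (id_a, text_a), (id_b, text_b) in combinations(controls.items(), 2):
        score = score_fn(text_a, text_b)
        if score > 95:
            label = "duplicate"
        elif 80 <= score <= 87:
            label = "similar"
        else:
            continue  # outside both bands given in the question
        lines.append(f"{id_a} <-> {id_b}: {label} ({score:.1f})")
    return lines


# In-memory stand-in for the real file; replace with open("controls.csv", newline="")
sample = io.StringIO(
    "control_id,description\n"
    "C1,Review user access quarterly\n"
    "C2,Review user access quarterly\n"
    "C3,Encrypt backups at rest\n"
)
controls = load_controls(sample)
for line in report_pairs(controls):
    print(line)
```

Passing the scoring function as a parameter keeps the CSV handling and reporting independent of whichever similarity method is eventually used.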

The prompt given is: "Identify Similar and Duplicate controls for each control with every other control present in the dataset, also calculate similarity score for similar controls between a threshold of 80-87 and duplicate controls exceeding a threshold of 95"

I'm seeking insights, suggestions, or any guidance on how to proceed with implementing this Retrieval-Augmented Generation system for this use case. Any pointers or references for this use case would be greatly appreciated. Thank you!

