
Sorry for the broad and naive question, but the structure I have in mind is as follows:

  1. Extract the text from a large collection of Documents with varying types.
    • For this part I plan to use Apache Tika; for simplicity, let's assume it works for most use cases.
  2. Chunk the text content of each document and run it through a Text Embedding model to generate vectors. This will serve as a simple underpinning for semantic search.
  3. Store the Embeddings for each Document with a reference to a unique Document Id.
  4. [UNCERTAIN HERE] Now, I'd like to run the raw text chunks from each document through another model to identify email addresses and names (this could also happen as a parallel process in Step 1). The aim would be to create a Record Linkage of all emails down to a distinct Entity, so that [bdip@job.com, bob.dip@personal.com] would get roughly captured and identified as the same person. The desired outcome would be a way to run a query on my Embeddings along the lines of this pseudocode:
  • "Find all documents with text LIKE 'XYZ' AND ENTITY_ID = BOB"
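For concreteness, here is how I picture steps 2–3, with a hash-based stub standing in for a real embedding model (the chunk sizes, dimensions, and row format are just placeholders):

```python
import hashlib

def chunk(text, size=500, overlap=50):
    """Split extracted text into overlapping character windows."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def embed(chunk_text, dim=8):
    """Placeholder: a real pipeline would call a text-embedding model here."""
    digest = hashlib.sha256(chunk_text.encode()).digest()
    return [b / 255 for b in digest[:dim]]

# Step 3: store embeddings keyed by a unique Document Id
index = []  # rows destined for a vector store
doc_id = "doc-001"
for i, c in enumerate(chunk("some extracted Tika text " * 40)):
    index.append({"doc_id": doc_id, "chunk_no": i, "vector": embed(c)})
```

The important part for my question is just that every embedding row carries the `doc_id` it came from.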

After running one model for Text Embeddings on a set of documents and a second model for Record Linkage of emails on the same documents, what is the best-practice approach for creating a model that relates Entities from the Record Linkage model to all the documents where variations of that Entity appear? Do I even need a third model here, or, assuming the Entities and the Text Embeddings both link to a DocId, can I simply join the vector similarity query for the semantic text to the entity-existence query by DocId?
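To make the second option concrete, here is an in-memory stand-in for the join I have in mind (in production this would be a vector store's similarity search plus a relational filter; all table and id names are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "tables": one from the embedding model, one from the linkage model
embeddings = [
    {"doc_id": "d1", "vector": [1.0, 0.0]},
    {"doc_id": "d2", "vector": [0.9, 0.1]},
    {"doc_id": "d3", "vector": [0.0, 1.0]},
]
entity_mentions = [
    {"doc_id": "d1", "entity_id": "BOB"},
    {"doc_id": "d3", "entity_id": "BOB"},
    {"doc_id": "d2", "entity_id": "ALICE"},
]

def search(query_vec, entity_id, top_k=5):
    """Vector similarity restricted to docs where the Entity appears."""
    allowed = {m["doc_id"] for m in entity_mentions
               if m["entity_id"] == entity_id}
    hits = [e for e in embeddings if e["doc_id"] in allowed]
    hits.sort(key=lambda e: cosine(query_vec, e["vector"]), reverse=True)
    return [e["doc_id"] for e in hits[:top_k]]

# "text LIKE 'XYZ' AND ENTITY_ID = BOB" then becomes:
results = search([1.0, 0.0], "BOB")
```

If this join is all that's needed, then no third model is involved, only two tables sharing a DocId.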

Finally, if you've made it this far (thank you), what Model(s) would you consider for Record Linkage by email?
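For reference, this is the kind of cheap baseline I'd compare any Record Linkage model against: normalize the email local part and fuzzy-match it (standard-library difflib; the 0.8 threshold is an arbitrary guess):

```python
import re
from difflib import SequenceMatcher

def local_key(email):
    """Normalize the local part: lowercase, drop +tags, strip separators/digits."""
    local = email.split("@")[0].lower()
    local = local.split("+")[0]          # drop gmail-style +tags
    return re.sub(r"[.\d_-]", "", local)

def same_person(e1, e2, threshold=0.8):
    """Heuristic: similar normalized local parts => likely the same Entity."""
    ratio = SequenceMatcher(None, local_key(e1), local_key(e2)).ratio()
    return ratio >= threshold

matched = same_person("bdip@job.com", "bob.dip@personal.com")
```

Obviously this ignores names found in the document text, which is why I'm asking what real models people would use.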

john_mc
