
I've asked on Stack Overflow already (here), but I figured that the approach of storing embeddings in an ordinary Postgres database might be flawed from the very beginning. I will briefly sketch out the application again:

  • text corpora (few hundred thousand documents, containing a few paragraphs)
  • embeddings created with BERT (one for each paragraph)
  • Application: similarity search (retrieve similar paragraphs and reference to the document)

I've seen tutorials about creating embeddings with BERT etc., and it all works. The crux is how to manage a few million embeddings and search for similar ones: where to store them, plus the additional information (the raw text related to each embedding and the document which contains the text).
So the question is:
How does one store a few million embeddings (768-dimensional numpy arrays) in an efficient and searchable way without using cloud environments (data privacy reasons)?
Are TensorFlow Records the right answer?
Is it in the end a relational database?
Is it something different? It's my first NLP task and I might simply not know the obvious answer. However, searching on Stack Exchange and Google didn't provide a solution.

Angus

3 Answers


There's the Milvus search engine, which utilizes several prominent approximate k-NN libraries such as FAISS, Annoy, and HNSW. It also handles bookkeeping, clustering, data integrity, and other tasks that you probably don't want to handle yourself. All for a performance price, of course, but if you don't want to pay it, you can always pick one of the "barebones" libraries directly.
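
To give an idea of what using one of those barebones libraries directly looks like, here is a minimal Python sketch with FAISS using an IVF index (it clusters the vectors and only scans a few clusters per query). The corpus size and random vectors are placeholders for your real data:

    import numpy as np
    import faiss  # pip install faiss-cpu

    d = 768                       # BERT embedding dimension
    n = 1_000_000                 # illustrative corpus size
    embeddings = np.random.rand(n, d).astype("float32")  # placeholder for real vectors

    # IVF index: k-means clusters the vectors into nlist cells,
    # and a query only scans the nprobe closest cells
    nlist = 4096
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, nlist)
    index.train(embeddings)       # learn the cluster centroids
    index.add(embeddings)         # row id i corresponds to paragraph i
    index.nprobe = 16             # speed/recall trade-off at query time

    distances, ids = index.search(embeddings[:5], 10)  # top-10 neighbours per query

The returned ids are just row positions, so a separate mapping from id to raw paragraph and source document (a plain table or dict) still has to live alongside the index.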

SimpleV

Why don't you cluster similar embeddings and then use hashing to search faster? You can then store them anywhere, maybe in a big-data HDFS distributed system for faster retrieval, or simply as hashed clusters in a database if you are in a research or POC environment.

I have also seen some other techniques for information retrieval, in which you apply TF-IDF or simpler search techniques to first filter out the text of interest and then work on the 768-dimensional embeddings. This way is faster if search is your primary target.
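
A rough Python sketch of that two-stage idea, assuming the paragraphs and their BERT embeddings are already in memory (the names and sizes here are illustrative, not a specific implementation):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    paragraphs = ["first paragraph ...", "second paragraph ..."]          # your corpus
    embeddings = np.random.rand(len(paragraphs), 768).astype("float32")   # placeholder

    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(paragraphs)

    def search(query_text, query_embedding, prefilter_k=1000, final_k=10):
        # Stage 1: cheap TF-IDF scoring over the whole corpus
        scores = (tfidf_matrix @ vectorizer.transform([query_text]).T).toarray().ravel()
        candidates = np.argsort(scores)[::-1][:prefilter_k]

        # Stage 2: cosine similarity on the 768-dim embeddings of the candidates only
        sims = cosine_similarity(query_embedding.reshape(1, -1),
                                 embeddings[candidates]).ravel()
        return candidates[np.argsort(sims)[::-1][:final_k]]  # indices into `paragraphs`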

Sandeep Bhutani

My answer would be that it depends on your creativity. I've seen people storing them in numpy files, pickle files, graph databases, etc.

So I would say it doesn't matter much where you store them; it's your code that needs to adapt to the stored files.

For similarity search, you can use indexing algorithms to speed it up; FAISS is one solution to this.
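
For example, a minimal Python sketch of the "store it anywhere, adapt your code" approach: keep the vectors in a .npy file with the raw text/document mapping stored next to them, then load both and do a brute-force cosine search (file names and layout are made up for illustration; an index such as FAISS can replace the brute-force step once it gets too slow):

    import json
    import numpy as np

    # Persist: one .npy file for the vectors, one JSON file mapping
    # row index -> raw paragraph text and source document
    embeddings = np.random.rand(100_000, 768).astype("float32")   # placeholder vectors
    metadata = [{"doc_id": f"doc_{i // 10}", "text": f"paragraph {i}"}
                for i in range(100_000)]
    np.save("embeddings.npy", embeddings)
    with open("metadata.json", "w") as f:
        json.dump(metadata, f)

    # Later: load everything back and search
    vectors = np.load("embeddings.npy")
    vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)      # normalise once

    def most_similar(query_vec, k=10):
        query_vec = query_vec / np.linalg.norm(query_vec)
        sims = vectors @ query_vec                                  # cosine similarity
        best = np.argsort(sims)[::-1][:k]
        return [(metadata[i], float(sims[i])) for i in best]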

Fatemeh Rahimi