How does the vector space model differ from traditional B-tree indexes?

Question

I've asked the following question on StackOverflow: https://stackoverflow.com/questions/71819288/mongodb-vs-elasticsearch-indexing-parallel-arrays

I think that some theoretical background on this subject would help me grasp the context better, and maybe I'll be able to answer my own question.

Note: I've changed the title; here are some more comments for clarification.

How do Elasticsearch-like solutions differ from traditional databases in how they store, retrieve, and model data? Most importantly, if I only want to do exact matching and filtering, is there a real benefit to using such a technology, assuming I have already designed my NoSQL database indexes well?

score 1 · Answer 1 · answered Jul 30 '24 at 06:51

Think of a simple example. You have a few books about "sports" and a few books about "movies". You have two options: 1) put the full contents of each book in a DB (say, a table with columns bookid, bookname, content), or 2) create an inverted index, or place the books in an embedding space as in a Vector Space Model (VSM).

If you choose the first option, fetching the books that contain the word "hockey" will be extremely expensive, because without a suitable index the content of every book has to be scanned. On the other hand, if you build an inverted index that maps every word to a list of books, as in: hockey -> bookname1, bookname2, ..., then a search for "hockey" can retrieve the list of books almost instantly.

You could go a step further and build a vector space where each dimension is a word and each book is a vector built from its bag of words. This way, if you want to know which books are similar, you can simply check how close they are in the n-dimensional space (using cosine similarity or similar). This is the basic idea.
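A minimal sketch of the word -> books mapping described above. The book names and contents are made-up sample data, and the whitespace tokenizer is a simplifying assumption; real search engines do far more text analysis.

```python
from collections import defaultdict

# Hypothetical sample data: book name -> content
books = {
    "bookname1": "Ice hockey is played on a rink",
    "bookname2": "Field hockey and cricket are popular sports",
    "bookname3": "A history of silent movies",
}


def build_inverted_index(docs):
    """Map every lowercased word to the set of books that contain it."""
    index = defaultdict(set)
    for name, content in docs.items():
        for word in content.lower().split():
            index[word].add(name)
    return index


index = build_inverted_index(books)

# One dictionary lookup instead of scanning every book's content.
hockey_books = sorted(index.get("hockey", set()))
print(hockey_books)
```

The cost of building the index is paid once up front; each later search is a single lookup regardless of how long the books are.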
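To make the bag-of-words idea concrete, here is a small sketch that represents each book as word counts and compares them with cosine similarity. The sample sentences and function names are hypothetical, and raw term counts are used for simplicity (weighting schemes such as TF-IDF are a common refinement).

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a text as word counts; each distinct word is one dimension."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical one-line "books"
sports_1 = bag_of_words("hockey players skate on ice")
sports_2 = bag_of_words("ice hockey teams have six players")
movies_1 = bag_of_words("the film won an award for best director")

sim_sports = cosine_similarity(sports_1, sports_2)  # shares three words
sim_mixed = cosine_similarity(sports_1, movies_1)   # shares no words
print(sim_sports, sim_mixed)
```

Two texts that share many words point in a similar direction and score close to 1, while texts with no words in common score 0.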

Rinkesh P · Answer 2 · 2022-04-13T04:14:24.390

*In response to the original question, inverted index vs btree indexes*

Let's look at what these two types of indexes are.

Forward Index/Index

Here the search key (the attribute on which the index is built) is the name of the document. Consider a telephone book, where names are sorted alphabetically. When searching for the phone number of a person whose name starts with N, you go directly to the page where the names starting with N begin and then continue searching until you find the name you are looking for.

Inverted Index

Here the search key is a part of the content of the documents. Take, for instance, the index at the back of a reference book. The keywords are listed alphabetically, and the index shows the page numbers on which a given keyword appears. In this case too, if you want to find where a keyword starting with P occurs, you go directly to the page where the keywords for P are listed and then search further.

Knowing both of these ideas, you might ask whether you can use a forward index to find the pages where a keyword occurs, or an inverted index to find a person's phone number. The answer is yes: in theory, you can search for anything using any type of index. However, the main purpose of an index is to make lookups take less time, and whether it does depends on what you are trying to search for and which index you are using.

Consider a database that stores reviews for a movie by area. Assume it has 3 columns: id, area_code, review. id is the primary key, area_code is a unique integer representing some area, and review is a text review.

Let's say you want to find out what the users in area 007 think about your movie. In this case you create an index on the column area_code and can easily find the reviews for that area.

Now consider that you want to find out the number of users who gave your movie a positive review. A review is text of highly variable length and content, with a lot of redundancy. But you assume that certain terms like "excellent", "mind bending", "masterpiece", etc. are likely to appear in a positive review. So here you use an inverted index, which tells you which reviews contain those positive terms, and from that you can get a rough estimate of the count.

And finally, you can combine both scenarios: if you want to find the good reviews from a particular area, you use both indexes.
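The combined lookup can be sketched with two in-memory dictionaries. Everything here is illustrative: the rows are invented, the tokenizer only splits on whitespace and commas, and multi-word phrases such as "mind bending" are left out because single-word tokens cannot match them.

```python
from collections import defaultdict

# Hypothetical rows: (id, area_code, review)
rows = [
    (1, 7, "An excellent film, a true masterpiece"),
    (2, 7, "Boring and far too long"),
    (3, 12, "Excellent acting and a clever plot"),
    (4, 12, "Not my kind of movie"),
]

# Assumed list of single-word positive terms
POSITIVE_TERMS = {"excellent", "masterpiece"}

by_area = defaultdict(set)  # index on area_code: area_code -> review ids
by_term = defaultdict(set)  # inverted index on review text: word -> review ids

for row_id, area_code, review in rows:
    by_area[area_code].add(row_id)
    for word in review.lower().replace(",", " ").split():
        by_term[word].add(row_id)

# Reviews containing at least one positive term (union of posting sets).
positive_ids = set().union(*(by_term.get(t, set()) for t in POSITIVE_TERMS))

# Good reviews from one area: intersect the results of both indexes.
positive_in_area_7 = by_area.get(7, set()) & positive_ids
print(positive_ids, positive_in_area_7)
```

Each index answers its own question cheaply, and the final answer is just a set intersection of the two result lists.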

This is a theoretical description of what these indexes are and how they are used. But when you implement them, you need to think about how they will perform: you want to reduce disk accesses or, as in my example at the beginning, the number of pages you have to go through. One way to approach this is to use B-trees.

You can read up on B-trees anywhere, but to answer your question: both the forward and the inverted index can use B-trees, because in both cases you have an index file that needs to be accessed efficiently, and a B-tree lets you do exactly that.
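As a rough illustration of why a sorted structure helps either kind of index, the sketch below uses a sorted list with binary search as a stand-in for a B-tree. This is a simplification, not a B-tree implementation: a real B-tree keeps its sorted keys in disk-page-sized nodes, but the lookup idea (jump to the right place, then scan forward) is the same. The terms and posting lists are made-up sample data.

```python
import bisect

# Sorted keys of a hypothetical inverted index, with a parallel list of
# posting lists (the ids of the documents containing each term).
terms = ["excellent", "hockey", "masterpiece", "movies", "sports"]
postings = [[1, 3], [5], [1], [8, 9], [5, 6]]


def lookup(term):
    """Exact-match lookup by binary search: O(log n) comparisons."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []


def prefix_scan(prefix):
    """Jump to the first key >= prefix, then read forward while it matches."""
    start = bisect.bisect_left(terms, prefix)
    matches = []
    for term in terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches


print(lookup("hockey"))
print(prefix_scan("m"))
```

The same two functions would work unchanged if the keys were person names and the values phone numbers, which is the sense in which one underlying structure can serve both forward and inverted indexes.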
