
I am teaching myself Information Retrieval from Christopher Manning's book (PDF link: http://nlp.stanford.edu/IR-book/pdf/01bool.pdf). I tried Exercise 1.13:

"Try using the Boolean search features on a couple of major web search engines. For instance, choose a word, such as burglar, and submit the queries (i) burglar, (ii) burglar AND burglar, and (iii) burglar OR burglar. Look at the estimated number of results and top hits. Do they make sense in terms of Boolean logic? Often they haven’t for major search engines. Can you make sense of what is going on?"

By my knowledge of Boolean logic, the number of results should be like this:

burglar AND burglar <= burglar OR burglar = burglar
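For reference, this relation follows from set semantics: under strict Boolean retrieval each term maps to a set of documents, AND is set intersection and OR is set union, so applying either to the same set returns that set unchanged. Here is a minimal sketch, assuming a made-up four-document corpus and a hypothetical `build_index` helper (neither is from the book):

```python
from collections import defaultdict

# Hypothetical toy corpus; the IDs and texts are invented for illustration.
DOCS = {
    1: "the burglar broke a window",
    2: "police caught the burglar and his partner",
    3: "a quiet night or so it seemed",
    4: "burglar alarm installed",
}

def build_index(docs):
    """Map each lowercase term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

INDEX = build_index(DOCS)

def postings(term):
    """Return the set of document IDs for a term (empty set if unseen)."""
    return INDEX.get(term.lower(), set())

single = postings("burglar")
both = postings("burglar") & postings("burglar")    # burglar AND burglar
either = postings("burglar") | postings("burglar")  # burglar OR burglar

print(len(single), len(both), len(either))
```

In this toy model all three counts agree, which is what strict Boolean logic predicts.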

But this isn't so. In fact, on Google, it is:

burglar > burglar OR burglar > burglar AND burglar

So, what exactly is happening behind the scenes? Any pointers?

Note: This is NOT a homework problem, even though it is from the exercise of a textbook.

DaL

3 Answers


Nice question!

An exact answer would require looking at the search engine's source code, but here is a possible explanation.

I ran the queries on Google:

  • burglar 33,800,000
  • burglar AND burglar 29,200,000
  • burglar OR burglar 26,500,000

The results indeed do not respect the expected Boolean relation burglar AND burglar <= burglar OR burglar = burglar.

The likely reason is that the search engine doesn't process "and" and "or" as binary operators but as ordinary search tokens. Searching for them on their own gives:

  • And 25,270,000,000
  • Or 16,320,000,000

A term on its own matches the most documents. Since "and" is more common than "or", a term together with "and" matches more documents than the same term together with "or".
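This hypothesis can be illustrated with a toy model. The sketch below assumes the engine requires every query word, including "and" and "or", to occur in the document as an ordinary token. The corpus and the `count_hits` helper are hypothetical, and a real engine is certainly more elaborate:

```python
# Hypothetical toy corpus, invented for illustration only.
DOCS = [
    "the burglar fled",
    "a burglar and a thief",
    "burglar and alarm",
    "burglar or thief",
    "cats and dogs",
]

def count_hits(query, docs=DOCS):
    """Count documents that contain every query word as a plain token.

    Assumption: "AND" and "OR" are treated like any other word, and
    repeated query words are collapsed into one.
    """
    words = set(query.lower().split())
    return sum(1 for text in docs if words <= set(text.lower().split()))

for q in ("burglar", "burglar AND burglar", "burglar OR burglar"):
    print(q, count_hits(q))
```

Because "and" occurs in more of these documents than "or", the counts fall in the order burglar > burglar AND burglar > burglar OR burglar, the same ordering as the Google numbers above.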

Note that

  • burglar burglar 29,000,000

Apparently this query looks for documents in which the term appears twice.

By the way, Google's search operators documentation claims that "OR" should indeed act as a binary operator. You found a case in which it fails to do so.

Note that this behaviour is specific to each search engine. On Bing you get the following results:

  • burglar 4,400,000
  • burglar AND burglar 1,610,000
  • burglar OR burglar 1,610,000
  • And 10,400,000,000
  • Or 3,750,000,000
  • burglar burglar 1,610,000

The number of results is the same for "burglar AND burglar", "burglar OR burglar" and "burglar burglar", even though "And" is more common than "Or". It seems that Bing simply removes "And" and "OR" from the query, possibly as stop words.

Bing's documentation suggests the operators "&&" for "and" and "||" for "or":

  • burglar || burglar 4,400,000 (= burglar)
  • burglar && burglar 1,610,000 (= burglar burglar)

These results fit the claim that when a term appears twice in the search query it should appear at least twice in the document too.
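A minimal sketch of that reading, assuming a hypothetical matcher that compares term counts: a document matches only if it contains each query term at least as many times as the query does. The corpus and the name `matches_with_multiplicity` are invented for illustration and are not taken from Bing's documentation:

```python
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
DOCS = [
    "burglar arrested",
    "the burglar met another burglar",
    "burglar burglar burglar",
    "no crime here",
]

def matches_with_multiplicity(query, text):
    """True if each query term occurs in text at least as often as in the query."""
    need = Counter(query.lower().split())
    have = Counter(text.lower().split())
    return all(have[term] >= n for term, n in need.items())

def count_hits(query, docs=DOCS):
    """Count the documents that satisfy the multiplicity rule."""
    return sum(matches_with_multiplicity(query, d) for d in docs)

print(count_hits("burglar"))
print(count_hits("burglar burglar"))
```

Under this assumption, repeating a term can only shrink the result set, which is consistent with "burglar burglar" returning fewer results than "burglar".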

DaL

Google used to support this, to some extent. For a long time, +word could be used to require the presence of a word. So "a AND b" would be "+a +b", whereas "a OR b" would be "a b" (with a preference for both occurring).
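The difference between the two query forms can be sketched as follows. This is a hypothetical illustration of required versus optional terms, not Google's implementation: the `search` function, the toy documents, and the ranking by number of matched optional terms are all assumptions.

```python
# Hypothetical toy documents, invented for illustration only.
DOCS = ["apple pie", "banana bread", "apple banana smoothie", "cherry tart"]

def search(query, docs=DOCS):
    """Return matching documents, best first.

    Words prefixed with '+' are required. Other words are optional:
    they only raise the rank, unless the query has no required words,
    in which case at least one of them must match.
    """
    required, optional = set(), set()
    for word in query.lower().split():
        if word.startswith("+"):
            required.add(word[1:])
        else:
            optional.add(word)

    results = []
    for text in docs:
        tokens = set(text.lower().split())
        if not required <= tokens:
            continue  # a required term is missing
        matched = len(optional & tokens)
        if not required and matched == 0:
            continue  # nothing in the query matched at all
        results.append((matched, text))

    # Stable sort: documents matching more optional terms come first.
    results.sort(key=lambda pair: -pair[0])
    return [text for _, text in results]

print(search("+apple +banana"))  # behaves like "apple AND banana"
print(search("apple banana"))    # behaves like "apple OR banana", both preferred
```

With these toy documents, the "+" query returns only the document containing both words, while the plain query returns every document containing either word and ranks the one containing both first.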

But people did not use it much, so they eventually removed it.

Google thinks it is more important to be able to process natural language queries than to support a mathematical formalism that fewer than 0.1% of users understand.

There are also other hypotheses about why it was removed; see "Why was the Plus Sign (+) removed as a Search Operator?"

Has QUIT--Anony-Mousse

This question has much broader implications. E-commerce is greatly hindered by pi$$ poor search engines.

Basic common sense says that as the number of search terms increases, the number of results should decrease. As a helpful adjunct, a search engine could/should give the advice "to get more hits please reduce the number of search terms".

Instead, it seems the fad amongst current programmers is that the mark of a good search engine is that it produces lots of hits. It then leaves the searcher having to go through pages and pages of hits to find what s/he is looking for. When they get bored and leave, the sale is lost. I worked at Ace Hardware for a while. I had access to two search engines, the public one and the retailer-only version. Both were horrific. Often I would spend 15 minutes trying to find something I knew for certain was available, but that was hidden by the search stupidity.

Instead of fixing such stupidity, modern programmers try to patch the flaw with various buttons to narrow down the search results. But that is a futile effort. In some cases it provides an adequate solution, but it is still poor architecture. It is also slowly rotting the brains of users, who come to assume there is no better way.

Another inadequate "fix" used for e-retail is to use "properties" to group items together. The problem is that the properties are sometimes assigned by the manufacturers, sometimes by a bored data entry person. Some were entered 30 years ago, and some today. This approach can only work if you have Information Scientists (IS or Library Science, not IT) assign the properties uniformly across all products in a given dataset. Only in that case is it a good solution (if I were younger I would go get an IS degree, and my master's would be to find a retailer like Ace and demonstrate how sales could improve by doing so). In the meanwhile, this anti-search fad is causing massive loss of sales. It is creating lots of jobs and billions of lines of code to solve a problem that should not even exist.

Getting off my soapbox. I found this article while researching this. My guess is that data sets are just too big to produce Boolean results in real time, and thus the whole house of cards.