
I asked a similar question about the distance between "documents" (Wikipedia articles, news stories, etc.). I am making this a separate question because search queries are considerably shorter and considerably noisier than documents, so I don't know (and doubt) whether the same distance metrics would apply here.

Either vanilla lexical distance metrics or state-of-the-art semantic distance metrics would be welcome, with a stronger preference for the latter.

Matt

2 Answers


The cosine similarity metric does a good (if not perfect) job of controlling for document length, so comparing two documents or two queries using the cosine metric with tf-idf weights for the words should work well in either case. I would also recommend doing LSA on the tf-idf weights first, and then computing the cosine distances/similarities in the reduced space.
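For example, here is a minimal sketch of both variants with scikit-learn; the query strings and the number of LSA components are just placeholders for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy queries standing in for your real query log
queries = [
    "cheap flights to paris",
    "low cost airfare paris",
    "python list comprehension tutorial",
]

# tf-idf term weights for each query
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(queries)

# Cosine similarity directly on the tf-idf vectors
print(cosine_similarity(tfidf))

# Optionally reduce to a latent semantic space (LSA) first,
# then compute cosine similarity there
lsa = TruncatedSVD(n_components=2)  # tiny value because the toy corpus is tiny
reduced = lsa.fit_transform(tfidf)
print(cosine_similarity(reduced))
```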

If you are trying to build a search engine, I would recommend using a free open-source search engine such as Solr or Elasticsearch, or the raw Lucene libraries, as they do most of the work for you and have good built-in methods for handling the query-to-document similarity problem.

Simon

In my experience, only some classes of queries can be classified on lexical features (due to the ambiguity of natural language). Instead, you can try using Boolean search results (sites or segments of sites, not documents, without ranking) as features for classification, rather than the words themselves. This approach works well for classes where a query is highly lexically ambiguous but many good sites are relevant to it (e.g. movies, music, commercial queries, and so on).
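A rough sketch of that idea, assuming you already have, for each query, the set of sites an unranked Boolean search returns (the site sets below are invented for illustration):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two site sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Query -> sites returned by an unranked Boolean search (hypothetical data)
query_sites = {
    "batman returns": {"imdb.com", "rottentomatoes.com", "wikipedia.org"},
    "batman 1992 movie": {"imdb.com", "wikipedia.org", "fandom.com"},
    "bat biology": {"nationalgeographic.com", "wikipedia.org"},
}

pairs = [("batman returns", "batman 1992 movie"),
         ("batman returns", "bat biology")]
for q1, q2 in pairs:
    # Queries returning similar sites are treated as similar,
    # even if they share few words
    print(q1, "vs", q2, "->", jaccard(query_sites[q1], query_sites[q2]))
```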

Also, for offline classification you can run LSI on a query-site matrix. See the book "Introduction to Information Retrieval" for details.
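For instance, a small sketch of LSI on such a matrix, assuming each row is a query and each column a site (1 if the site was returned for the query); the incidence matrix here is made up:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Rows: queries, columns: sites (hypothetical incidence data)
query_site = np.array([
    [1, 1, 1, 0, 0],   # query 1
    [1, 1, 0, 1, 0],   # query 2
    [0, 0, 0, 1, 1],   # query 3
])

# LSI: project queries into a low-dimensional latent space
lsi = TruncatedSVD(n_components=2)
latent = lsi.fit_transform(query_site)

# Compare queries in the latent space
print(cosine_similarity(latent))
```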

Alx49