4

I have developed an algorithm that recommends geographical locations to users based on popular trends and their own interests. The dataset was created by my organization. The user selects a few categories, and based on his interests and other people's ratings, he is presented with recommended places.

How do you evaluate or choose metrics for such a system, given that ground truth doesn't make sense in this case?

EDIT: I meant: how do you evaluate the accuracy or quality of the results in such a case, especially for published work?

EDIT 2 (as per the request for details in the comments)

Details of the system

  1. The user indicates his preferences from a predefined set of tags (cultural, mountains, etc.).

  2. Users can also rate different places on a scale of 1-5.

  3. The user's profile (geographical location, places already visited, etc.) is already stored.

  4. Based on the user's choices and other heuristics (ratings etc.), a set of places is recommended by the system (a rough sketch of this kind of scoring follows the list).
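For concreteness, here is a minimal sketch of the kind of scoring such a system might use. The tag-match-times-average-rating heuristic, the `recommend` function, and the sample data are illustrative assumptions, not the actual algorithm:

```python
# A hypothetical sketch of the recommendation step described above:
# score each place by how well its tags match the user's chosen tags,
# weighted by its average rating from other users. The data and the
# scoring rule are illustrative assumptions, not the real system.

def recommend(user_tags, places, top_n=3):
    """Return the top_n places ranked by tag overlap times average rating."""
    def score(place):
        tag_match = len(user_tags & place["tags"]) / max(len(user_tags), 1)
        avg_rating = sum(place["ratings"]) / max(len(place["ratings"]), 1)
        return tag_match * avg_rating          # simple heuristic, scale 0-5
    return sorted(places, key=score, reverse=True)[:top_n]

places = [
    {"name": "old_town",   "tags": {"cultural", "historic"}, "ratings": [5, 4, 4]},
    {"name": "high_peak",  "tags": {"mountains", "hiking"},  "ratings": [5, 5]},
    {"name": "mall_plaza", "tags": {"shopping"},             "ratings": [3, 2]},
]

for p in recommend({"cultural", "mountains"}, places, top_n=2):
    print(p["name"])   # high_peak, old_town
```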

krammer
  • 305
  • 2
  • 10

2 Answers

3

One of the best ways to benchmark such large systems is to use a service like Amazon Mechanical Turk (AMT). First, you implement a user interface where everyday users can specify whether or not the recommendations are relevant. Then you submit this to AMT and let many people try it out.

You could then further process the results from AMT to benchmark your system and obtain a quantification of its accuracy.
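For example, once relevance judgments have been collected, a precision-style score can be computed from them. The sketch below assumes each recommended place was labeled relevant (1) or not (0) by several workers; the `judgments` data, the majority-vote rule, and `precision_at_k` are illustrative, not part of AMT's API:

```python
# A minimal sketch of turning AMT relevance judgments into a quantitative score.
# Assumes each recommended place was labeled relevant (1) or not relevant (0)
# by several workers; the data and names here are illustrative.

def precision_at_k(ranked_places, relevant_places, k):
    """Fraction of the top-k recommended places judged relevant by workers."""
    top_k = ranked_places[:k]
    hits = sum(1 for place in top_k if place in relevant_places)
    return hits / k

# judgments[place] = list of 0/1 votes collected from AMT workers
judgments = {
    "old_town":   [1, 1, 0],
    "blue_lake":  [1, 1, 1],
    "mall_plaza": [0, 0, 1],
    "high_peak":  [1, 0, 1],
}

# Majority vote turns noisy worker labels into a single relevance decision.
relevant = {p for p, votes in judgments.items() if sum(votes) > len(votes) / 2}

# The order in which the recommender presented the places.
ranked = ["blue_lake", "mall_plaza", "old_town", "high_peak"]

print("precision@2 =", precision_at_k(ranked, relevant, k=2))  # 0.5
print("precision@4 =", precision_at_k(ranked, relevant, k=4))  # 0.75
```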

Fei-Fei Li's team used a similar technique for benchmarking their fine-grained segmentation algorithm. The paper is here: http://vision.stanford.edu/pdf/DengKrauseFei-Fei_CVPR2013.pdf

Naturally, using AMT won't give you a real-time solution where you get a benchmark every time a recommendation occurs, but you can prepare a subset of your dataset and let AMT workers evaluate the system using the GUI you provide. A sample data collection system is described here: http://ai.stanford.edu/~jkrause/papers/fgvc13.pdf

I have no intention of advertising it, but AMT is also an economical workaround.

Tolga Birdal
  • 537
  • 3
  • 14
1

The basic strategy in [supervised] machine learning is to split the data into a "training set" and a "test set". The algorithm has no access to the test set during training; the test set acts something like a scientific control group. The algorithm attempts to learn a pattern from the training data, and you then compare its performance on the data it was "blind" to during training (also called "held out"). If genuine learning/"generalization" occurred, performance on the test data will be similar to performance on the training data. When there is a large drop in performance from training to test data, it is called "overfitting" or "memorization": the algorithm learned irrelevant details of the training data and did not generalize. A common rule of thumb for the training/test balance is 80% vs 20%, i.e. 4/5 of the data is used for training and 1/5 for testing.
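As an illustration of that split, here is a minimal sketch using scikit-learn on synthetic rating data. The features, the random-forest model, and the RMSE metric are stand-in assumptions for the sake of a runnable example, not the asker's actual system:

```python
# A minimal sketch of the 80/20 split and the overfitting check described
# above, applied to a rating-prediction task. The data and model are
# synthetic stand-ins, not the asker's actual system.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Fake user/place features and 1-5 ratings, just to make the example runnable.
X = rng.random((1000, 8))                      # e.g. user tags + place attributes
y = np.clip(X @ rng.random(8) * 5, 1, 5)       # synthetic "rating" target

# 80% training, 20% held-out test, as in the rule of thumb above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

train_rmse = mean_squared_error(y_train, model.predict(X_train)) ** 0.5
test_rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5

print(f"train RMSE: {train_rmse:.3f}")
print(f"test  RMSE: {test_rmse:.3f}")
# A much larger test RMSE than train RMSE signals overfitting/memorization.
```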

vzn
  • 11,162
  • 1
  • 28
  • 52