Consider the application:

  • We have a set of users and items.
  • Users can perform different action types (think browsing, clicking, upvoting etc.) on different items.
  • Users and items accumulate a "profile" for each action type.
  • For users, such a profile is the list of items on which they have performed a given action type.
  • For items, such a profile is the list of users who have performed a given action type on them (a toy sketch of one such raw profile follows after this list).
  • We assume that accumulated profiles define future actions.
  • We want to predict the action a user will take using supervised learning (classification with probability estimation)
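
To make the setup concrete, here is a toy sketch of what one raw profile looks like as a feature row (the item ids and catalogue size below are made up):

    import numpy as np
    from scipy.sparse import csr_matrix

    N_ITEMS = 5000000  # hypothetical catalogue size

    # One user's "click" profile: the set of item ids they clicked on.
    clicked_item_ids = [17, 90210, 3456789]

    # As a raw feature vector this is a 1 x N_ITEMS indicator row -
    # far too wide to store per user or feed directly to a classifier.
    profile_row = csr_matrix(
        (np.ones(len(clicked_item_ids)),
         ([0] * len(clicked_item_ids), clicked_item_ids)),
        shape=(1, N_ITEMS),
    )
    print(profile_row.shape, profile_row.nnz)  # (1, 5000000) 3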

Consider the following problem:

  • These profiles can be very sparse (millions of items and 100 million users) and it is not feasible to use them directly as features
  • We would like to compute "compressed" profiles (eigenprofiles?:)) with dimensionality < 300 that can then be efficiently stored and fed to different classification algorithms
  • Before you say "Use TruncatedSVD/Collapsed Gibbs Sampling/Random Projections on historical data" bare with me for a second.
  • Enters concept drift.
  • New items and users are being introduced all the time to the system.
  • Old users churn.
  • Old items churn.
  • At some point there are items whose users have mostly never been seen in the historical data, and users whose profiles contain only fresh items.
  • Before you say "retrain periodically", remember that we have a classifier in the pipeline that was trained on the "historic" decomposition; a new decomposition could assign an entirely different "meaning" to the cells of the output vectors (abs(decompose_v1(sample)[0] - decompose_v2(sample)[0]) >> epsilon), rendering that classifier unusable (see the sketch after this list).

Some requirements:

  • The prediction service has to be available 24/7.
  • A single prediction cannot take more than 15 ms and should use at most 4 CPU cores (preferably only one).

Some ideas I had so far:

  1. We could retrain the classifier on the new decomposition.
     • This would mean re-running the decomposition on the whole training dataset (with a snapshot of the profiles at the time of each event we want to predict) and on the whole database (all current profiles), plus storing the results.
     • To make this work we would need a second database for the decomposed profiles, hot-swapped once the retrained model is ready and all profiles have been decomposed.
     • This approach is quite inefficient in both computation and storage (and the storage is expensive, because retrieval has to be super fast).

  2. We could retrain the classifier as in idea 1, but do the decomposition ad hoc (a rough fold-in sketch follows after this list).
     • This puts a lot of constraints on the speed of the decomposition (it has to have sub-millisecond computation times for a single sample).
     • It does a lot of redundant computation (especially for item profiles) unless we add an extra caching layer.
     • It avoids the redundant storage and the redundant computation for churned users/items, at the cost of extra prediction latency and extra caching-layer complexity.

  3. We could use an online learning algorithm such as VFDT or Mondrian Forests for the classifier - no more retraining, plus nice handling of concept drift. <---- Please help me here
     • We would need an online algorithm for the decomposition that satisfies strict requirements: (a) at least a part of the output vectors should be stable between increments (batches); (b) it can introduce new features to account for new variance in the data, but should do so at a controllable rate; (c) it should not break when it encounters new users/items. (A rough sketch of one such combination is the second snippet after this list.)
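
For idea 2, a minimal sketch of how the ad hoc decomposition could work if the historical SVD components are frozen and each up-to-date profile is folded into the existing space at request time (the sizes and the random stand-in for svd.components_ below are made up, and this ignores the question of when the frozen components become stale):

    import time
    import numpy as np
    from scipy.sparse import csr_matrix

    N_ITEMS, K = 100000, 300  # toy sizes; the real catalogue is far larger
    rng = np.random.default_rng(0)

    # Stand-in for the frozen decomposition, i.e. svd.components_.T (N_ITEMS x K).
    Vt = rng.random((N_ITEMS, K), dtype=np.float32)

    def decompose_profile(item_ids, counts):
        """Fold one sparse profile into the existing K-dimensional space."""
        x = csr_matrix((counts, ([0] * len(item_ids), item_ids)), shape=(1, N_ITEMS))
        # Equivalent to TruncatedSVD.transform for this single row; the cost is
        # O(nnz(profile) * K), independent of the catalogue size.
        return x @ Vt

    item_ids = rng.integers(0, N_ITEMS, size=200)
    counts = np.ones(len(item_ids), dtype=np.float32)

    t0 = time.perf_counter()
    z = decompose_profile(item_ids, counts)
    print(z.shape, "%.3f ms" % ((time.perf_counter() - t0) * 1000))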
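
For alternative 3, one candidate combination worth experimenting with (an assumption on my side, not a known complete solution) is feature hashing, so that new users/items never change the input dimensionality (requirement c), followed by an incremental decomposition such as scikit-learn's IncrementalPCA; requirements (a) and (b) are only partially met, since the principal directions still rotate slowly as the data drifts and the number of components stays fixed. All sizes and ids below are made up:

    import numpy as np
    from sklearn.feature_extraction import FeatureHasher
    from sklearn.decomposition import IncrementalPCA

    N_HASH, K = 2 ** 14, 200  # hashed input dimension and target dimensionality (toy sizes)

    hasher = FeatureHasher(n_features=N_HASH, input_type="string")
    ipca = IncrementalPCA(n_components=K)

    def hash_profiles(profiles):
        # Each profile is an iterable of item-id strings; hashing keeps the input
        # dimensionality fixed no matter how many new items appear.
        return hasher.transform(profiles)

    rng = np.random.default_rng(0)

    # Simulated mini-batches of user profiles arriving over time.
    for step in range(3):
        batch = [["item_%d" % i for i in rng.integers(0, 10 ** 7, size=30)]
                 for _ in range(512)]
        X = hash_profiles(batch).toarray()  # IncrementalPCA wants dense batches
        ipca.partial_fit(X)                 # incremental update, no full retrain

    # Decompose a fresh profile with the current state of the model.
    z = ipca.transform(hash_profiles([["item_42", "item_123456"]]).toarray())
    print(z.shape)  # (1, 200)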

Questions/points of action:

  • Please evaluate my proposed solutions and propose alternatives
  • Please provide algorithms suitable for online learning and online decomposition (if they exist), as described in alternative 3 - preferably with efficient Python/Scala/Java implementations and a sufficient layer of abstraction to use them in a web service (Python scripts that take a text file as a dataset would be much less valuable than scikit-learn modules).
  • Please provide links to relevant literature that dealt with similar problems/describes algorithms that could be suitable
  • Please share experiences/caveats/tips that you learned while dealing with similar problems

Some background reading that you may find useful:

Disclaimer: Our application is not strictly ad-conversion prediction, and some problems, such as rarity, do not apply. The event we would like to predict has 8 classes and occurs in roughly 0.3%-3% of the cases where a user browses an item.

JohnnyM

1 Answer

My take:

  • I agree with the issues raised in 1., so not much to add here - retraining and re-storing everything is indeed inefficient
  • Vowpal Wabbit http://hunch.net/~vw/ would be my first choice
  • stability of the output between increments is really more a property of the data than of the algorithm - if you have plenty of variation on the input, you won't have that much stability on the output (at least not by default)
  • hashing can take care of the variation - you can control it with a combination of three parameters: the size of the hashing table and the L1/L2 regularization
  • same for the new features / users (I think - in most of the applications where I used it, a record represented a user clicking or not, so new users / ads were sort of treated "the same")
  • normally I use VW from the command line, but an example approach (not too elegant) for controlling it from Python is given here: http://fastml.com/how-to-run-external-programs-from-python-and-capture-their-output/ (a minimal sketch is at the end of this answer)
  • if you prefer something purely in Python, then a version (without decomposition) of an online learner in the Criteo spirit can be found here: https://www.kaggle.com/c/tradeshift-text-classification/forums/t/10537/beat-the-benchmark-with-less-than-400mb-of-memory
  • I am not sure how to handle the concept drift - I haven't paid that much attention to it so far beyond rolling statistics: for the relevant variables of interest, keep track of the mean / count over the most recent N periods. It is a crude approach, but it does seem to get the job done in terms of capturing the lack of "stationarity" (a tiny sketch is at the end of this answer)
  • helpful trick 1: do a single pass over the data before the first run to create a per-feature dictionary and flag certain values as rare (lump them into a single value)
  • helpful trick 2: ensemble predictions from more than one model (varying the interaction order and learning rate)
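
A minimal sketch of driving VW from Python with the knobs mentioned above (the file names, namespaces and hyperparameter values are made up; -b sets the hashing-table size in bits, --l1/--l2 the regularization, and --oaa 8 covers the 8 classes from the question):

    import subprocess

    # Hypothetical training file in VW's native format, one example per line, e.g.
    #   3 |u user_12345 |i item_678 item_901
    # where the label 1..8 is the action class to predict.
    TRAIN, MODEL = "train.vw", "model.vw"

    subprocess.check_call([
        "vw", "-d", TRAIN,
        "--oaa", "8",                    # one-against-all over the 8 classes
        "-b", "24",                      # 2^24-bucket hashing table
        "--l1", "1e-7", "--l2", "1e-7",  # regularization
        "--loss_function", "logistic",
        "-f", MODEL,                     # save the model for serving
    ])

    # At serving time the same binary can be kept running as a daemon
    # (vw --daemon -t -i model.vw ...) and fed single examples over a socket,
    # which should fit comfortably within the 15 ms / single-core budget.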
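
And one crude way to implement the rolling statistics (the variable names and period counts are just an example):

    import collections

    class RollingStat:
        """Rolling mean / count of a variable over the last N periods."""

        def __init__(self, n_periods):
            self.periods = collections.deque(maxlen=n_periods)

        def update(self, period_sum, period_count):
            self.periods.append((period_sum, period_count))

        def mean(self):
            total = sum(s for s, _ in self.periods)
            count = sum(c for _, c in self.periods)
            return total / count if count else 0.0

    # e.g. an item's click-through rate over its last 7 daily periods
    ctr = RollingStat(n_periods=7)
    for clicks, views in [(3, 950), (1, 700), (6, 1200)]:
        ctr.update(clicks, views)
    print(ctr.mean())
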
kpb