I'm working on an application that is a "predictive-model-as-a-service", structured as follows:
- train a model offline
- periodically upload model parameters to a "prediction server"
- the prediction server takes as input a single observation, and outputs a prediction
I'm trying to use patsy, but I'm running into the following problem: when a single observation comes in, how do I convert it to the right shape so that it looks like a row of the training data?
The patsy documentation provides an example when the DesignInfo from the training data is available in memory: http://patsy.readthedocs.io/en/latest/library-developers.html#predictions
# offline model training
import patsy
data = {'animal': ['cat', 'cat', 'dog', 'raccoon'], 'cuteness': [3, 6, 10, 4]}
eq_string = "cuteness ~ animal"
dmats = patsy.dmatrices(eq_string, data)
# dmats is (outcome matrix, predictor matrix); keep the predictor side's DesignInfo
design_info = dmats[1].design_info
trained_model = train_model(dmats)  # train_model is a placeholder for my own training code
# online predictions
input_data = {'animal': ['raccoon']}
# if the DesignInfo were available, I could do this:
new_dmat = patsy.build_design_matrices([design_info], input_data)
make_prediction(new_dmat, trained_model)  # make_prediction is a placeholder too
Printing new_dmat then gives:
[DesignMatrix with shape (1, 3)
Intercept animal[T.dog] animal[T.raccoon]
1 0 1
Terms:
'Intercept' (column 0)
'animal' (columns 1:3)]
Notice that this row has the same shape as a row of the training data; it has a column for animal[T.dog] even though the new observation is a raccoon. In my application, the prediction server has no access to the DesignInfo, so it cannot build the DesignMatrix for the new data this way. Concretely, how would the prediction server know how many other categories of animal are in the training data, and in what order?
I thought I could just pickle the DesignInfo, but it turns out this isn't supported yet: https://github.com/pydata/patsy/issues/26
I could also simply persist the matrix columns as a string and rebuild the matrix from that online, but this seems a bit fragile.
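To make that fallback concrete, here is a minimal sketch of what I have in mind. It does not use patsy at all: `export_levels` and `build_row` are hypothetical helpers of my own, and the sketch assumes an intercept plus treatment-coded categorical columns only, with levels in sorted order (which matches the column order in the output above). Anything more involved in the formula, such as interactions or transforms, would not be covered.

```python
import json


def export_levels(training_data, categorical_columns):
    """Offline: record each categorical column's levels in sorted order as JSON."""
    return json.dumps(
        {col: sorted(set(training_data[col])) for col in categorical_columns}
    )


def build_row(levels_json, observation):
    """Online: rebuild one treatment-coded row from the persisted levels."""
    levels = json.loads(levels_json)
    names, row = ["Intercept"], [1.0]
    for col, col_levels in levels.items():
        value = observation[col]
        if value not in col_levels:
            raise ValueError(f"unseen level {value!r} for column {col!r}")
        # The first level is the reference category and gets no column.
        for level in col_levels[1:]:
            names.append(f"{col}[T.{level}]")
            row.append(1.0 if value == level else 0.0)
    return names, row


# offline: persist the levels alongside the model parameters
data = {'animal': ['cat', 'cat', 'dog', 'raccoon'], 'cuteness': [3, 6, 10, 4]}
levels_json = export_levels(data, ["animal"])

# online: only the JSON string is needed to shape a new observation
names, row = build_row(levels_json, {"animal": "raccoon"})
print(names)
print(row)
```

The fragility I'm worried about is visible here: the column naming and the choice of reference level are reimplemented by hand, so they can silently drift from whatever patsy did at training time.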
Is there a good way to do this?