How to persist patsy DesignInfo?

Question

I'm working on an application that is a "predictive-model-as-a-service", structured as follows:

1. train a model offline
2. periodically upload model parameters to a "prediction server"
3. the prediction server takes as input a single observation, and outputs a prediction

I'm trying to use patsy, but running into the following problem: When a single prediction comes in, how do I convert it to the right shape such that it looks like a row of the training data?

The patsy documentation provides an example when the DesignInfo from the training data is available in memory: http://patsy.readthedocs.io/en/latest/library-developers.html#predictions

# offline model training
import patsy

data = {'animal': ['cat', 'cat', 'dog', 'raccoon'], 'cuteness': [3, 6, 10, 4]}
eq_string = "cuteness ~ animal"


dmats = patsy.dmatrices(eq_string, data)
design_info = dmats[1].design_info
trained_model = train_model(dmats)  # train_model is a placeholder for the fitting code


# online predictions
input_data = {'animal': ['raccoon']}

# if the DesignInfo were available, I could do this:
new_dmat = patsy.build_design_matrices([design_info], input_data)
make_prediction(new_dmat, trained_model)

And then the output:

[DesignMatrix with shape (1, 3)
   Intercept  animal[T.dog]  animal[T.raccoon]
           1              0                  1
   Terms:
     'Intercept' (column 0)
     'animal' (columns 1:3)]

Notice that this row is the same shape as the training data; it has a column for animal[T.dog]. In my application, I don't have a way to access the DesignInfo to build the DesignMatrix for the new data. Concretely, how would the prediction server know how many other categories of animal are in the training data and in what order?

I thought I could just pickle it, but it turns out this isn't supported yet: https://github.com/pydata/patsy/issues/26

I could also simply persist the matrix columns as a string and rebuild the matrix from that online, but this seems a bit fragile.
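To make the "persist the columns" idea concrete, here is a minimal hand-rolled sketch that uses only the standard library. It assumes a single categorical predictor with treatment coding and sorted levels, which matches the output shown above; the helper names `fit_levels` and `encode_row` are hypothetical and not part of patsy. It also illustrates why the approach feels fragile: the encoding logic has to be reimplemented and kept in sync with the formula by hand.

```python
import json


def fit_levels(values):
    """Return the distinct category levels seen in training, sorted."""
    return sorted(set(values))


def encode_row(value, levels):
    """Treatment-code one observation: an intercept followed by one
    indicator per non-reference level (the first level is the reference)."""
    if value not in levels:
        raise ValueError(f"unseen category: {value!r}")
    return [1.0] + [1.0 if value == level else 0.0 for level in levels[1:]]


# Offline: record the levels and the resulting column names as JSON.
levels = fit_levels(['cat', 'cat', 'dog', 'raccoon'])
columns = ['Intercept'] + [f'animal[T.{level}]' for level in levels[1:]]
payload = json.dumps({'levels': levels, 'columns': columns})

# Online: reload the JSON and encode a single incoming observation.
spec = json.loads(payload)
row = encode_row('raccoon', spec['levels'])
```

Interactions, numeric transforms, or a different contrast coding would each need extra hand-written logic here, so this only covers the simple case in the question.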

Is there a good way to do this?

Can you retain the `design_info` in the server? (It seems like that would happen automatically.) Then the client would just send the server the new `input_data`, and the server would run the `new_dmat` and `make_prediction` lines. Or do you need to be able to shut down and restart the server without retraining it? In that case, it sounds like you'd need to save both the original `dmats` and also the parameters that were found by `train_model()`. Is that what you're looking for? — Matthias Fripp, May 16 '17 at 20:22

score 1 · Answer 1 · answered May 16 '17 at 21:10

Assuming your goal is to be able to restart the server without retraining, it looks like your best option (until patsy implements pickling) would be to pickle data, eq_string and whatever parameters are calculated by train_model. Then upon restarting the server, you could unpickle data and eq_string and call dmats = patsy.dmatrices(eq_string,data) again. This should run pretty fast, since it's not really training a model, just preprocessing your data. Then you would also unpickle the parameters calculated by train_model (not shown in the question), and the server should be ready to make predictions for new inputs.
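A minimal sketch of this workaround, using the data and formula from the question. It assumes the training data is small enough to ship to the server, and it pickles to an in-memory bytes object for brevity where a real deployment would write to a file. The model parameters are left out because `train_model` is not shown in the question.

```python
import pickle

import numpy as np
import patsy

# Offline: persist the raw training data and the formula string.
# Both are plain Python objects, so they pickle without trouble.
data = {'animal': ['cat', 'cat', 'dog', 'raccoon'], 'cuteness': [3, 6, 10, 4]}
eq_string = "cuteness ~ animal"
blob = pickle.dumps({'data': data, 'eq_string': eq_string})

# Online, after a restart: rebuild the design matrices from the saved inputs.
# This is only preprocessing, not model fitting.
saved = pickle.loads(blob)
y, X = patsy.dmatrices(saved['eq_string'], saved['data'])
design_info = X.design_info

# Encode a single incoming observation with the rebuilt DesignInfo.
input_data = {'animal': ['raccoon']}
(new_dmat,) = patsy.build_design_matrices([design_info], input_data)
new_row = np.asarray(new_dmat)
```

Here `new_row` has the same columns as the training matrix, so it can be passed to whatever prediction function consumes the stored parameters.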

Note that if you are splitting this into client and server components, the server should do everything discussed above, and the client should just send it the input_data defined in your question. (The client doesn't ever need to see dmats or design_info.)

score 0 · Answer 2 · answered Apr 24 '22 at 19:05

Is there any progress regarding this issue? I know this is something very much needed.

The issue is still open on GitHub.

Perhaps something simple like this?

import h5py

def save_patsy(patsy_step, filename):
    """Try to save the DesignInfo of a fitted patsy step into a .h5 file."""
    with h5py.File(filename, 'w') as hf:
        # NOTE: h5py stores array-like data, not arbitrary Python objects,
        # so passing a DesignInfo here fails as written.
        hf.create_dataset("design_info", data=patsy_step.design_info_)

def load_patsy(patsy_step, filename):
    """Attach the saved DesignInfo to a patsy step."""
    with h5py.File(filename, 'r') as hf:
        design_info = hf['design_info'][:]
    patsy_step.design_info_ = design_info


# pipe is a fitted pipeline with a step named 'patsy' (not shown)
save_patsy(pipe['patsy'], "clf.h5")

However, this still does not work. But I think it is a first step.

According to the patsy README it is no longer under active development as of August 2021. But it refers to a library called [Formulaic](https://github.com/matthewwardrop/formulaic) which you might want to look into. — exp1orer, Apr 25 '22 at 20:18
