
I've learned machine learning via textbooks and examples, which don't delve into the engineering challenges of working with "big-ish" data like Kaggle's.

As a specific example, I'm working on the New York taxi trip challenge. It's a regression task on roughly 2 million rows and 20 columns.

My 4GB-RAM laptop can just barely handle EDA with pandas and matplotlib in Jupyter Notebook. However, when I try to build a random forest with 1,000 trees, the machine hangs and the Jupyter kernel dies with a restart error.

To combat this, I set up a 16GB-RAM desktop. I then ssh in, start a browser-less Jupyter Notebook kernel, and connect my local Notebook to that kernel. However, I still max out that machine.

At this point, I'm guessing that I need to run my model training code as a script.

  • Will this help with my machine hanging?
  • What's the workflow for storing the trained model and then using it later for prediction? Do you use a Makefile to keep this reproducible?
  • Moving to scripts also sacrifices the interactivity of Jupyter Notebook -- is there a workflow that keeps that interactivity?

My current toolkit is RStudio, Jupyter Notebook, and Emacs, but I'm willing to pick up new things.

Heisenberg

1 Answer

  • Yes - a Python script will have less overhead than a Jupyter Notebook.
  • Pickle is the standard way to store a scikit-learn model; see the model persistence documentation and the sketch below this list.
  • The two primary ways to scale Jupyter Notebooks are vertical (rent a bigger machine from a cloud service provider) or horizontal (spin up a cluster).
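
As a rough illustration of the first two bullets (not a prescribed workflow), training can live in a standalone script that fits the model and pickles it. The file names (`train.csv`, `test.csv`, `rf_model.pkl`), the `trip_duration` target column, and the numeric-only feature selection are assumptions for the taxi data, so adjust them to your actual dataset:

```python
# train.py -- run with `python train.py` instead of inside a notebook.
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("train.csv")                    # assumed file name
y = df["trip_duration"]                          # assumed target column
X = df.drop(columns=["trip_duration"]).select_dtypes("number")

# Start with fewer trees than 1,000 and use all cores; increase
# n_estimators only while memory on the 16GB machine allows it.
model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X, y)

# Persist the fitted model so a separate process can reuse it later.
with open("rf_model.pkl", "wb") as f:
    pickle.dump(model, f)
```

A second script (or an interactive notebook session, which keeps some of the interactivity you mention) can then load the pickle and predict without re-training:

```python
# predict.py -- reuse the persisted model on new data.
import pickle

import pandas as pd

with open("rf_model.pkl", "rb") as f:
    model = pickle.load(f)

# feature_names_in_ is set when the model was fit on a DataFrame
# (scikit-learn >= 1.0); it keeps the test columns aligned with training.
new_data = pd.read_csv("test.csv")[model.feature_names_in_]
predictions = model.predict(new_data)
```

A Makefile with targets that run `python train.py` and `python predict.py` is one simple way to keep the train-then-predict steps reproducible, as you suggest.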
Brian Spiering