
I've learned machine learning via textbooks and examples, which don't delve into the engineering challenges of working with "big-ish" data like Kaggle's.

As a specific example, I'm working on the New York taxi trip challenge. It's a regression task on roughly 2 million rows and 20 columns.

My 4GB-RAM laptop can just barely handle EDA with pandas and matplotlib in Jupyter Notebook. However, when I try to build a random forest with 1,000 trees, the machine hangs and the Jupyter kernel dies with a restart error.

To combat this, I set up a 16GB-RAM desktop. I then ssh in, start a browser-less Jupyter Notebook kernel, and connect my local Notebook to that kernel. However, I still max out that machine.

At this point, I'm guessing that I need to run my model training code as a script.

  • Will this help with my machine hanging?
  • What's the workflow for storing the trained model and then using it later for prediction? Do you use a Makefile to keep this reproducible?
  • Moving to scripts also sacrifices the interactivity of Jupyter Notebook -- is there a workflow that keeps that interactivity?

My current toolkit is RStudio, Jupyter Notebook, and Emacs, but I'm willing to pick up new things.

Heisenberg

1 Answer

  • Yes - a Python script will have less overhead than a Jupyter Notebook.
  • Pickle is the standard way to store a scikit-learn model; see the model persistence documentation and the sketch below this list.
  • The two primary ways to scale Jupyter Notebooks are vertical (rent a bigger machine from a cloud service provider) or horizontal (spin up a cluster).
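
As a rough illustration of the first two bullets (not a prescribed workflow), training can live in a standalone script that fits the model and pickles it. The file names (`train.csv`, `test.csv`, `rf_model.pkl`), the `trip_duration` target column, and the numeric-only feature selection are assumptions for the taxi data, so adjust them to your actual dataset:

```python
# train.py -- run with `python train.py` instead of inside a notebook.
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("train.csv")                    # assumed file name
y = df["trip_duration"]                          # assumed target column
X = df.drop(columns=["trip_duration"]).select_dtypes("number")

# Start with fewer trees than 1,000 and use all cores; increase
# n_estimators only while memory on the 16GB machine allows it.
model = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X, y)

# Persist the fitted model so a separate process can reuse it later.
with open("rf_model.pkl", "wb") as f:
    pickle.dump(model, f)
```

A second script (or an interactive notebook session, which keeps some of the interactivity you mention) can then load the pickle and predict without re-training:

```python
# predict.py -- reuse the persisted model on new data.
import pickle

import pandas as pd

with open("rf_model.pkl", "rb") as f:
    model = pickle.load(f)

# feature_names_in_ is set when the model was fit on a DataFrame
# (scikit-learn >= 1.0); it keeps the test columns aligned with training.
new_data = pd.read_csv("test.csv")[model.feature_names_in_]
predictions = model.predict(new_data)
```

A Makefile with targets that run `python train.py` and `python predict.py` is one simple way to keep the train-then-predict steps reproducible, as you suggest.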
Brian Spiering