
Is there a way to keep a variable (a large table / data frame) in memory and share it across multiple IPython notebooks?

I'm looking for something conceptually similar to MATLAB's persistent variables: there it is possible to call a custom function / library from multiple individual editors (notebooks) and have that external function cache some result (or large table).

Mostly I would like to avoid reloading a heavily used table (which is loaded through a custom library called from the notebooks), since reading it takes around 2-3 minutes whenever I start a new analysis.
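A minimal sketch of the kind of caching I mean (the file name and function are placeholders); this only helps within a single kernel, since every notebook runs its own Python process:

```python
import functools
import pandas as pd

@functools.lru_cache(maxsize=1)
def load_big_table():
    # Expensive read (~2-3 minutes); cached after the first call,
    # but only for the lifetime of this one kernel/process.
    return pd.read_csv("big_table.csv")
```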

tsttst

1 Answer


If it's important for your use case, you could try switching to Apache Zeppelin. There, all Spark notebooks share the same Spark context and the same Python runtime environment. https://zeppelin.apache.org/

So what you're asking for happens natively in Zeppelin. To be precise, sharing the same Spark context / the same Python environment between all Spark notebooks (they're called 'notes' in Zeppelin) is a configuration option:

(Screenshot: Spark interpreter sharing options in Zeppelin)

So you can choose to share the context Globally (Zeppelin's default behavior), Per Note (the only behavior possible in Jupyter), or Per User.
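As a rough sketch of what a globally shared interpreter gives you (the note names, path, and column are made up): a DataFrame cached in one note is directly visible from another note, because both run in the same interpreter process:

```python
# In note "load_data" (a %pyspark paragraph):
big_table = spark.read.parquet("/data/big_table.parquet")  # hypothetical path
big_table.cache().count()  # materialize the table once in the shared Spark context

# In note "analysis" (a different note bound to the same globally shared interpreter):
big_table.groupBy("some_column").count().show()  # reuses the already-cached DataFrame
```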

If you can't / don't want to switch to Zeppelin, look at other options for sharing common DataFrames between your notebooks, for example caching them to disk in a fast columnar format or serving them from a separate shared process.
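As one hedged example of the on-disk route (the paths, function name, and the pandas/pyarrow dependency are assumptions): have the custom loading library write the table to a Parquet cache the first time, so every other notebook reads it back in seconds instead of minutes:

```python
import os
import pandas as pd

CACHE_PATH = "/tmp/big_table.parquet"  # hypothetical cache location shared by all notebooks

def load_big_table(source="/data/big_table.csv"):
    """Load the big table, using a fast Parquet cache after the first slow read."""
    if os.path.exists(CACHE_PATH):
        return pd.read_parquet(CACHE_PATH)  # fast columnar read (seconds)
    df = pd.read_csv(source)                # original slow load (~2-3 minutes)
    df.to_parquet(CACHE_PATH)               # cache for subsequent notebooks
    return df
```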

P.S. You can't currently import .ipynb files into Zeppelin (it has its own notebook format, stored as a JSON file) until https://issues.apache.org/jira/browse/ZEPPELIN-1793 is implemented; although it's not that hard to convert them manually in most cases.

Tagar