5

See R's {drake}. It lets you define a reproducible pipeline:

library(drake)
library(dplyr)

# create_plot() is a user-defined helper (not shown here)
plan <- drake_plan(
  raw_data = readxl::read_excel(file_in("raw_data.xlsx")),
  data = raw_data %>%
    mutate(Species = forcats::fct_inorder(Species)),
  hist = create_plot(data),
  fit = lm(Sepal.Width ~ Petal.Width + Species, data),
  report = rmarkdown::render(
    knitr_in("report.Rmd"),
    output_file = file_out("report.html"),
    quiet = TRUE
  )
)

# call the pipeline
make(plan)

The great thing about {drake} is that you can reload any of raw_data, data, hist, fit, or report at any point. And if you change part of the code and run make(plan), {drake} will figure out what has changed and run only that.

xiaodai

3 Answers

3

Scikit-learn has pipelines. If your steps implement fit and transform, you can chain them with the Pipeline class in sklearn.pipeline.

Read the docs:

https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

Additionally, you can save and load a pipeline object with joblib.dump and joblib.load.
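A minimal sketch of the above: a scaler chained to a classifier, fit once, then persisted and reloaded with joblib. The step labels ("scaler", "clf") and the file name are arbitrary choices for illustration, not anything sklearn requires.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import joblib

X, y = load_iris(return_X_y=True)

# Each intermediate step must implement fit/transform;
# the final step only needs fit (here, a classifier).
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Persist the whole fitted pipeline and reload it
joblib.dump(pipe, "pipeline.joblib")
restored = joblib.load(pipeline_path := "pipeline.joblib")

# The restored pipeline behaves identically to the original
assert (restored.predict(X) == pipe.predict(X)).all()
```

Calling pipe.fit runs fit_transform on every intermediate step in order, so the scaler is fit only on the data the classifier sees.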

Ilker Kurtulus
2

For larger projects, snakemake is the way to go in Python (it extends Python syntax; valid Python is valid snakemake). It originates in bioinformatics and even has its own publication; it is widely adopted and used by many projects (see the literature list in the first link or the citations for the linked article).
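To give a flavor of the syntax, here is a hypothetical two-rule Snakefile (file names and the script are made up for illustration). Like {drake}, snakemake rebuilds only the targets whose inputs have changed:

```python
# Snakefile -- the default target is the first rule
rule all:
    input:
        "plots/result.png"

# This rule reruns only when data/raw.csv (or the script) changes
rule plot:
    input:
        "data/raw.csv"
    output:
        "plots/result.png"
    shell:
        "python scripts/plot.py {input} {output}"
```

Running `snakemake` resolves the dependency graph from inputs to outputs and executes outdated rules, in parallel where possible.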

For Jupyter notebook based projects, I made an experiment called nbpipeline which you may be interested in.

krassowski
1

Ploomber works the same way: it keeps track of your source code and runs only the outdated steps needed to bring your pipeline up to date: https://github.com/ploomber/ploomber

Disclaimer: I'm the project's author

Eduardo