Questions tagged [version-control]

15 questions
63 votes, 9 answers

Tools and protocol for reproducible data science using Python

I am working on a data science project using Python. The project has several stages. Each stage consists of taking a data set, using Python scripts, auxiliary data, configuration and parameters, and creating another data set. I store the code in…
Yuval F
63 votes, 11 answers

How to deal with version control of large amounts of (binary) data

I am a PhD student in Geophysics and work with large amounts of image data (hundreds of GB, tens of thousands of files). I know svn and git fairly well and have come to value a project history, combined with the ability to easily work together and have…
Johann
7 votes, 1 answer

At the end of a big DS project, should I make trained models available on GitHub?

I have almost completed two big personal Data Science projects based on Deep Learning. They are the fanciest models I've implemented up to now, and I'm pushing all my code to GitHub. Do you advise uploading trained models too? Or should I let other users…
Leevo
4 votes, 2 answers

Merging data approach in Data Science projects

This is more of an infrastructure question about data science. How would you manage data merging in your GitHub repository? As an example, as a data scientist I might be working on my branch and developing code, analysis, etc.; merging code…
Mattia Surricchio
3 votes, 1 answer

ValueError when loading sklearn DecisionTreeClassifier pickle in Python 3.10

I'm encountering an issue while transitioning from Python 3.7.3 to Python 3.10 due to the deprecation of the older version. The problem arises when attempting to load a pickled sklearn DecisionTreeClassifier model. Environment: Original: Python…
user3369545
2 votes, 1 answer

What is the difference between Pachyderm and Git?

I learned that tools like Pachyderm version-control data, but I cannot see any difference between that tool and Git. I learned from this post that: it holds all your data in a central accessible location; it updates all dependent data sets when…
Lerner Zhang
2 votes, 1 answer

What is the right way to store datasets for a CNN project

Our image classification project has thousands of raw photos, masks and reshaped images. We store source code in git. But datasets don't belong in source code version control. How should we store these sets of images?
sixtytrees
2 votes, 1 answer

Dataset management: What are some strategies/solutions for efficiently storing datasets with their versions?

The problem: I have N independent classification models, and for each of these N models I have different versions (e.g., V0, V1, ..., Vfinal_production, Vexperimental). I'm looking for a way to store my datasets efficiently on the cloud (for…
ngub05
1 vote, 0 answers

Embedding git commit into the resulting data

Our pipeline works something like this: collect a bunch of raw data (10-100 GB) from a microscope; process the data using MATLAB scripts; change a few parameters based on the raw data, and add new features to the scripts; commit the scripts with new features…
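One way to picture what the title asks for is a small sidecar file written next to each output. The sketch below is in Python rather than MATLAB and is only an illustration: the helper names `current_commit` and `write_provenance`, the `.provenance.json` suffix, and the record layout are all hypothetical and not taken from the question.

```python
import json
import subprocess
from pathlib import Path


def current_commit(repo_dir="."):
    """Return the HEAD commit hash of repo_dir, or "unknown" if it cannot be read."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not inside a repository
        return "unknown"
    return result.stdout.strip()


def write_provenance(output_file, commit, parameters):
    """Write a JSON sidecar next to output_file recording commit and parameters."""
    output_file = Path(output_file)
    # Hypothetical naming scheme: <output name>.provenance.json
    sidecar = output_file.with_name(output_file.name + ".provenance.json")
    record = {
        "output": output_file.name,
        "commit": commit,
        "parameters": parameters,
    }
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True))
    return sidecar
```

A processing step could call `write_provenance(path, current_commit(), params)` right after saving each result, so every data file can be traced back to the script version that produced it.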
1 vote, 0 answers

Keras trained model exported with older version of Keras (< 2.2.0)

Is it possible to update a trained model saved in a file without retraining it? I found the model on the web and I would like to use it, but it uses Merge layers, which are not supported by newer versions of Keras, making it impossible to load with…
1 vote, 0 answers

Suggestions on practices for model and dataset version documentation

I want to steer my question towards the practical side of ML. As a practitioner, I feel keeping different versions of models and datasets is difficult. From time to time I need to revisit my data and model code to verify if certain assumptions are…
Student
0 votes, 1 answer

How to version data science projects with large files

I am working on a project with large data files (~300MB). I want to version my work along with the data files so that it is always available online. I tried using git-lfs but it has a 1GB/month bandwidth limit, beyond which you're blocked for a…
fireball.1
0 votes, 1 answer

I cannot run MNIST MWE (hello world for DL)

I have installed Anaconda and want to run an MWE for MNIST, but I'm getting this error: D:\STAZENE_last\Anaconda2\Lib\site-packages\torch\cuda\__init__.py:107: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found…
0 votes, 0 answers

Version control for code and output models

I have a question about version control for both code and the models it generates. We are developing ML models that often involve hyperparameters and so we might do many runs with different hyperparameter settings. We currently store the output…
-3 votes, 1 answer

Extract all releases from a Git repository

I would like to examine an existing Git repository and extract all defined releases into a subfolder. For example, if application A had 26 releases, my bash script would extract all 26 versions into subfolders such as: A/(folder) for each of the…
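A rough sketch of the idea, written in Python instead of bash and assuming that each release corresponds to a Git tag (the question excerpt does not say how releases are defined). The function names and the rule of flattening slashes in tag names are hypothetical choices; the Git commands used are `git tag --list` and `git archive`.

```python
import io
import subprocess
import tarfile
from pathlib import Path


def list_tags(repo):
    """Return all tag names in the repository at `repo`."""
    result = subprocess.run(
        ["git", "-C", str(repo), "tag", "--list"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def release_dir(base, tag):
    """Map a tag to its output folder; slashes in tag names are flattened."""
    return Path(base) / tag.replace("/", "_")


def extract_releases(repo, base):
    """Export the tree of every tag into its own subfolder of `base`."""
    for tag in list_tags(repo):
        target = release_dir(base, tag)
        target.mkdir(parents=True, exist_ok=True)
        # `git archive` writes a tar stream of the tagged tree to stdout
        archive = subprocess.run(
            ["git", "-C", str(repo), "archive", "--format=tar", tag],
            capture_output=True,
            check=True,
        )
        with tarfile.open(fileobj=io.BytesIO(archive.stdout)) as tar:
            tar.extractall(target)
```

Calling `extract_releases("path/to/repo", "A")` would then create one subfolder of `A` per tag. Holding each archive in memory keeps the sketch short but may not suit very large repositories.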