
The problem: I have N independent classification models, and for each of these N models I have several dataset versions (e.g. V0, V1, ..., Vfinal_production, Vexperimental). I'm looking for a way to store these datasets efficiently in the cloud (for redundancy).

Note: We're not talking about BigData here.

Current Solution: I created a private GitHub repo, made N directories (one per model), and inside each directory pushed the different dataset versions as separate files.
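For concreteness, the layout looks roughly like this (model and version names below are just placeholders):

```
repo/
├── model_A/
│   ├── dataset_v0.csv
│   ├── dataset_v1.csv
│   └── dataset_final_production.csv
└── model_B/
    ├── dataset_v0.csv
    └── dataset_experimental.csv
```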

Are there better solutions for this? A full VCS feels like overkill for this problem.


1 Answer


I used my existing directory structure with Git LFS and it works well.

Here are the advantages I got from Git LFS compared to the alternatives I considered:

  • It saves space: Git LFS stores my dataset files (100 MB to 1 GB each) as lightweight pointers in the repository and only keeps the versions that are actually checked out on local disk.

  • It works seamlessly with a Git-enabled repository: I had planned to use a separate storage service (like Amazon S3) just for the datasets, but that would have required extra effort to manage multiple versions and keep the dataset files in sync with the code.
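For reference, a minimal sketch of the setup on top of an existing repo; the file patterns and paths are hypothetical, so adjust them to your own layout:

```
# One-time: install the Git LFS hooks for this repository
git lfs install

# Tell LFS which files to manage (patterns are examples)
git lfs track "*.csv" "*.parquet"

# The tracking rules are written to .gitattributes, which must be committed
git add .gitattributes

# From here on, tracked files are committed and pushed as usual;
# only pointer files live in the Git history, the data goes to LFS storage
git add model_A/dataset_v0.csv
git commit -m "Add dataset V0 for model A via Git LFS"
git push origin main
```

After this, `git checkout` of any version pulls down only the dataset files needed for that revision, which is what keeps the local footprint small.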
