I'm very new to machine learning. I am doing a project for a subject called parallel and distributed computing, in which we have to speed up a heavy computation using parallelism or distributed computing. My idea was to divide a dataset into equal parts and train a separate neural network on each subset on a separate machine in the cloud. Once the models are trained, they would be returned to me and somehow combined into a single model. I am aware of federated learning, but it doesn't quite fit my scenario, where I actually split the dataset and send the parts to the cloud. Does someone know any feasible approaches (maybe a variant of federated learning) for doing this?
1 Answer
There are many ways to parallelize machine learning. It is often better to distribute the model parameters, not the data.
Training each model on only a subset of the data will result in worse parameter estimates than training a single model on random samples drawn from the whole dataset.
Additionally, moving data around is more expensive than moving parameters.
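To see why parameter averaging over data shards can degrade estimates, here is a minimal, hypothetical sketch of the asker's plan using closed-form linear regression as a stand-in for training a neural network: the data is split into equal shards, one model is fit per shard (each shard could live on a separate machine), and the trained models are combined by averaging their parameters, which is the core idea behind federated averaging.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = X @ true_w + noise
n_samples, n_features = 1000, 5
true_w = rng.normal(size=n_features)
X = rng.normal(size=(n_samples, n_features))
y = X @ true_w + 0.1 * rng.normal(size=n_samples)

def fit_least_squares(X, y):
    """Closed-form linear regression; stands in for training one network."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

# Split the data into 4 equal shards and fit one model per shard.
shard_models = [fit_least_squares(Xs, ys)
                for Xs, ys in zip(np.array_split(X, 4), np.array_split(y, 4))]

# Combine the shard models by averaging their parameters.
averaged_w = np.mean(shard_models, axis=0)

# Baseline: a single model trained on all of the data.
full_w = fit_least_squares(X, y)

print("error of averaged model:", np.linalg.norm(averaged_w - true_w))
print("error of full-data model:", np.linalg.norm(full_w - true_w))
```

For a linear model on i.i.d. shards the averaged estimate is close to the full-data fit, but for neural networks the averaged weights generally do not correspond to any trained network, which is why federated-style schemes average repeatedly during training rather than once at the end.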
Brian Spiering