Questions tagged [parallel]
The parallel tag on Data Science Stack Exchange covers questions about parallel computing and processing in data science workflows, including distributing tasks across multiple processors or machines to improve computational efficiency.
39 questions
27 votes, 4 answers
Is there a straightforward way to run pandas.DataFrame.isin in parallel?
I have a modeling and scoring program that makes heavy use of the DataFrame.isin function of pandas, searching through lists of Facebook "like" records of individual users for each of a few thousand specific pages. This is the most time-consuming…
Therriault (871)
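For questions like this one, a commonly suggested approach (not taken from the post itself; the DataFrame, the "page_id" column, and the page list below are hypothetical stand-ins) is to split the frame into chunks and evaluate isin on each chunk in a process pool:

```python
# Hedged sketch: chunk the column and evaluate isin in worker processes.
# `df`, "page_id", and `pages` are made-up stand-ins for the question's data.
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd

def isin_chunk(chunk: pd.Series, values: set) -> pd.Series:
    # Each worker checks membership for its own slice of rows.
    return chunk.isin(values)

if __name__ == "__main__":
    df = pd.DataFrame({"page_id": np.random.randint(0, 5000, size=1_000_000)})
    pages = set(range(0, 5000, 7))                 # a few hundred "liked" pages
    chunks = np.array_split(df["page_id"], 8)      # one slice per worker
    with ProcessPoolExecutor(max_workers=8) as pool:
        parts = list(pool.map(isin_chunk, chunks, [pages] * len(chunks)))
    mask = pd.concat(parts)                        # same row order as df
    print(mask.sum(), "rows match")
```

Whether this pays off depends on chunk size: the per-process pickling overhead has to be smaller than the isin work it saves.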
16 votes, 3 answers
Parallel and distributed computing
What are the differences between parallel and distributed computing? When it comes to scalability and efficiency, it is very common to see solutions dealing with computations in clusters of machines, and sometimes it is referred to as a…
Rubens (4,117)
15 votes, 1 answer
Make Keras run on a multi-machine, multi-core CPU system
I'm working on a Seq2Seq model using LSTM from Keras (with a Theano backend) and I would like to parallelize the training, because even a few MB of data need several hours to train.
It is clear that GPUs are far better at parallelization…
chmodsss (1,974)
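As a rough illustration of the single-machine, multi-core part of this question, here is a sketch assuming a TensorFlow backend (the question actually uses Theano, where the analogous knob is the OMP_NUM_THREADS environment variable); the thread counts and model shape are arbitrary:

```python
# Hedged sketch: configure Keras/TensorFlow CPU thread pools before any op runs.
import tensorflow as tf

tf.config.threading.set_intra_op_parallelism_threads(8)   # threads inside one op (e.g. a matmul)
tf.config.threading.set_inter_op_parallelism_threads(2)   # independent ops that may run concurrently

from tensorflow import keras

model = keras.Sequential([
    keras.layers.LSTM(64, input_shape=(50, 32)),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(...) now uses the configured thread pools on a single machine;
# multi-machine training needs a distribution strategy or a different framework.
```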
12 votes, 3 answers
What needs to be done to make n_jobs work properly in sklearn, in particular in ElasticNetCV?
The constructor of sklearn.linear_model.ElasticNetCV takes n_jobs as an argument. Quoting the documentation:
n_jobs: int, default=None
Number of CPUs to use during the cross validation. None means 1 unless in a joblib.parallel_backend context.…
OldSchool (261)
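A minimal illustration of the two usages the quoted docstring describes, with synthetic data and arbitrary parameter values:

```python
# Hedged sketch: pass n_jobs directly, or leave it as None and rely on an
# enclosing joblib.parallel_backend context, as the documentation states.
from joblib import parallel_backend
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNetCV

X, y = make_regression(n_samples=500, n_features=50, noise=0.1, random_state=0)

# Option 1: the estimator spawns its own workers (-1 = all cores).
model = ElasticNetCV(cv=5, n_jobs=-1)
model.fit(X, y)

# Option 2: n_jobs=None defers to the surrounding joblib backend.
model2 = ElasticNetCV(cv=5)
with parallel_backend("loky", n_jobs=4):
    model2.fit(X, y)

print(model.alpha_, model2.alpha_)
```

The parallelism here is over cross-validation folds (and l1_ratio values), so with cv=5 there is little to gain from more than a handful of workers.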
12 votes, 3 answers
Instances vs. cores when using EC2
Working on what could often be called "medium data" projects, I've been able to parallelize my code (mostly for modeling and prediction in Python) on a single system across anywhere from 4 to 32 cores. Now I'm looking at scaling up to clusters on…
Therriault (871)
11 votes, 1 answer
GPU Accelerated Data Processing for R in Windows
I'm currently taking a paper on Big Data which has us utilising R heavily for data analysis. I happen to have a GTX 1070 in my PC for gaming reasons. Thus, I thought it would be really cool if I could use that to speed up some of the processing for…
Jesse Maher (113)
4 votes, 1 answer
Parallel Q-learning
I'm looking for academic papers or other credible sources on parallelized reinforcement learning, specifically Q-learning.
I'm mostly interested in methods of sharing a Q-table between processes (or joining/syncing them together…
Luke (189)
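Not an answer to the literature request, but as a concrete illustration of the sharing scheme the question mentions, here is a minimal sketch of several processes updating one Q-table held in shared memory; the environment, table sizes, and learning rate are made up:

```python
# Hedged sketch: workers update a shared Q-table, serialised with a lock.
import numpy as np
from multiprocessing import Array, Lock, Process

N_STATES, N_ACTIONS = 16, 4

def worker(shared_q, lock, episodes, seed):
    rng = np.random.default_rng(seed)
    # View the shared buffer as a NumPy array (no copy).
    q = np.frombuffer(shared_q.get_obj()).reshape(N_STATES, N_ACTIONS)
    for _ in range(episodes):
        s = rng.integers(N_STATES)
        a = rng.integers(N_ACTIONS)
        r = rng.random()                      # stand-in for the environment's reward
        s_next = rng.integers(N_STATES)
        with lock:                            # serialise each Q-update
            q[s, a] += 0.1 * (r + 0.9 * q[s_next].max() - q[s, a])

if __name__ == "__main__":
    shared_q = Array("d", N_STATES * N_ACTIONS)   # doubles, zero-initialised
    lock = Lock()
    procs = [Process(target=worker, args=(shared_q, lock, 1000, i)) for i in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()
    print(np.frombuffer(shared_q.get_obj()).reshape(N_STATES, N_ACTIONS))
```

Lock-free or periodically-synced variants trade staleness for throughput, which is exactly the design space the question is asking about.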
4 votes, 1 answer
Open source solver for a large mixed integer programming task?
I'm currently using the General Algebraic Modeling System (GAMS), and more specifically CPLEX within GAMS, to solve a very large mixed integer programming problem. This allows me to parallelize the process over 4 cores (although I have more, CPLEX…
rnorberg (203)
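One open-source route often suggested for this kind of question is CBC, reachable from Python via PuLP. The model below is a toy, and the threads option is simply passed through to CBC:

```python
# Hedged sketch: a tiny MIP solved with the bundled CBC solver via PuLP.
import pulp

prob = pulp.LpProblem("tiny_mip", pulp.LpMaximize)
x = pulp.LpVariable("x", lowBound=0, cat="Integer")
y = pulp.LpVariable("y", lowBound=0, cat="Integer")
prob += 3 * x + 2 * y                 # objective
prob += 2 * x + y <= 10               # constraints
prob += x + 3 * y <= 15

# `threads` asks CBC to run its branch-and-bound in parallel.
prob.solve(pulp.PULP_CBC_CMD(msg=False, threads=4))
print(pulp.LpStatus[prob.status], x.value(), y.value())
```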
4 votes, 2 answers
Parallel active optimization
I'm trying to optimize an expensive function for which I can choose sample points. The difficulty is that many function evaluations may be computed in parallel, taking varying amounts of time. I don't know which keywords to search for to find…
Mark (213)
2 votes, 1 answer
What makes a graph algorithm a good candidate for concurrency?
GraphX is the Apache Spark library for handling graph data. I was able to find a list of 'graph-parallel' algorithms on these slides (see slide 23). However, I am curious what characteristics of these algorithms make them parallelizable.
sheldonkreger (1,169)
2 votes, 1 answer
Can parallel computing be utilized for boosting?
Since boosting is sequential, does that mean we cannot use multi-processing or multi-threading to speed it up? If my computer has multiple CPU cores, is there any way to utilize these extra resources in boosting?
Indominus (155)
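The usual answer is that the boosting rounds themselves are sequential, but the work inside each round (searching for splits across features and rows) parallelizes well. A hedged sketch with XGBoost's n_jobs, on synthetic data with arbitrary settings:

```python
# Hedged sketch: trees are built one after another, but each tree's split
# search uses several threads via n_jobs. Data and hyperparameters are placeholders.
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)

clf = XGBClassifier(
    n_estimators=200,     # sequential boosting rounds
    max_depth=6,
    n_jobs=8,             # threads used inside each round's tree construction
    tree_method="hist",
)
clf.fit(X, y)
print(clf.score(X, y))
```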
2 votes, 0 answers
What should be the value of parallel_iterations in TensorFlow RNN implementations?
tf.nn.dynamic_rnn() and tf.nn.raw_rnn() take an argument called parallel_iterations. The documentation says:
parallel_iterations: (Default: 32). The number of iterations to run in parallel. Those operations which do not have any temporal…
figs_and_nuts (903)
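For reference, this is where the argument sits in the TF 1.x API the question quotes; the cell size, sequence length, and the value 64 are arbitrary:

```python
# Hedged sketch against the TensorFlow 1.x API referenced in the question.
# parallel_iterations only helps where loop iterations have no data dependency.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

inputs = tf.placeholder(tf.float32, [None, 100, 32])   # (batch, time, features)
cell = tf.nn.rnn_cell.LSTMCell(num_units=128)
outputs, state = tf.nn.dynamic_rnn(
    cell,
    inputs,
    dtype=tf.float32,
    parallel_iterations=64,   # loop iterations allowed to run concurrently (default 32)
    swap_memory=False,
)
```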
2 votes, 0 answers
Parallel processing for feature selection in a microarray dataset
I want to apply feature selection to a dataset with some 30-40K columns and 100 rows (total size: 400 MB-800 MB). To reduce the time spent on the (feature-feature) calculations involved, I want to divide the data into 4-5 parts and execute all…
phoenix (21)
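One hedged way to do the split the question describes is to score column blocks in parallel with joblib; the random data, the univariate f_classif scorer, and the block count below are placeholders for the real pipeline:

```python
# Hedged sketch: split a wide matrix into column blocks and score features
# per block in parallel processes.
import numpy as np
from joblib import Parallel, delayed
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30_000))            # ~100 samples, tens of thousands of features
y = rng.integers(0, 2, size=100)

def score_block(block: np.ndarray, target: np.ndarray) -> np.ndarray:
    f_scores, _ = f_classif(block, target)    # univariate score for each column in the block
    return f_scores

blocks = np.array_split(X, 5, axis=1)         # 4-5 parts, as in the question
scores = Parallel(n_jobs=5)(delayed(score_block)(b, y) for b in blocks)
scores = np.concatenate(scores)               # back to one score per original column
top_idx = np.argsort(scores)[::-1][:100]      # e.g. keep the 100 best features
print(top_idx[:10])
```

Pairwise feature-feature measures can be parallelized the same way by assigning blocks of column pairs to workers instead of blocks of columns.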
2 votes, 0 answers
Scalable training/updating of many small LSTM models
My situation is that I have many thousands of devices which each have their own specific LSTM model for anomaly prediction. These devices behave wildly differently so I don't think there is any way to have a shared global model, unfortunately.…
NMR (53)
2 votes, 1 answer
XGBoost GPU version not outperforming CPU on small dataset despite parameter tuning – suggestions needed
I'm currently working on a Parallel and Distributed Computing project where I'm comparing the performance of both XGBoost and CatBoost when trained on CPU vs GPU. The goal is to demonstrate how GPU acceleration can improve training time, especially…
Mxneeb (21)
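A hedged sketch of the kind of timing comparison this question describes, using XGBoost's device switch; the dataset size and parameters are placeholders, and on small data the CPU run often wins because the GPU never amortises its transfer and setup costs:

```python
# Hedged sketch: time the same model on CPU and GPU. Assumes XGBoost >= 2.0
# (older versions select the GPU with tree_method="gpu_hist" instead of device="cuda").
import time

from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500_000, n_features=100, random_state=0)

def train(device: str) -> float:
    clf = XGBClassifier(n_estimators=300, max_depth=8, tree_method="hist", device=device)
    start = time.perf_counter()
    clf.fit(X, y)
    return time.perf_counter() - start

print("cpu :", train("cpu"))
print("cuda:", train("cuda"))   # needs enough rows/columns to pay for host-to-device transfers
```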