I've been using multiprocessing and parallelisation for the first time this week, on a very large data set with 32 CPUs. I decided to explore it on a smaller task, using just the 4 CPUs of my Mac, to see what I could learn.
I created a task that adds 100 to every element of a 500,000-element list. To my surprise, batching the data and using Python's multiprocessing tools actually slowed it down hugely, compared to just looping through the 500,000 elements and adding 100 to each.
I'd like to understand why.
Consider the two methods for doing this task below:
import numpy as np
from multiprocessing import Pool, cpu_count
# gensim helper used as the Pool initialiser; makes worker processes ignore SIGINT
from gensim.corpora.wikicorpus import init_to_ignore_interrupt
from itertools import zip_longest
import timeit as t

def grouper(iterable, n, fillvalue=None):
    # Collect the iterable into fixed-length chunks, padding the last
    # chunk with fillvalue if it comes up short
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
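# For reference, grouper chunks an iterable into fixed-size tuples,
# padding the final tuple with fillvalue, e.g.
#   list(grouper(range(7), 3)) -> [(0, 1, 2), (3, 4, 5), (6, None, None)]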
class Add100ToData():
    def __init__(self):
        self.data = [np.random.randint(0, 100) for _ in range(500000)]

    def add100(self):
        # Sequential baseline: add 100 to every element in place
        for i in range(len(self.data)):
            self.data[i] = self.data[i] + 100
        return self.data
class Add100ToDataMultiprocess():
    def __init__(self):
        self.data = [np.random.randint(0, 100) for _ in range(500000)]

    def process_batch(self, batch):
        # Skip the None padding that grouper adds to fill out the last batch
        return [i + 100 for i in batch if i is not None]

    def add100(self, batch_size):
        processes = cpu_count()
        pool = Pool(processes, init_to_ignore_interrupt)
        gr = grouper(self.data, batch_size)
        idx = 0  # running position across batches, so each result lands in the right slot
        for batch_result in pool.imap(self.process_batch, gr):
            for value in batch_result:
                self.data[idx] = value
                idx += 1
        pool.close()
        pool.join()
        return self.data
if __name__ == "__main__":
    add1 = Add100ToData()
    start = t.default_timer()
    final1 = add1.add100()
    end = t.default_timer()
    print("Looping run-time: {:.2f} seconds".format(end - start))

    add2 = Add100ToDataMultiprocess()
    start = t.default_timer()
    final2 = add2.add100(batch_size=10000)  # 50 batches of 10,000 elements
    end = t.default_timer()
    print("Multiprocessing run-time: {:.2f} seconds".format(end - start))
This gives me:
Looping run-time: 0.13 seconds
Multiprocessing run-time: 1.23 seconds
Why is simply looping through the data faster than batching and parallelising it, when the task is this simple?
When I did this for far more labour-intensive tasks (one was transforming 800,000 sentences into their 300-dimensional word embeddings, another was applying a classifier to those embeddings), I got huge speed improvements using 32 CPUs on Google Cloud, with a very similar code structure to the sketch below.
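For context, that code looked roughly like this (a heavily simplified sketch; embed_sentence is a dummy stand-in for the real embedding lookup, and the sentence count and batch size here are just illustrative):

import numpy as np
from itertools import zip_longest
from multiprocessing import Pool, cpu_count

def grouper(iterable, n, fillvalue=None):
    # Same chunking helper as above
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def embed_sentence(sentence):
    # Dummy stand-in: the real code averaged pre-trained 300-dimensional
    # word vectors over the words in the sentence
    return np.random.standard_normal(300)

def embed_batch(batch):
    # Skip the None padding that grouper adds to the final batch
    return [embed_sentence(s) for s in batch if s is not None]

if __name__ == "__main__":
    sentences = ["sentence {}".format(i) for i in range(10000)]  # 800,000 in the real task
    with Pool(cpu_count()) as pool:
        embeddings = []
        for batch_result in pool.imap(embed_batch, grouper(sentences, 1000)):
            embeddings.extend(batch_result)

As you can see, it's essentially the same imap-over-batches pattern as my toy example above.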
Can someone help me to understand why I'm not getting speed improvements here?