Why does parallelising slow down this simple problem against looping through all the data?

Question

I've been using multiprocessing and parallelisation for the first time this week on a very large data set using 32 CPUs. I decided to explore it for a smaller task just to see if I could learn anything, just on the 4 CPUs of my Mac.

I created a task to add 100 to every element in a 500,000 element list. To my surprise, I noticed that batching this data and using Python's parallelising tools to implement this actually slowed it down hugely, compared to just looping through the 500,000 elements and adding 1.

I'd like to understand why.

Consider the two methods for doing this task below:

import numpy as np 
from sqlitedict import SqliteDict
from multiprocessing import Pool, cpu_count
from gensim.corpora.wikicorpus import init_to_ignore_interrupt
from itertools import zip_longest
import timeit as t

def grouper(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n 
    return zip_longest(*args, fillvalue=fillvalue)

class Add100ToData():
    def __init__(self):
        self.data = [np.random.randint(0, 100) for _ in range(500000)]

    def add100(self):
        for i in range(len(self.data)):
            self.data[i] = self.data[i] + 100
        return self.data

class Add100ToDataMultiprocess():
    def __init__(self):
        self.data = [np.random.randint(0, 100) for _ in range(500000)]

    def process_batch(self, batch):
        new_data = []
        for i in batch:
            new_data.append(i + 100)
        return new_data

    def add100(self, batch_size):
        processes = cpu_count()
        pool = Pool(processes, init_to_ignore_interrupt)
        gr = grouper(self.data, batch_size)

        for batch_result in pool.imap(self.process_batch, gr):
            count = 0
            for i in batch_result:
                count += 1
                self.data[count] = i
        return self.data

if __name__ == "__main__":
    add1 = Add100ToData()
    start = t.default_timer()
    final1 = add1.add100()
    end = t.default_timer()
    print("Looping run-time: {:.2f} seconds".format(end - start))

    add2 = Add100ToDataMultiprocess()
    start = t.default_timer()
    final2 = add2.add100()
    end = t.default_timer()
    print("Looping run-time: {:.2f} seconds".format(end - start))

This gives me:

Looping run-time: 0.13 seconds
Multiprocessing run-time: 1.23 seconds

Why is there an improvement in simply looping through the data as opposed to parallelising and batching with simple tasks?

When I was doing this for a far more labour-intensive task (one task was transforming 800,000 sentences into their 300-dimensional word embeddings, and another was applying a classifier to these), I gained huge speed improvements using 32 CPUs on the Google cloud, with a very similar code structure to this.

Can someone help me to understand why I'm not getting speed improvements here?

score 3 · Accepted Answer · answered Aug 02 '18 at 16:53

Parallelism has costs. The processes have to be scheduled, communicate with each other, manage resources, etc. In return you can do multiple things at the same time.

When you have a lot of slow tasks that can be done independently, parallel processing will speed things up a lot.

But when you try to parallelize an easy task it might take longer to handle the overhead than to actually do the work. That seems to be the case here.

Why does parallelising slow down this simple problem against looping through all the data?

1 Answers1