
Multiprocessing a dictionary in Python

I have two dictionaries of data, and I created a function that acts as a rules engine to analyze entries in each dictionary and act on them based on specific metrics I set (if it helps, each entry in the dictionary is a node in a graph, and if the rules match I create edges between them).

Here's the code I use (it's a for loop that passes parts of the dictionary to a rules function; I refactored my code based on a tutorial I read):

import multiprocessing

jobs = []

def loadGraph(dayCurrent, day2Previous):
    for dayCurrentCount in graph[dayCurrent]:
        dayCurrentValue = graph[dayCurrent][dayCurrentCount]
        for day1Count in graph[day2Previous]:
            day1Value = graph[day2Previous][day1Count]
            # the original serial call:
            #rules(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous)
        # spawn one process per current-day entry; note the args use the
        # last day1Count/day1Value left over from the inner loop
        p = multiprocessing.Process(target=rules, args=(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous))
        jobs.append(p)
        p.start()
        print ' in rules engine for day', dayCurrentCount, ' and we are about ', ((len(graph[dayCurrent]) - dayCurrentCount) / float(len(graph[dayCurrent])))

The data I'm studying can be rather large (can, because it's randomly generated). I think for each day there are about 50,000 entries. Because most of the time is spent on this stage, I was wondering if I could use the 8 cores I have available to help process this faster.

Because each dictionary entry is being compared to a dictionary entry from the day before, I thought the processes could be split up by that, but my code above is slower than running it normally. I think this is because it's creating a new process for every entry it handles.

Is there a way to speed this up and use all my CPUs? My problem is, I don't want to pass the entire dictionary, because then one core will get stuck processing it; I would rather have the work split across CPUs in a way that maximizes use of all free CPUs.
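
To make that concrete, here's a rough sketch of the kind of split I mean, using a multiprocessing.Pool (I haven't tried this yet; compare_entry is just a name I made up, and it assumes graph and rules are visible to the workers, e.g. inherited via fork on Linux):

import multiprocessing

def compare_entry(args):
    # compare one current-day entry against every previous-day entry
    dayCurrent, day2Previous, dayCurrentCount, dayCurrentValue = args
    for day1Count, day1Value in graph[day2Previous].items():
        rules(day1Count, day1Value, dayCurrentCount, dayCurrentValue,
              dayCurrent, day2Previous)

def loadGraph(dayCurrent, day2Previous):
    work = [(dayCurrent, day2Previous, k, v) for k, v in graph[dayCurrent].items()]
    # a fixed pool of one worker per core; the pool hands out work
    # items as workers free up, instead of one process per entry
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    pool.map(compare_entry, work)
    pool.close()
    pool.join()

One caveat: each worker runs in its own process, so if rules() records edges by mutating a shared structure, compare_entry would need to return its edges and the parent would collect them from pool.map's results instead.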

I'm totally new to multiprocessing so I'm sure there's something easy I'm missing. Any advice/suggestions or reading material would be great!

What I've done in the past is to create a "worker class" that processes data entries. Then I spin up X threads, each running a copy of the worker class. Each item in the dataset gets pushed onto a queue that the worker threads watch. When there are no more items in the queue, the threads spin down.

Using this method, I was able to process 10,000+ data items using 5 threads in about 3 seconds. When the app was only single-threaded, this would take significantly longer.

Check out: http://docs.python.org/library/queue.html
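
Here's a rough sketch of that pattern (process_item and dataset are placeholders for your per-entry handler and your data, not names from your code):

import threading
import Queue  # named 'queue' in Python 3

NUM_WORKERS = 5

def worker(q):
    # each thread pulls items until the queue is drained, then exits
    while True:
        try:
            item = q.get_nowait()
        except Queue.Empty:
            break
        process_item(item)  # your per-entry handler goes here
        q.task_done()

q = Queue.Queue()
for item in dataset:
    q.put(item)

threads = [threading.Thread(target=worker, args=(q,)) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Keep in mind that with CPython's GIL, threads mostly help when the per-item work involves I/O; for a pure-CPU rules check, the same queue pattern works with multiprocessing workers instead.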

I would recommend looking into MapReduce implementations in Python. Here's a search to get you started: http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=mapreduce+python . Also, take a look at a Python package called Celery: http://celeryproject.org/ . With Celery you can distribute your computation not only among cores on a single machine, but also to a server farm (cluster). You do pay for that flexibility with a more involved setup and maintenance.
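
For a flavor of what that looks like, here's a minimal Celery task sketch (the broker URL, module name, and rules_task are placeholders, not something from your code):

from celery import Celery

# assumes a Redis broker running locally; any broker Celery supports works
app = Celery('rules_engine', broker='redis://localhost:6379/0')

@app.task
def rules_task(day1Count, day1Value, dayCurrentCount, dayCurrentValue,
               dayCurrent, day2Previous):
    # wrap your existing rules() check as a distributable task
    return rules(day1Count, day1Value, dayCurrentCount, dayCurrentValue,
                 dayCurrent, day2Previous)

Callers enqueue work with rules_task.delay(...), and one or more worker processes (started with celery -A rules_engine worker) consume it from the broker, on the same machine or across several.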
