Python3 can't pickle _thread.RLock objects on list with multiprocessing

I'm trying to parse websites that contain cars' properties (154 kinds of properties). I have a huge list (named liste_test) that consists of 280,000 used-car announcement URLs.

def araba_cekici(liste_test,headers,engine):
    for link in liste_test:
        try:
            page = requests.get(link, headers=headers)
        .....
        .....

When I call my code like this:

araba_cekici(liste_test,headers,engine)

It works and returns results. But in approximately 1 hour I could only obtain the properties of 1,500 URLs. It is very slow, so I must use multiprocessing.

I found a solution on here that uses multiprocessing. I applied it to my code, but unfortunately it is not working.

import sys
import numpy as np
import multiprocessing as multi

def chunks(n, page_list):
    """Splits the list into n chunks"""
    return np.array_split(page_list,n)

cpus = multi.cpu_count()

workers = []   
page_bins = chunks(cpus, liste_test)


for cpu in range(cpus):
    sys.stdout.write("CPU " + str(cpu) + "\n")
    # Process that will send the corresponding list of pages
    # to the function araba_cekici
    worker = multi.Process(name=str(cpu), 
                           target=araba_cekici, 
                           args=(page_bins[cpu],headers,engine))
    worker.start()
    workers.append(worker)

for worker in workers:
    worker.join()

And it gives:

TypeError: can't pickle _thread.RLock objects

I found some answers regarding this error, but none of them works (at least I can't apply them to my code). I also tried Python's multiprocessing Pool, but unfortunately it gets stuck in Jupyter Notebook and the code seems to run forever.

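Late answer, but since this question turns up when searching on Google: multiprocessing has to pickle the arguments it sends to the worker processes (this happens when a process is started with the spawn start method, the default on Windows, and whenever data goes through a multiprocessing.Queue), so every object you pass must be picklable.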
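To find out which argument is the culprit, one option is to try pickling each one by hand before handing it to a worker. The snippet below is only a sketch: is_picklable is a hypothetical helper, and the headers dict and the bare RLock are stand-ins for the objects in the question, which are not shown.

```python
import pickle
import threading


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


# Stand-ins for the real arguments (assumptions, not the asker's objects).
headers = {"User-Agent": "example"}  # a plain dict pickles fine
lock = threading.RLock()             # the type named in the error message

print(is_picklable(headers))  # True
print(is_picklable(lock))     # False
```

Any object that holds such a lock somewhere in its attributes fails in the same way, which is why the error can appear even though no lock is passed explicitly.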

In your code, you pass headers and engine, whose implementations you don't show. (Since headers holds the HTTP request headers, I suspect that engine is the issue here.) To solve this, either make engine picklable or instantiate engine only inside the worker process.
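The second option could look roughly like the sketch below. Since the real engine is not shown, FakeEngine and make_engine are hypothetical stand-ins (the fake simply holds an RLock so that it is unpicklable), and the scraping itself is replaced by a placeholder call. The point is only that araba_cekici no longer receives engine as an argument.

```python
import multiprocessing as multi
import threading


class FakeEngine:
    """Hypothetical stand-in for the unshown engine; holds an unpicklable lock."""

    def __init__(self):
        self._lock = threading.RLock()

    def save(self, row):
        with self._lock:
            return row


def make_engine():
    # Hypothetical factory; real code would build the actual engine here.
    return FakeEngine()


def araba_cekici(liste_test, headers):
    # The engine is created inside the worker, so it never has to be pickled.
    engine = make_engine()
    done = 0
    for link in liste_test:
        # Placeholder for the real request and parsing.
        engine.save({"url": link, "headers": headers})
        done += 1
    return done


def run(liste_test, headers, n_workers=2):
    # Only plain lists and dicts cross the process boundary.
    bins = [liste_test[i::n_workers] for i in range(n_workers)]
    workers = [multi.Process(target=araba_cekici, args=(b, headers))
               for b in bins]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return [w.exitcode for w in workers]


if __name__ == "__main__":
    print(run(["u1", "u2", "u3"], {"User-Agent": "example"}))
```

Each worker pays the cost of building its own engine once, which is usually negligible next to the time spent on the requests themselves.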
