Python - multiprocessing threads don't close when using Queue

This is for Python 3.x.

I'm loading records from a CSV file in chunks of 300, then spawning worker threads to submit them to a REST API. I'm saving the HTTP response in a Queue, so that I can get a count of the number of skipped records once the entire CSV file is processed. However, after I added a Queue to my worker, the threads don't seem to close anymore. I want to monitor the number of threads for 2 reasons: (1) once all are done, I can calculate and display the skip counts, and (2) I want to enhance my script to spawn no more than 20 or so threads, so I don't run out of memory.

I have 2 questions:

  • Can someone explain why the thread stays active when using q.put()?
  • Is there a different way to manage the # of threads, and to monitor whether all threads are done?

Here is my code (somewhat simplified, because I can't share the exact details of the API I'm calling):

import requests, json, csv, time, datetime, multiprocessing

TEST_FILE = 'file.csv'

def read_test_data(path, chunksize=300):
    leads = []
    with open(path, 'rU') as data:
        reader = csv.DictReader(data)
        for index, row in enumerate(reader):
            if (index % chunksize == 0 and index > 0):
                yield leads
                del leads[:]
            leads.append(row)
        yield leads

def worker(leads, q):
    payload = {"action":"createOrUpdate","input":leads}
    r = requests.post(url, params=params, data=json.dumps(payload), headers=headers)
    q.put(r.text) # this puts the response in a queue for later analysis
    return

if __name__ == "__main__":
    q = multiprocessing.Queue() # this is a queue to put all HTTP responses in, so we count the skips
    jobs = []
    for leads in read_test_data(TEST_FILE): # This function reads a CSV file and provides 300 records at a time
        p = multiprocessing.Process(target=worker, args=(leads,q,))
        jobs.append(p)
        p.start()
    time.sleep(20) # checking if processes are closing automatically (they don't)
    print(len(multiprocessing.active_children())) ## always returns the number of threads. If I remove 'q.put' from worker, it returns 0

    # The intent is to wait until all workers are done, but it results in an infinite loop
    # when I remove 'q.put' in the worker it works fine
    #while len(multiprocessing.active_children()) > 0:  # 
    #    time.sleep(1)

    skipped_count = 0
    while not q.empty(): # calculate number of skipped records based on the HTTP responses in the queue
        http_response = json.loads(q.get())
        for i in http_response['result']:
            if (i['status'] == "skipped" and i['reasons'][0]['code'] == "1004"):
                skipped_count += 1
    print("Number of records skipped: " + str(skipped_count))

This is most likely because of this documented quirk of multiprocessing.Queue:

Bear in mind that a process that has put items in a queue will wait before terminating until all the buffered items are fed by the “feeder” thread to the underlying pipe. (The child process can call the cancel_join_thread() method of the queue to avoid this behaviour.)

This means that whenever you use a queue you need to make sure that all items which have been put on the queue will eventually be removed before the process is joined. Otherwise you cannot be sure that processes which have put items on the queue will terminate. Remember also that non-daemonic processes will be joined automatically.

Basically, you need to make sure you get() all the items from a Queue to guarantee that all the processes which put something into that Queue will be able to exit.
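For concreteness, here is a minimal sketch (not tested against the real API) of how the main block from the question could be rearranged along those lines: pull one response per spawned process off the queue first, then join. It reuses the worker() and read_test_data() from the question, so url, params and headers are still assumed to be defined elsewhere.

if __name__ == "__main__":
    q = multiprocessing.Queue()
    jobs = []
    for leads in read_test_data(TEST_FILE):
        p = multiprocessing.Process(target=worker, args=(leads, q))
        jobs.append(p)
        p.start()

    # Each worker puts exactly one response, so draining one item per process
    # empties the queue, lets the feeder threads flush, and lets the children exit.
    # (This assumes every request succeeds and actually puts its item.)
    responses = [q.get() for _ in jobs]

    for p in jobs:
        p.join()  # safe now that the queue has been drained

    skipped_count = 0
    for raw_response in responses:
        http_response = json.loads(raw_response)
        for i in http_response['result']:
            if i['status'] == "skipped" and i['reasons'][0]['code'] == "1004":
                skipped_count += 1
    print("Number of records skipped: " + str(skipped_count))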

I think in this case you're better off using a multiprocessing.Pool, and submitting all your jobs to multiprocessing.Pool.map. This simplifies things significantly, and gives you complete control over the number of processes running:

def worker(leads):
    payload = {"action":"createOrUpdate","input":leads}
    r = requests.post(url, params=params, data=json.dumps(payload), headers=headers)
    return r.text

if __name__ == "__main__":
    pool = multiprocessing.Pool(multiprocessing.cpu_count() * 2)  # cpu_count() * 2 processes running in the pool
    responses = pool.map(worker, read_test_data(TEST_FILE))

    skipped_count = 0
    for raw_response in responses:
        http_response = json.loads(raw_response)
        for i in http_response['result']:
            if (i['status'] == "skipped" and i['reasons'][0]['code'] == "1004"):
                skipped_count += 1
    print("Number of records skipped: " + str(skipped_count))

If you're worried about the memory cost of converting read_test_data(TEST_FILE) into a list (which is required to use Pool.map), you can use Pool.imap instead.
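As a rough sketch of that variant (again reusing the worker() and read_test_data() defined above), the skip counting can be done while the lazy iterator is consumed:

if __name__ == "__main__":
    pool = multiprocessing.Pool(multiprocessing.cpu_count() * 2)
    skipped_count = 0
    # imap pulls chunks from read_test_data() one at a time instead of
    # materializing the whole list of chunks up front.
    for raw_response in pool.imap(worker, read_test_data(TEST_FILE)):
        http_response = json.loads(raw_response)
        for i in http_response['result']:
            if i['status'] == "skipped" and i['reasons'][0]['code'] == "1004":
                skipped_count += 1
    pool.close()
    pool.join()
    print("Number of records skipped: " + str(skipped_count))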

Edit:

As I mentioned in a comment above, this use-case looks like it's I/O-bound, which means you may see better performance by using a multiprocessing.dummy.Pool (which uses a thread pool instead of a process pool). Give both a try and see which is faster.
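A minimal sketch of the thread-pool variant might look like the following. One caveat worth flagging: threads share objects directly instead of receiving pickled copies, and read_test_data() clears and reuses a single list between yields, so the sketch copies each chunk before handing it out. The cap of 20 threads follows the "no more than 20 or so" goal from the question.

from multiprocessing.dummy import Pool as ThreadPool

if __name__ == "__main__":
    # Copy each chunk as it is produced, because the generator reuses one list object.
    chunks = [list(chunk) for chunk in read_test_data(TEST_FILE)]

    pool = ThreadPool(20)  # at most 20 requests in flight at once
    responses = pool.map(worker, chunks)
    pool.close()
    pool.join()

    # The skip-counting loop from the Pool example above works unchanged on 'responses'.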
