Multithreading in python using queue

I am quite new to Python and I need to implement multithreading in my code.

I have a huge .csv file (a million lines) as my input. For each line I read it, make a REST request, do some processing, and write the output to another file. The ordering of lines in the input/output files does matter. Right now I am doing this line by line. I want to run the same code in parallel, i.e. read 20 lines of input from the .csv file and make the REST calls in parallel so that my program is faster.

I have been reading up on http://docs.python.org/2/library/queue.html , but I have also read about the Python GIL issue, which suggests the code will not run faster even with multithreading. Is there any other way to achieve multithreading in a simple way?

Can you break the .csv file into multiple smaller files? If you can, then you could have another program run multiple copies of your processor.
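For the splitting step, here is a minimal sketch (the chunk size and the file1/file2 naming are assumptions, chosen to match the launcher below):

# split input.csv into numbered chunk files: file1, file2, ...
chunk_size = 50000  # assumption: tune to your data and worker count
with open('input.csv', 'rb') as infile:
    chunk, n = [], 0
    for lineno, line in enumerate(infile, 1):
        chunk.append(line)
        if lineno % chunk_size == 0:
            n += 1
            with open('file' + str(n), 'wb') as out:
                out.writelines(chunk)
            chunk = []
    if chunk:  # write any leftover lines
        n += 1
        with open('file' + str(n), 'wb') as out:
            out.writelines(chunk)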

Say the files were all named file1, file2, etc., and your processor script (processer.py below) took the filename as an argument. You could have:

import subprocess
import os
import signal

numfiles = 20  # assumption: however many chunk files you created

for i in range(1, numfiles + 1):
    # launch a separate Python process for each chunk file
    program = subprocess.Popen(['python', 'processer.py', 'file' + str(i)])
    pid = program.pid

    # if you need to kill a process later:
    # os.kill(pid, signal.SIGINT)

Python releases the GIL on I/O. If most of the time is spent making REST requests, you could use threads to speed up processing:

try:
    from gevent.pool import Pool  # $ pip install gevent
    import gevent.monkey; gevent.monkey.patch_all()  # patch stdlib to cooperate with gevent
except ImportError:  # fall back on using threads
    from multiprocessing.dummy import Pool

import urllib2

def process_line(line):
    url = line.strip()  # drop the trailing newline
    try:
        return urllib2.urlopen(url).read(), None
    except EnvironmentError as e:
        return None, e

with open('input.csv', 'rb') as infile, open('output.txt', 'wb') as outfile:
    pool = Pool(20)  # use 20 concurrent connections
    for result, error in pool.imap_unordered(process_line, infile):
        if error is None:
            outfile.write(result)

If the input/output order should be the same, you could use imap instead of imap_unordered.
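Only the loop changes:

for result, error in pool.imap(process_line, infile):  # results are yielded in input order
    if error is None:
        outfile.write(result)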

If your program is CPU-bound, you could use multiprocessing.Pool(), which creates multiple processes instead.
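A minimal sketch of the process-based variant, reusing the process_line function above (which must live at module level so worker processes can pickle and import it; the pool size is an assumption):

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(4)  # assumption: 4 worker processes, roughly one per core
    with open('input.csv', 'rb') as infile, open('output.txt', 'wb') as outfile:
        for result, error in pool.imap(process_line, infile):
            if error is None:
                outfile.write(result)
    pool.close()
    pool.join()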

See also Python Interpreter blocks Multithreaded DNS requests?

This answer shows how to create a thread pool manually using the threading + Queue modules.
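A minimal sketch of that idea, keeping output in input order by tagging each line with its index (the worker count and the reuse of process_line from above are assumptions):

import threading
import Queue  # named "queue" in Python 3

def worker(in_q, results):
    while True:
        item = in_q.get()
        if item is None:  # sentinel: no more work
            break
        index, line = item
        results[index] = process_line(line)  # assumes process_line from above

in_q = Queue.Queue()
results = {}
num_workers = 20  # assumption: match your desired concurrency
threads = [threading.Thread(target=worker, args=(in_q, results))
           for _ in range(num_workers)]
for t in threads:
    t.start()

with open('input.csv', 'rb') as infile:
    for index, line in enumerate(infile):
        in_q.put((index, line))
for _ in threads:
    in_q.put(None)  # one sentinel per worker
for t in threads:
    t.join()

with open('output.txt', 'wb') as outfile:
    for index in sorted(results):  # restore input order
        result, error = results[index]
        if error is None:
            outfile.write(result)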
