
multiprocessing issue

So I am just trying to multiprocess and read each line in a text doc. There are 660918 lines, all of which I know to be the same length. With the following code, though, the lengths of the lines seem to change, and I cannot figure out why.

import multiprocessing

class Worker(multiprocessing.Process):
    def __init__(self,in_q):
        multiprocessing.Process.__init__(self)
        self.in_q = in_q
    def run(self):      
        while True:
            try:
                in_q.get()
                temp_line = short_file.readline()
                temp_line = temp_line.strip().split()
                print len(temp_line)
                self.in_q.task_done()
            except:                              
                break     

if __name__ == "__main__":
    num_proc = 10
    lines = 100000 #660918 is how many lines there actually are
    in_q = multiprocessing.JoinableQueue()
    File = 'HGDP_FinalReport_Forward.txt'
    short_file = open(File)

    for i in range(lines):
        in_q.put(i)    

    for i in range(num_proc):
        worker = Worker(in_q)
        worker.start()
    in_q.join() 

You're opening a file in the main process, then reading from that file in the child processes. You can't do that.

Deep under the covers, the file object is effectively a raw file handle and a memory buffer. Each process shares the file handle, but each one has its own memory buffer.

Let's say all of the lines are 50 bytes each, and the memory buffer is 4096 bytes.

Process 1 calls readline, which reads bytes 0-4095 from the file into its buffer, then scans that buffer for a newline, which is 50 bytes in, and it returns the first 50 bytes. So far, so good.

Process 2 calls readline, which reads bytes 4096-8191 from the file into its buffer, then scans that buffer for a newline. With 50-byte lines, the first newline after byte 4096 is at byte 4099, just 4 bytes into the buffer, so it returns those 4 bytes as a "line".

And so on.
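The arithmetic above can be checked with a small, self-contained model of the buffered reader: each "process" gets its own 4096-byte buffer over the same byte stream, and readline scans its own buffer for the first newline. The 50-byte line length and 4096-byte buffer are the figures from the explanation; the data itself is synthetic.

```python
LINE_LEN = 50    # each line is 50 bytes, including the trailing '\n'
BUF_SIZE = 4096  # each process's private read buffer

# 200 identical 50-byte lines stand in for the real file's contents.
data = (b'x' * (LINE_LEN - 1) + b'\n') * 200

def first_readline(offset):
    """What readline() returns for a process whose buffer starts at `offset`."""
    buf = data[offset:offset + BUF_SIZE]
    newline = buf.index(b'\n')
    return buf[:newline + 1]

# Process 1's buffer starts at byte 0: it finds a full 50-byte line.
print(len(first_readline(0)))         # 50

# Process 2's buffer starts at byte 4096, which is mid-line
# (4096 % 50 == 46), so only 4 bytes remain before the next newline.
print(len(first_readline(BUF_SIZE)))  # 4
```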

You could theoretically get around this by doing unbuffered I/O, but really, why? Why not just read the lines in your main process? Besides avoiding this problem, that would also probably improve parallelism: the I/O is inherently sequential, so all of those processes will spend most of their time blocked on I/O, which means they're not doing you any good.

As a side note, near the top of the loop in run, you're doing in_q.get() instead of self.in_q.get(). (That happens to work because in_q is a global variable that never goes away and self.in_q is just a copy of it, but you don't want to rely on that.)

So, I changed it to use Pool, and it seems to work. Is the following better?

import multiprocessing as mp

def pro(temp_line):
    temp_line = temp_line.strip().split()
    return len(temp_line)

if __name__ == "__main__":
    with open("HGDP_FinalReport_Forward.txt") as lines:
        pool = mp.Pool(processes = 10)
        t = pool.map(pro, lines.readlines())
    print t
