Python multiprocessing: loop runs extra times

Question

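I am trying to append to a CSV file using multiprocessing. I have multiple CSV files to loop over. The function works with a plain for loop, but not with multiprocessing. I hope someone can shed some light on this.

My function code is as follows: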

def read_write2(j, lock):
    #i = 2
    with open('C:\\Users\\user\\Documents\\filereader\\FileFolder\\sample_new{}.csv'.format(j), "r") as a_file: #input file
        #i = i + 1
        with open('samples2.csv','a') as file: #output file
            for line in a_file:
                lock.acquire()
                stripped_line = line.strip()
                a = len(stripped_line)
                if "©" in stripped_line or "flow" in stripped_line or a>254:
                    pass
                else:
                    file.write(stripped_line)
                    file.write("\n")
                lock.release()

My multiprocessing code is as follows:

if __name__ == "__main__":
    lock = Lock()
    processes = []

    for i in range(2,fileno+1):
        print(i)
        process = Process(target=read_write2, args=(i,lock)) #creating a new process
        processes.append(process) #appending process to a processes list

    for process in processes:
        print(process)
        process.start()

    for process in processes: #loop over list to join process
        process.join() #process will finish before moving on with the script

The output is as follows:

7
2
3
4
5
6
7
<Process name='Process-1' parent=24328 initial>
<Process name='Process-2' parent=24328 initial>
<Process name='Process-3' parent=24328 initial>
<Process name='Process-4' parent=24328 initial>
<Process name='Process-5' parent=24328 initial>
<Process name='Process-6' parent=24328 initial>
7
7
7
7
7
7

Thank you.

Answer 1

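Yes, that won't work. Each of your processes has a different "handle" to the file, because you have opened it multiple times. You will need to open it once and then pass it to the workers.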

Answer 2

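As mentioned, you are opening and writing to the same file multiple times, and doing that without file locking or synchronization leads to trouble. This is probably because the position in the file is not updated between processes, so one process does not know that another has written to the file and starts writing from the same position as the others. There are better ways to do this, but to keep the change to your code as small as possible, I suggest holding the lock while opening, writing, and closing the output file, so the order looks like this: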

with open('C:\\Users\\user\\Documents\\filereader\\FileFolder\\sample_new{}.csv'.format(j), "r") as a_file: #input file
    for line in a_file:
        if ...:
            ...
        else:
            lock.acquire()
            with open('samples2.csv','a') as file: #output file
                ...
            lock.release()

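Although this adds a lot of disk I/O overhead, it should be the smallest change to your code that makes it work with multiprocessing. The whole function would be: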

def read_write2(j, lock):
    with open('C:\\Users\\user\\Documents\\filereader\\FileFolder\\sample_new{}.csv'.format(j), "r") as a_file: #input file
        for line in a_file:
            stripped_line = line.strip()
            a = len(stripped_line)
            if "©" in stripped_line or "flow" in stripped_line or a>254:
                pass
            else:
                lock.acquire()
                with open('samples2.csv','a') as file: #output file
                    file.write(stripped_line)
                    file.write("\n")
                lock.release()
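The acquire/release pair above can also be written with the lock as a context manager, which releases the lock even if the write raises. This is only a sketch of the same function under some assumptions: the in_path and out_path parameters replace the hard-coded paths, keep_line is a hypothetical helper that holds the question's filter, and encoding="utf-8" is a guess about the files.

```python
def keep_line(stripped_line):
    """Same filter as in the question: skip lines with ©, "flow", or > 254 chars."""
    return not ("©" in stripped_line
                or "flow" in stripped_line
                or len(stripped_line) > 254)


def read_write2(in_path, out_path, lock):
    # in_path / out_path are illustrative parameters (not in the original code);
    # encoding="utf-8" is an assumption about the input and output files.
    with open(in_path, "r", encoding="utf-8") as a_file:  # input file
        for line in a_file:
            stripped_line = line.strip()
            if keep_line(stripped_line):
                # Entering the block acquires the lock; leaving it releases
                # the lock, even if the write raises an exception.
                with lock:
                    with open(out_path, "a", encoding="utf-8") as file:  # output file
                        file.write(stripped_line + "\n")
```

It would be started the same way as before, for example Process(target=read_write2, args=(in_path, 'samples2.csv', lock)), with lock being the multiprocessing Lock created in the main block.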

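PS Depending on many factors, such as the number of files, their size, and the number of lines written, it may be more efficient to have each process write to its own file and then collate the outputs into one file in the main loop. This saves a lot of file opening and closing and removes the need for a lock. For example, rewrite the function as follows: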

def read_write2(j):
    with open('C:\\Users\\user\\Documents\\filereader\\FileFolder\\sample_new{}.csv'.format(j), "r") as a_file: #input file
        with open('samples2_{}.csv'.format(j),'a') as file: #output file
            for line in a_file:
                stripped_line = line.strip()
                a = len(stripped_line)
                if "©" in stripped_line or "flow" in stripped_line or a>254:
                    pass
                else:
                    file.write(stripped_line)
                    file.write("\n")

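Then, in the main code (under if __name__ == "__main__":), change args=(i,lock) to args=(i,) when creating each Process, since read_write2 no longer takes a lock, and replace the following code: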

for process in processes: #loop over list to join process
    process.join() #process will finish before moving on with the script

with this:

with open('samples2.csv', 'w') as out_f:
    for e, process in enumerate(processes):
        process.join()
        with open('samples2_{}.csv'.format(e+2), 'r') as in_f:
            out_f.write(in_f.read())    # NOTE: this is highly inefficient, and may consume too much memory. But that's not relevant to the question at hand.
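If the per-process files are large, the in_f.read() call noted above can be swapped for a streamed copy. A minimal sketch, assuming the part files already exist and all processes have been joined; merge_parts is a hypothetical helper name and encoding="utf-8" is an assumption:

```python
import shutil


def merge_parts(part_paths, out_path):
    """Concatenate the per-process files into one output file, in the given order."""
    with open(out_path, "w", encoding="utf-8") as out_f:
        for part_path in part_paths:
            with open(part_path, "r", encoding="utf-8") as in_f:
                # copyfileobj copies in chunks, so a whole part file is
                # never held in memory at once.
                shutil.copyfileobj(in_f, out_f)
```

In the main block this could be called after the joins, for example merge_parts(['samples2_{}.csv'.format(i) for i in range(2, fileno+1)], 'samples2.csv').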

Answer 3

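Thanks, everyone, for your input. It helped me find the answer.

The answer was actually to put everything inside main. It seems to work fine, and it has resolved the error. I am checking thousands and thousands of URLs.

I put all the function declarations inside if __name__ == "__main__": and was able to resolve it.

Thanks again, everyone. :)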
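A likely reason this helps, and why the output shows six extra 7s: with the spawn start method (the default on Windows), each child process re-imports the main module, so any statement at module level runs once per child. The sketch below shows the usual layout under some assumptions: the question does not show where fileno is set or printed, so fileno = 7 and the work function are placeholders. Note that the function used as target stays at module level, because a spawned child has to import it.

```python
from multiprocessing import Process


def work(i):
    # Placeholder for read_write2; defined at module level so that a
    # spawned child process can import it.
    return i * 2


def main():
    fileno = 7  # assumed value, taken from the first line of the output
    print(fileno)  # inside main(), so it runs only in the parent process

    processes = [Process(target=work, args=(i,)) for i in range(2, fileno + 1)]
    for process in processes:
        process.start()
    for process in processes:
        process.join()


if __name__ == "__main__":
    # Child processes import this module under a different name,
    # so nothing below this guard runs in them.
    main()
```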
