
Write to the same file with multiprocessing

I've been trying for a long time to write the results to my file, but since it's a multithreaded task, the lines end up written to the file in a mixed, interleaved way.

The code that appends to the file is in the get_url function.

This function is launched via pool.submit(get_url, line):

import requests
from concurrent.futures import ThreadPoolExecutor
import fileinput
from bs4 import BeautifulSoup
import traceback
import threading


from requests.packages.urllib3.exceptions import InsecureRequestWarning
import warnings

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

count_requests = 0
host_error = 0

def get_url(url):

    try:
        global count_requests
        result_request = requests.get(url, verify=False)
        soup = BeautifulSoup(result_request.text, 'html.parser')

   
        with open('outfile.txt', 'a', encoding="utf-8") as f:
            f.write(soup.title.get_text())
            
        count_requests = count_requests + 1
    except:
        global host_error
        host_error = host_error + 1




with ThreadPoolExecutor(max_workers=100) as pool:
    for line in fileinput.input(['urls.txt']):
        pool.submit(get_url,line)
        print(str("requests success : ") + str(count_requests) + str(" | requests error ") + str(host_error), end='\r')
    


    

This is what the output looks like:

google.com - Google

w3schools.com - W3Schools Online Web Tutorials

You can use multiprocessing.Pool and pool.imap_unordered to receive the processed results and write them to the file. That way the results are written only inside the main thread and won't be interleaved. For example:

import requests
import multiprocessing
from bs4 import BeautifulSoup


def get_url(url):
    # do your processing here:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return soup.title.text


if __name__ == "__main__":
    # read urls from file or other source:
    urls = ["http://google.com", "http://yahoo.com"]

    with multiprocessing.Pool() as p, open("result.txt", "a") as f_out:
        for result in p.imap_unordered(get_url, urls):
            print(result, file=f_out)
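
If threads are preferred over processes (the work here is mostly network I/O), the same pattern should also work with multiprocessing.pool.ThreadPool, which exposes the same imap_unordered API. A minimal sketch, assuming urls.txt contains one URL per line:

import requests
from multiprocessing.pool import ThreadPool
from bs4 import BeautifulSoup


def get_url(url):
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return soup.title.text


if __name__ == "__main__":
    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    # Results are still written only from the main thread, so lines never interleave.
    with ThreadPool(processes=20) as pool, open("result.txt", "a") as f_out:
        for result in pool.imap_unordered(get_url, urls):
            print(result, file=f_out)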

I agree with Andrej Kesely that we should not write to the file within get_url. Here is my approach:

from concurrent.futures import ThreadPoolExecutor, as_completed

def get_url(url):
    # Processing...
    title = ...
    return url, title


if __name__ == "__main__":
    with open("urls.txt") as stream:
        urls = [line.strip() for line in stream]

    with ThreadPoolExecutor() as executor:
        urls_and_titles = executor.map(get_url, urls)

    # Exiting the with block: all tasks are done
    with open("outfile.txt", "w", encoding="utf-8") as stream:
        for url, title in urls_and_titles:
            stream.write(f"{url},{title}\n")

This approach waits until all tasks have completed before writing out the results. If we want to write out each result as soon as its task finishes:

from concurrent.futures import ThreadPoolExecutor, as_completed

...

if __name__ == "__main__":
    with open("urls.txt") as stream:
        urls = [line.strip() for line in stream]

    with ThreadPoolExecutor() as executor, open("outfile.txt", "w", encoding="utf-8") as stream:
        futures = [
            executor.submit(get_url, url)
            for url in urls
        ]
        for future in as_completed(futures):
            url, title = future.result()
            stream.write(f"{url},{title}\n")

The as_completed() function yields the Future objects in the order in which they complete, so the tasks that finish first are handled first.
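
To see that behaviour in isolation, here is a small self-contained sketch (the sleep durations are made up just for the demonstration):

import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def work(delay):
    time.sleep(delay)
    return delay


if __name__ == "__main__":
    with ThreadPoolExecutor() as executor:
        futures = [executor.submit(work, d) for d in (0.3, 0.1, 0.2)]
        for future in as_completed(futures):
            # Prints 0.1, 0.2, 0.3: completion order, not submission order.
            print(future.result())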

In conclusion, the key here is for the worker function get_url to return a value and not write to the file itself; the writing is done in the main thread.
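
Tying this back to the original script, here is a hedged sketch of how the success/error counters from the question could also be kept in the main thread (the broad except mirrors the bare except in the question and is an assumption about what should count as an error):

from concurrent.futures import ThreadPoolExecutor, as_completed

import requests
from bs4 import BeautifulSoup


def get_url(url):
    # The worker only fetches and parses; it never touches the output file.
    soup = BeautifulSoup(requests.get(url, verify=False).content, "html.parser")
    return url, soup.title.get_text()


if __name__ == "__main__":
    count_requests = 0
    host_error = 0

    with open("urls.txt") as stream:
        urls = [line.strip() for line in stream if line.strip()]

    with ThreadPoolExecutor(max_workers=100) as executor, \
            open("outfile.txt", "w", encoding="utf-8") as out:
        futures = [executor.submit(get_url, url) for url in urls]
        for future in as_completed(futures):
            try:
                url, title = future.result()
            except Exception:
                # Broad catch mirrors the bare except in the question.
                host_error += 1
                continue
            count_requests += 1
            out.write(f"{url},{title}\n")

    print(f"requests success: {count_requests} | requests error: {host_error}")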
