Write to the same file with multiprocessing
I've been trying for a long time to write the results to my file, but since it's a multithreaded task, the lines end up interleaved in the file.
The code that appends to the file is in the get_url function, and that function is launched via pool.submit(get_url, line).
import requests
from concurrent.futures import ThreadPoolExecutor
import fileinput
from bs4 import BeautifulSoup
import traceback
import threading
from requests.packages.urllib3.exceptions import InsecureRequestWarning
import warnings
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')
count_requests = 0
host_error = 0
def get_url(url):
    try:
        global count_requests
        result_request = requests.get(url, verify=False)
        soup = BeautifulSoup(result_request.text, 'html.parser')
        with open('outfile.txt', 'a', encoding="utf-8") as f:
            f.write(soup.title.get_text())
            count_requests = count_requests + 1
    except:
        global host_error
        host_error = host_error + 1

with ThreadPoolExecutor(max_workers=100) as pool:
    for line in fileinput.input(['urls.txt']):
        pool.submit(get_url, line)
        print(str("requests success : ") + str(count_requests) + str(" | requests error ") + str(host_error), end='\r')
This is what the output looks like:
google.com - Google
w3schools.com - W3Schools Online Web Tutorials
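As an aside, if the workers must keep writing to the file themselves, one common way to stop the appends from interleaving is to guard the write with a threading.Lock. This is only a sketch; the names write_lock and write_line are illustrative and not part of the original code:

```python
import threading

# a single lock shared by all worker threads (illustrative name)
write_lock = threading.Lock()

def write_line(path, text):
    # only one thread at a time may append, so whole lines are written atomically
    with write_lock:
        with open(path, "a", encoding="utf-8") as f:
            f.write(text + "\n")
```

The answers below avoid the lock entirely by doing all writing in one place.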
You can use multiprocessing.Pool and pool.imap_unordered to receive the processed results and write them to the file. That way the results are written only in the main process and won't be interleaved. For example:
import requests
import multiprocessing
from bs4 import BeautifulSoup
def get_url(url):
    # do your processing here:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    return soup.title.text

if __name__ == "__main__":
    # read urls from file or other source:
    urls = ["http://google.com", "http://yahoo.com"]

    with multiprocessing.Pool() as p, open("result.txt", "a") as f_out:
        for result in p.imap_unordered(get_url, urls):
            print(result, file=f_out)
I agree with Andrej Kesely that we should not write to the file within get_url. Here is my approach:
from concurrent.futures import ThreadPoolExecutor, as_completed
def get_url(url):
    # Processing...
    title = ...
    return url, title

if __name__ == "__main__":
    with open("urls.txt") as stream:
        urls = [line.strip() for line in stream]

    with ThreadPoolExecutor() as executor:
        urls_and_titles = executor.map(get_url, urls)

    # Exiting the with block: all tasks are done
    with open("outfile.txt", "w", encoding="utf-8") as stream:
        for url, title in urls_and_titles:
            stream.write(f"{url},{title}\n")
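Note that executor.map yields results in input order, even when later tasks finish first. A small self-contained sketch of that behavior (the work function and the sleep times are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor
import time

def work(x):
    # later inputs sleep less and finish first,
    # but map still yields results in input order
    time.sleep(0.05 * (3 - x))
    return x

with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(work, [0, 1, 2]))

print(results)  # [0, 1, 2] -- input order preserved
```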
This approach waits until all tasks have completed before writing out the results. If we want to write out each result as soon as its task finishes:
from concurrent.futures import ThreadPoolExecutor, as_completed
...
if __name__ == "__main__":
    with open("urls.txt") as stream:
        urls = [line.strip() for line in stream]

    with ThreadPoolExecutor() as executor, open("outfile.txt", "w", encoding="utf-8") as stream:
        futures = [
            executor.submit(get_url, url)
            for url in urls
        ]
        for future in as_completed(futures):
            url, title = future.result()
            stream.write(f"{url},{title}\n")
The as_completed() function yields the Future objects as they complete, so the tasks that finish first are written out first.
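A small self-contained sketch of that completion-order behavior (the work function and the sleep times are made up; with three workers, the task with the shortest sleep completes first):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def work(x):
    # larger x sleeps less, so it completes first
    time.sleep(0.05 * (3 - x))
    return x

with ThreadPoolExecutor(max_workers=3) as executor:
    futures = [executor.submit(work, x) for x in [0, 1, 2]]
    completed = [f.result() for f in as_completed(futures)]

print(completed)  # completion order, e.g. [2, 1, 0]
```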
In conclusion, the key here is for the worker function get_url to return a value and not write to the file itself. The writing is then done in the main thread.