Open a large file in Python without running out of RAM
I'm trying to open a file with 5 million entries in Python.
My code needs to take the input line by line, in order (being off by a few dozen lines because of the ThreadPoolExecutor is not a problem), and hand each line to the ThreadPoolExecutor, which sends it to the get_url function.
Maybe the urls variable is just too big — should the ThreadPoolExecutor instead retrieve the lines from the file one by one, by keeping a line counter? I tried that and failed (this is my first time using a ThreadPoolExecutor):
with open("1_1.txt") as stream:
    urls = [line.strip() for line in stream]

with ThreadPoolExecutor(max_workers=50) as pool:
    pool.map(get_url, urls)
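For reference, one way to do what the question describes (let the pool pull lines from the file on demand instead of materializing a 5-million-entry list) is to iterate the file lazily and cap how many submitted tasks may be pending at once with a semaphore. A minimal sketch, with a stubbed get_url and a tiny generated input file standing in for the real ones:

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Semaphore

# Hypothetical stand-in for the question's get_url; the real one does an
# HTTP request and pushes the result onto the writer queue.
results = []
def get_url(url):
    results.append(url)

# Tiny generated sample file; the real script reads the existing
# 5-million-line "1_1.txt".
with open("1_1.txt", "w") as f:
    f.write("http://a.example\nhttp://b.example\n")

MAX_IN_FLIGHT = 100   # at most this many lines held in memory at once
slots = Semaphore(MAX_IN_FLIGHT)

with open("1_1.txt") as stream, ThreadPoolExecutor(max_workers=50) as pool:
    for line in stream:
        slots.acquire()  # blocks once MAX_IN_FLIGHT tasks are pending
        fut = pool.submit(get_url, line.strip())
        fut.add_done_callback(lambda _f: slots.release())  # free a slot
```

Because the semaphore blocks the reading loop, at most MAX_IN_FLIGHT lines (plus whatever the workers currently hold) ever sit in memory, regardless of the file size.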
The full code of my script:
import requests
from concurrent.futures import ThreadPoolExecutor
import fileinput
from bs4 import BeautifulSoup
import traceback
from threading import Thread
from requests.packages.urllib3.exceptions import InsecureRequestWarning
import warnings
from random import random
from queue import Queue

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

count_requests = 0
host_error = 0

def get_url(url):
    global queue
    global count_requests
    global host_error
    try:
        result_request = requests.get(url, verify=False, timeout=40)
        soup = BeautifulSoup(result_request.text, 'html.parser')
        title = soup.title.get_text().splitlines(False)
        title = str(title)
        title = title[0:10000]
        count_requests = count_requests + 1
        queue.put(f'{url} - {title} \n')
    except:
        queue.put(f'FAILED : {url} \n')
        host_error = host_error + 1

# dedicated file writing task
def file_writer(filepath, queue):
    global count_requests
    # open the file
    with open(filepath, 'a', encoding="utf-8") as file:
        # run until the exit signal (None) is received
        while True:
            # get a line of text from the queue
            line = queue.get()
            # check if we are done
            if line is None:
                # exit the loop
                break
            # write it to file
            file.write(line)
            # flush the buffer
            file.flush()
            # mark the unit of work complete
            queue.task_done()
    # mark the exit signal as processed, after the file was closed
    queue.task_done()

# create the shared queue
queue = Queue()
# define the shared file path
filepath = 'output.txt'
# create and start the file writer thread
writer_thread = Thread(target=file_writer, args=(filepath, queue), daemon=True)
writer_thread.start()
# wait for all tasks in the queue to be processed
queue.join()

with open("1_1.txt") as stream:
    urls = [line.strip() for line in stream]

with ThreadPoolExecutor(max_workers=1) as pool:
    pool.map(get_url, urls)
This post from 10 hours ago, RAM continuously increasing with concurrent.futures.ThreadPoolExecutor, asks the same question and provides the same, or at least very similar, code.
In case anyone solves this, please give the other poster some love too!
Sorry for posting this as an answer, but my account is too new to comment.
Try using the submit() method:
with open("1_1.txt") as stream:
    with ThreadPoolExecutor(max_workers=50) as pool:
        for url in stream:
            pool.submit(get_url, url.strip())
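A caveat worth noting (not from the original answer): submit() avoids building the urls list, but each submitted call still lands on the executor's internal, unbounded work queue, and the loop reads lines far faster than 50 workers can fetch pages, so memory can still grow. One alternative is to dispatch the file in fixed-size batches and wait for each batch to finish before reading more. A sketch under those assumptions, with a stubbed get_url and a generated sample file:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Stub standing in for the question's get_url (the real one does an HTTP GET).
results = []
def get_url(url):
    results.append(url)

# Tiny generated input standing in for the real 5-million-line "1_1.txt".
with open("1_1.txt", "w") as f:
    f.write("\n".join(f"http://host{i}.example" for i in range(10)))

BATCH = 4  # lines held in memory at once; a real run might use a few thousand
with open("1_1.txt") as stream, ThreadPoolExecutor(max_workers=50) as pool:
    while True:
        batch = [line.strip() for line in islice(stream, BATCH)]
        if not batch:
            break
        # map() returns a lazy iterator; draining it waits for the batch
        list(pool.map(get_url, batch))
```

The trade-off is that workers idle briefly at the end of each batch; the semaphore-based variant avoids that at the cost of slightly more bookkeeping.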