
Open a large file in Python without running out of RAM

I am trying to open a file with 5 million entries in Python.

My code needs to take the input line by line, in order (being a few dozen lines off from the ThreadPoolExecutor is not a problem); the lines are then picked up by the ThreadPoolExecutor and sent to the get_url function.

Maybe the urls variable is too large; should the ThreadPoolExecutor itself retrieve the lines from the file one by one, by keeping a line counter? I have tried to do this but failed (it is my first time using a ThreadPoolExecutor).

with open("1_1.txt") as stream:
    urls = [line.strip() for line in stream]

with ThreadPoolExecutor(max_workers=50) as pool:
    pool.map(get_url, urls)
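For illustration, the sketch below is roughly the shape of the one-by-one idea I have in mind, with the workers pulling lines from the file themselves instead of receiving a pre-built list (worker is just a placeholder name, and it would reuse the get_url function from my full script below):

from concurrent.futures import ThreadPoolExecutor
from threading import Lock

line_lock = Lock()

def worker(stream):
    # each worker pulls its own next line, so the 5 million URLs are never all in memory
    while True:
        with line_lock:
            line = stream.readline()
        if not line:  # end of file
            break
        url = line.strip()
        if url:
            get_url(url)

with open("1_1.txt") as stream:
    with ThreadPoolExecutor(max_workers=50) as pool:
        for _ in range(50):
            pool.submit(worker, stream)

Is something along these lines the right direction?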

The complete code of my script:

import requests
from concurrent.futures import ThreadPoolExecutor
import fileinput
from bs4 import BeautifulSoup
import traceback
from threading import Thread


from requests.packages.urllib3.exceptions import InsecureRequestWarning
import warnings

from random import random
from queue import Queue


requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

count_requests = 0
host_error = 0


def get_url(url):
    global queue
    global count_requests
    global host_error

    try:
        result_request = requests.get(url, verify=False, timeout=40)
        soup = BeautifulSoup(result_request.text, 'html.parser')

        # grab the page title, flatten it to a single string and truncate it
        title = soup.title.get_text().splitlines(False)
        title = str(title)
        title = title[0:10000]

        count_requests = count_requests + 1
        queue.put(f'{url} - {title} \n')

    except Exception:
        queue.put(f'FAILED : {url} \n')
        host_error = host_error + 1


# dedicated file writing task
def file_writer(filepath, queue):
    global count_requests
    # open the file
    with open(filepath, 'a', encoding="utf-8") as file:
        # run until the exit sentinel (None) is received
        while True:
            # get a line of text from the queue
            line = queue.get()
            # check if we are done
            if line is None:
                # exit the loop
                break
            # write it to file
            file.write(line)
            # flush the buffer
            file.flush()
            # mark the unit of work complete
            queue.task_done()
    # mark the exit signal as processed, after the file was closed
    queue.task_done()


# create the shared queue
queue = Queue()
# define the shared file path
filepath = 'output.txt'
# create and start the file writer thread
writer_thread = Thread(target=file_writer, args=(filepath,queue), daemon=True)
writer_thread.start()



# read every URL into memory up front
with open("1_1.txt") as stream:
    urls = [line.strip() for line in stream]

with ThreadPoolExecutor(max_workers=1) as pool:
    pool.map(get_url, urls)

# wait for all queued lines to be written, then signal the writer thread to stop
queue.join()
queue.put(None)
writer_thread.join()

This post from 10 hours ago, RAM continuously increasing with concurrent.futures.ThreadPoolExecutor, asks the same question and provides the same, or at least very similar, code.

In case someone solves this, they can show the other poster some love too!

Sorry for posting this as an answer, but my account is too new to comment.

Try using the submit() method:

with open("1_1.txt") as stream:
    with ThreadPoolExecutor(max_workers=50) as pool:
        for url in stream:
            pool.submit(get_url, url.strip())
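The Future objects created by submit() can still pile up for every URL that has not finished yet, so with 5 million lines it may also help to read and submit the file in bounded chunks. A rough sketch of that idea, reusing the question's get_url (process_in_chunks and chunk_size are just illustrative names):

from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def process_in_chunks(path, worker, chunk_size=1000, max_workers=50):
    # read the file lazily and keep at most chunk_size lines (and futures) alive at once
    with open(path) as stream, ThreadPoolExecutor(max_workers=max_workers) as pool:
        while True:
            chunk = [line.strip() for line in islice(stream, chunk_size)]
            if not chunk:
                break
            # iterating the map() result waits for the whole chunk before reading more lines
            for _ in pool.map(worker, chunk):
                pass

# e.g. process_in_chunks("1_1.txt", get_url)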

