
Memory leak while parsing HTML page source with BeautifulSoup & Requests

So the basic idea is to make requests to some list of URLs and parse the text out of those page sources, stripping the HTML tags and scripts with BeautifulSoup. Python version is 2.7.

The problem is that the parser function keeps accumulating memory on every request; the memory footprint grows steadily.

def get_text_from_page_source(page_source):
    soup = BeautifulSoup(open(page_source),'html.parser')
#     soup = BeautifulSoup(page_source,"lxml")
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()    # rip it out
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    # print text
    return text

The memory leaks even when parsing a local text file. For example:

#request 1
response = requests.get(url,timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content) #100 MB

#request 2
response = requests.get(url,timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content) #150 MB
#request 3
response = requests.get(url,timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content) #300 MB


You can try calling the garbage collector:

import gc
response.close()
response = None
gc.collect()
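
For example, a rough sketch of how that cleanup could sit inside the request loop; note that `urls` and `timeout` here are assumed names from your own code, not from the question:

import gc

import requests

for url in urls:  # assumed: your list of URLs
    response = requests.get(url, timeout=timeout)  # assumed: your timeout value
    parsed_string_from_html_source = get_text_from_page_source(response.content)
    # ... use the parsed text here ...
    response.close()   # release the connection
    response = None    # drop the last reference to the response object
    gc.collect()       # force a collection cycle before the next request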

This might also help you: Python high memory usage with BeautifulSoup

You can try running soup.decompose before the end of your get_text_from_page_source function to destroy the tree, as sketched below.
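
A minimal sketch of that change, reusing the function from the question with the parse tree destroyed just before returning:

from bs4 import BeautifulSoup

def get_text_from_page_source(page_source):
    soup = BeautifulSoup(page_source, 'html.parser')
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()
    # get text, then collapse whitespace and blank lines as before
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    soup.decompose()  # destroy the tree so its memory can be reclaimed
    return text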

Also, if you are opening a text file instead of passing the request content directly, as can be seen here:

soup = BeautifulSoup(open(page_source),'html.parser')

remember to close it when you are done. For brevity, you can change that line to:

with open(page_source, 'r') as html_file:
    soup = BeautifulSoup(html_file.read(),'html.parser')
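
The with block guarantees the file handle is closed as soon as the block exits, even if parsing raises an exception, so file descriptors no longer accumulate across calls.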
