
Python script uses all RAM

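I have a Python script that parses email addresses out of large documents. The script uses all the RAM on my machine and locks it up to the point where I have to restart it. I was wondering if there is a way to limit this, or even to pause for a bit after it finishes reading one file and produces some output. Any help would be greatly appreciated.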

#!/usr/bin/env python

# Extracts email addresses from one or more plain text files.
# Notes:
# - Does not save to file (pipe the output to a file if you want it saved).
# - Does not check for duplicates (which can easily be done in the terminal).
# Twitter @Critical24 - DefensiveThinking.io 

from optparse import OptionParser
import os.path
import re

regex = re.compile(("([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"

def file_to_str(filename):
    """Returns the contents of filename as a string."""
    with open(filename, encoding='utf-8') as f:  # Added encoding='utf-8'
        return f.read().lower()  # Case is lowered to prevent regex mismatches.

def get_emails(s):
    """Returns an iterator of matched emails found in string s."""
    # Removing lines that start with '//' because the regular expression
    # mistakenly matches patterns like 'http://foo@bar.com' as '//foo@bar.com'.
    return (email[0] for email in re.findall(regex, s) if not email[0].startswith('//'))

import os
not_parseble_files = ['.txt', '.csv']
for root, dirs, files in os.walk('.'):  # Recursively walk all subdirectories.
    for file in files:
        _, file_ext = os.path.splitext(file)  # Get the file's extension.
        file_path = os.path.join(root, file)
        if file_ext in not_parseble_files:  # Skip extensions on the banned list.
            print("File %s is not parsable" % file_path)
            continue  # Move on to the next file.
        if os.path.isfile(file_path):
            for email in get_emails(file_to_str(file_path)):
                print(email)

import resource
# megs is the desired cap in MiB; it is not defined in this snippet.
resource.setrlimit(resource.RLIMIT_AS, (megs * 1048576, resource.RLIM_INFINITY))
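
For reference, a Python 3 version of the memory cap might look like the sketch below. The helper name cap_address_space and the 2 GiB figure in the usage comment are hypothetical, not part of the original script. It assumes a Unix system (the resource module is not available on Windows) and leaves the hard limit untouched so the soft limit can be raised again later.

```python
import resource  # Unix-only standard-library module

MIB = 1024 * 1024


def cap_address_space(megs):
    """Set the soft RLIMIT_AS to `megs` MiB and return the previous limits.

    Hypothetical helper, not part of the original script.
    """
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_AS)
    # Keep the hard limit unchanged so the soft limit can be restored later.
    resource.setrlimit(resource.RLIMIT_AS, (megs * MIB, old_hard))
    return old_soft, old_hard


# Hypothetical usage: cap the process at 2 GiB, then handle the failure.
# cap_address_space(2048)
# try:
#     text = open("huge.txt", encoding="utf-8").read()
# except MemoryError:
#     print("File too large to read in one piece")
```

With a cap in place, an oversized f.read() should fail with MemoryError inside the script rather than exhausting the machine, but the cap does not make the script able to process such a file.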

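It looks like you are reading files of up to 8 GB into memory with f.read(). Instead, you could try applying the regular expression to each line of the file, so the whole file never has to sit in memory at once.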

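def get_emails_from_file(filename):
    """Yields matched emails from filename, reading one line at a time."""
    # A generator function keeps the file open while results are consumed.
    with open(filename, encoding='utf-8') as f:  # Added encoding='utf-8'
        for line in f:
            for email in re.findall(regex, line.lower()):
                if not email[0].startswith('//'):
                    yield email[0]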

However, this will still take a long time. Also, I have not checked your regular expression for possible problems.
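
As a self-contained illustration of the line-by-line approach, here is a minimal sketch. The pattern SIMPLE_EMAIL is a simplified stand-in (an assumption, not the regex from the question), and the name iter_emails is hypothetical. Memory use stays roughly proportional to the longest line rather than to the file size.

```python
import re

# Simplified stand-in pattern (an assumption); the question's regex is longer.
SIMPLE_EMAIL = re.compile(r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+")


def iter_emails(path, pattern=SIMPLE_EMAIL):
    """Yield email-like matches from the file at `path`, one line at a time."""
    # errors="replace" avoids crashing on bytes that are not valid UTF-8.
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:  # Only the current line is held in memory.
            for match in pattern.finditer(line.lower()):
                yield match.group(0)


# Hypothetical usage:
# for email in iter_emails("big_dump.log"):
#     print(email)
```

One caveat: if a file contains no newlines at all, the whole file is a single "line" and this approach would not help; reading fixed-size chunks would be needed in that case.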


