Reduce RAM usage in Python script

I wrote a quick little program to scrape book data off of a UNESCO website that contains information about book translations. The code does what I want it to, but by the time it's processed about 20 countries, it's using roughly 6 GB of RAM. Since there are around 200 I need to process, this isn't going to work for me.

I'm not sure where all the RAM usage is coming from, so I'm not sure how to reduce it. I'm assuming that it's the dictionary holding all of the book information, but I'm not positive. Should I simply make the program run once for each country rather than processing the lot of them? Or is there a better way to do it?

This is the first time I've written anything like this, and I'm a fairly novice, self-taught programmer, so please point out any significant flaws in the code, or improvement tips you may have that don't directly relate to the question at hand.

Here's my code; thanks in advance for any assistance.

from __future__ import print_function
import urllib2, os
from bs4 import BeautifulSoup, SoupStrainer

''' Map of country codes to country names, so the program can explain
nicely what is actually going on as it runs. '''
countries = {"AFG":"Afghanistan","ALA":"Aland Islands","DZA":"Algeria"}

'''List of country codes; since dictionaries aren't sorted in any
way, this makes processing easier to resume if it fails at some
point, mid-run.'''
country_code_list = ["AFG","ALA","DZA"]

base_url = "http://www.unesco.org/xtrans/bsresult.aspx?lg=0&c="
destination_directory = "/Users/robbie/Test/"
only_restable = SoupStrainer(class_="restable")

class Book(object):
    def set_author(self, book):
        '''Parse the webpage to find author names. Finds the last name,
        then the first name of the original author(s) and sets the Book
        object's author attribute to the resulting string.'''

        authors = ""
        author_last_names = book.find_all('span', class_="sn_auth_name")
        author_first_names = book.find_all('span', attrs={
            'class': "sn_auth_first_name"})
        if not author_last_names:
            self.author = " "
            return

        for author in author_last_names:
            try:
                # pop(0) keeps first names paired with their last names
                first_name = author_first_names.pop(0)
                authors = authors + author.getText() + ', ' + \
                    first_name.getText()

            except IndexError:
                authors = authors + author.getText()
        self.author = authors

    def set_quality(self,book):
        ''' Check to see if book page is using Quality, then set it if 
        so.'''

        quality = book.find_all('span', class_="sn_auth_quality")

        if len(quality) == 0: self.quality = " "

        else: self.quality = quality[0].contents[0]

    def set_target_title(self,book): 
        target_title = book.find_all('span', class_="sn_target_title")
        if len(target_title) == 0: self.target_title = " "
        else: self.target_title = target_title[0].contents[0]

    def set_target_language(self,book): 
        target_language = book.find_all('span', class_="sn_target_lang")
        if len(target_language) == 0: self.target_language = " "
        else: self.target_language = target_language[0].contents[0]

    def set_translator_name(self,book) : 
        translators = ""
        translator_last_names = book.find_all('span', class_="sn_transl_name")
        translator_first_names = book.find_all('span', \
                                               class_="sn_transl_first_name")
        if translator_first_names == [] and translator_last_names == [] :
            self.translators = " "
            return None

        for translator in translator_last_names:
            try:
                # pop(0) keeps first names paired with their last names
                first_name = translator_first_names.pop(0)
                translators = translators + \
                    (translator.getText() + ',' \
                     + first_name.getText())
            except IndexError:
                translators = translators + \
                    (translator.getText())

        self.translators = translators  

    def set_published_city(self,book) : 
        published_city = book.find_all('span', class_="place")
        if len(published_city) == 0: 
            self.published_city = " "
        else: self.published_city = published_city[0].contents[0]

    def set_publisher(self,book) : 
        # Note: this searches the same "place" class as set_published_city;
        # the publisher span's actual class on the page may differ.
        publisher = book.find_all('span', class_="place")
        if len(publisher) == 0: 
            self.publisher = " "
        else: self.publisher = publisher[0].contents[0] 

    def set_published_country(self,book) : 
        published_country = book.find_all('span', \
                                        class_="sn_country")
        if len(published_country) == 0: 
            self.published_country = " "
        else: self.published_country = published_country[0].contents[0]

    def set_year(self,book) : 
        year = book.find_all('span', class_="sn_year")
        if len(year) == 0: 
            self.year = " "
        else: self.year = year[0].contents[0]   

    def set_pages(self,book) : 
        pages = book.find_all('span', class_="sn_pagination")
        if len(pages) == 0: 
            self.pages = " "
        else: self.pages = pages[0].contents[0] 

    def set_edition(self, book) :
        edition = book.find_all('span', class_="sn_editionstat")
        if len(edition) == 0: 
            self.edition = " "
        else: self.edition = edition[0].contents[0]

    def set_original_title(self,book) : 
        original_title = book.find_all('span', class_="sn_orig_title")
        if len(original_title) == 0: 
            self.original_title = " "
        else: self.original_title = original_title[0].contents[0]   

    def set_original_language(self,book) :
        languages = ''
        original_languages = book.find_all('span', \
                                         class_="sn_orig_lang")

        for language in original_languages:
            languages = languages + language.getText() + ', '

        self.original_languages = languages

    def export(self, country): 
        '''Pull the text from the contents of the Book object's
        attributes and write it to the CSV file for the country in
        which the book was published.'''

        file_name = os.path.join(destination_directory, country + ".csv")

        with open(file_name, "a") as by_country_csv:        
            print(self.author.encode('UTF-8') + " & " + \
                  self.quality.encode('UTF-8') + " & " + \
                  self.target_title.encode('UTF-8') + " & " + \
                  self.target_language.encode('UTF-8') + " & " + \
                  self.translators.encode('UTF-8') + " & " + \
                  self.published_city.encode('UTF-8') + " & " + \
                  self.publisher.encode('UTF-8') + " & " + \
                  self.published_country.encode('UTF-8') + " & " + \
                  self.year.encode('UTF-8') + " & " + \
                  self.pages.encode('UTF-8') + " & " + \
                  self.edition.encode('UTF-8') + " & " + \
                  self.original_title.encode('UTF-8') + " & " + \
                  self.original_languages.encode('UTF-8'), file=by_country_csv)

    def __init__(self, book, country):
        ''' Initialize the Book object by feeding it the HTML for its 
        row'''
        self.set_author(book)
        self.set_quality(book)
        self.set_target_title(book)
        self.set_target_language(book)

        self.set_translator_name(book)
        self.set_published_city(book)
        self.set_publisher(book)
        self.set_published_country(book)

        self.set_year(book)
        self.set_pages(book)
        self.set_edition(book)
        self.set_original_title(book)

        self.set_original_language(book)


def get_all_pages(country, base_url):
    '''Fetch the first results page for a country by appending its
    ISO 3166-1 alpha-3 code to the URL, and return the total number
    of results as an int.'''

    base_page = urllib2.urlopen(base_url+country)
    page = BeautifulSoup(base_page, parse_only=only_restable)

    result_number = page.find_all('td',class_="res1",limit=1)
    if not result_number:
        return 0

    str_result_number = str(result_number[0].getText())
    results_total = int(str_result_number.split('/')[1])

    page.decompose()

    return results_total


def build_list(country_code_list, countries):
    '''Scrape the books for each country, ten results (one page) at a
    time, and export them to that country's CSV file.'''
    for country in country_code_list:

        print("Processing %s now..." % countries[country])
        results_total = get_all_pages(country, base_url)

        # 'url' is really the result offset passed via the &fr=
        # parameter, stepping through the results ten at a time.
        for url in range(0, results_total, 10):
            all_books = []
            target_page = urllib2.urlopen(base_url + country \
                                          + "&fr=" + str(url))
            page = BeautifulSoup(target_page, parse_only=only_restable)
            books = page.find_all('td', class_="res2")
            for book in books:
                all_books.append(Book(book, country))
            page.decompose()

            for title in all_books:
                title.export(country)
    return

if __name__ == "__main__":
    build_list(country_code_list,countries)
    print("Completed.")

I guess I'll just list a few of the problems and possible improvements, in no particular order:

  1. Follow PEP 8.

    Right now, you've got lots of variables and functions named using camelCase, like setAuthor. That's not the conventional style for Python; Python would typically name that set_author (and published_country rather than PublishedCountry, etc.). You can even change the names of some of the things you're calling: for one, BeautifulSoup supports findAll for compatibility, but find_all is the recommended spelling.

    Besides naming, PEP 8 specifies a few other things as well. For example, you'd want to rewrite this:

     if len(resultNumber) == 0 : return 0 

    as this:

     if len(result_number) == 0: return 0 

    or even, taking into account the fact that empty lists are falsy:

     if not result_number: return 0 
  2. Pass a SoupStrainer to BeautifulSoup.

    The information you're looking for is probably in only part of the document, so you don't need to parse the whole thing into a tree. Pass a SoupStrainer as the parse_only argument to BeautifulSoup. This should reduce memory usage by discarding the unnecessary parts early.
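
    For instance, a minimal sketch (the URL is the question's base_url with one country code appended; that only the "restable" element matters is an assumption about the page's markup):

     from bs4 import BeautifulSoup, SoupStrainer
     import urllib2

     # Parse only the results table; everything else is discarded
     # during parsing, before a tree is ever built for it.
     only_restable = SoupStrainer(class_="restable")
     html = urllib2.urlopen("http://www.unesco.org/xtrans/bsresult.aspx?lg=0&c=AFG")
     page = BeautifulSoup(html, parse_only=only_restable)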

  3. decompose your soups when you're done with them.

    Python primarily uses reference counting, so removing all circular references (as decompose does) should let its primary garbage-collection mechanism, reference counting, free up a lot of memory. Python also has a semi-traditional garbage collector to deal with circular references, but reference counting is much faster.
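
    In this script, that means calling decompose as soon as the books on a page have been handled, e.g. (a sketch mirroring the question's build_list loop):

     page = BeautifulSoup(target_page, parse_only=only_restable)
     for book in page.find_all('td', class_="res2"):
         all_books.append(Book(book, country))
     # Break the tree's internal references so reference counting
     # can reclaim the whole soup right away.
     page.decompose()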

  4. Don't make Book.__init__ write things to disk.

    In most cases, I wouldn't expect merely creating an instance of a class to write something to disk. Remove the call to export; let the user call export if they want it put on the disk.
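
    The intended call pattern would then be something like:

     book = Book(row, country)  # __init__ only parses the row
     book.export(country)       # the caller decides when to write to disk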

  5. Stop keeping so much data in memory.

    You're accumulating all this data into a dictionary only to export it later. The obvious way to reduce memory is to dump it to disk as soon as possible. Your comment indicates that you're putting it in a dictionary to stay flexible; but that doesn't mean you have to collect it all in a list: use a generator, yielding items as you scrape them. Then the user can iterate over it just like a list:

     for book in scrape_books():
         book.export()

    …but with the advantage that at most one book is kept in memory at a time.
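
    A sketch of what scrape_books could look like here, reusing the question's get_all_pages, only_restable, and base_url (the name scrape_books is just illustrative; here it takes the country code, so the call would be scrape_books(country)):

     def scrape_books(country):
         '''Yield one Book at a time instead of collecting them all.'''
         results_total = get_all_pages(country, base_url)
         for offset in range(0, results_total, 10):
             target_page = urllib2.urlopen(base_url + country + "&fr=" + str(offset))
             page = BeautifulSoup(target_page, parse_only=only_restable)
             for row in page.find_all('td', class_="res2"):
                 yield Book(row, country)
             page.decompose()

    Each Book becomes garbage as soon as the loop moves on, so memory use stays flat no matter how many results a country has.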

  6. Use os.path's functions rather than munging paths yourself.

    Right now, your code is rather fragile when it comes to path names. If I accidentally removed the trailing slash from destinationDirectory, something unintended would happen. Using os.path.join prevents that and deals with cross-platform differences:

     >>> os.path.join("/Users/robbie/Test/", "USA")
     '/Users/robbie/Test/USA'
     >>> os.path.join("/Users/robbie/Test", "USA")  # still works!
     '/Users/robbie/Test/USA'
     >>> # or say we were on Windows:
     >>> os.path.join(r"C:\Documents and Settings\robbie\Test", "USA")
     'C:\\Documents and Settings\\robbie\\Test\\USA'
  7. Abbreviate attrs={"class": ...} to class_=...

    BeautifulSoup 4.1.2 introduced searching with class_, which removes the need for the verbose attrs={"class": ...}.
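
    For example, this:

     book.find_all('span', attrs={'class': 'sn_auth_first_name'})

    becomes:

     book.find_all('span', class_='sn_auth_first_name')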

I imagine there are even more things you could change, but that's quite a lot to start with.

What do you want the Book for, in the end? You should export each book at the end of (that is, inside) the "for url in range" block, and not keep the all_books collection around. If you really do need a list, define exactly which information you need instead of keeping complete Book objects.
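
One way that could look inside build_list (a sketch using the question's names):

    for url in range(0, results_total, 10):
        target_page = urllib2.urlopen(base_url + country + "&fr=" + str(url))
        page = BeautifulSoup(target_page, parse_only=only_restable)
        for book in page.find_all('td', class_="res2"):
            # Write each book out immediately instead of accumulating it.
            Book(book, country).export(country)
        page.decompose()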
