
Reduce RAM usage in Python script

I've written a quick little program to scrape book data from a UNESCO website that contains information about book translations. The code is doing what I want it to, but by the time it has processed about 20 countries, it's using ~6GB of RAM. Since there are around 200 countries I need to process, this isn't going to work for me.

I'm not sure where all the RAM usage is coming from, so I'm not sure how to reduce it. I'm assuming that it's the dictionary that's holding all the book information, but I'm not positive. Should I simply run the program once for each country rather than processing the whole lot of them, or is there a better way to do it?

This is the first time I've written anything like this, and I'm a pretty novice, self-taught programmer, so please point out any significant flaws in the code, or any improvement tips you have, even if they don't directly relate to the question at hand.

This is my code; thanks in advance for any assistance.

from __future__ import print_function
import urllib2, os
from bs4 import BeautifulSoup, SoupStrainer

''' Set list of countries and their code for niceness in explaining what
is actually going on as the program runs. '''
countries = {"AFG":"Afghanistan","ALA":"Aland Islands","DZA":"Algeria"}

'''List of country codes; since dictionaries aren't ordered,
this makes it easier to pick up where processing left off if
the run fails at some point.'''
country_code_list = ["AFG","ALA","DZA"]

base_url = "http://www.unesco.org/xtrans/bsresult.aspx?lg=0&c="
destination_directory = "/Users/robbie/Test/"
only_restable = SoupStrainer(class_="restable")

class Book(object):
    def set_author(self,book): 
        '''Parse the webpage to find author names. Finds last name, then
        first name of original author(s) and sets the Book object's 
        Author attribute to the resulting string.'''

        authors = ""
        author_last_names = book.find_all('span',class_="sn_auth_name")
        author_first_names = book.find_all('span', attrs={\
            'class':"sn_auth_first_name"})
        if not author_last_names:
            self.author = " "
            return

        for author in author_last_names:
            try:
                # NOTE: pop() takes names from the end of the list, so
                # with multiple authors the first names may pair up in
                # reverse order; pop(0) may be what was intended.
                first_name = author_first_names.pop()
                authors = authors + author.getText() + ', ' + \
                    first_name.getText()

            except IndexError:
                authors = authors + (author.getText())
        self.author = authors

    def set_quality(self,book):
        ''' Check to see if book page is using Quality, then set it if 
        so.'''

        quality = book.find_all('span', class_="sn_auth_quality")

        if len(quality) == 0: self.quality = " "

        else: self.quality = quality[0].contents[0]

    def set_target_title(self,book): 
        target_title = book.find_all('span', class_="sn_target_title")
        if len(target_title) == 0: self.target_title = " "
        else: self.target_title = target_title[0].contents[0]

    def set_target_language(self,book): 
        target_language = book.find_all('span', class_="sn_target_lang")
        if len(target_language) == 0: self.target_language = " "
        else: self.target_language = target_language[0].contents[0]

    def set_translator_name(self,book) : 
        translators = ""
        translator_last_names = book.find_all('span', class_="sn_transl_name")
        translator_first_names = book.find_all('span', \
                                               class_="sn_transl_first_name")
        if translator_first_names == [] and translator_last_names == [] :
            self.translators = " "
            return None

        for translator in translator_last_names:
            try: 
                first_name = translator_first_names.pop()
                translators = translators + \
                    (translator.getText() + ',' \
                     + first_name.getText())
            except IndexError:
                translators = translators + \
                    (translator.getText())

        self.translators = translators  

    def set_published_city(self,book) : 
        published_city = book.find_all('span', class_="place")
        if len(published_city) == 0: 
            self.published_city = " "
        else: self.published_city = published_city[0].contents[0]

    def set_publisher(self,book) : 
        # NOTE: this searches class_="place", the same class used by
        # set_published_city; it looks like a copy-paste error, and the
        # correct class name for the publisher should be checked.
        publisher = book.find_all('span', class_="place")
        if len(publisher) == 0: 
            self.publisher = " "
        else: self.publisher = publisher[0].contents[0] 

    def set_published_country(self,book) : 
        published_country = book.find_all('span', \
                                        class_="sn_country")
        if len(published_country) == 0: 
            self.published_country = " "
        else: self.published_country = published_country[0].contents[0]

    def set_year(self,book) : 
        year = book.find_all('span', class_="sn_year")
        if len(year) == 0: 
            self.year = " "
        else: self.year = year[0].contents[0]   

    def set_pages(self,book) : 
        pages = book.find_all('span', class_="sn_pagination")
        if len(pages) == 0: 
            self.pages = " "
        else: self.pages = pages[0].contents[0] 

    def set_edition(self, book) :
        edition = book.find_all('span', class_="sn_editionstat")
        if len(edition) == 0: 
            self.edition = " "
        else: self.edition = edition[0].contents[0]

    def set_original_title(self,book) : 
        original_title = book.find_all('span', class_="sn_orig_title")
        if len(original_title) == 0: 
            self.original_title = " "
        else: self.original_title = original_title[0].contents[0]   

    def set_original_language(self,book) :
        languages = ''
        original_languages = book.find_all('span', \
                                         class_="sn_orig_lang")

        for language in original_languages:
            languages = languages + language.getText() + ', '

        self.original_languages = languages

    def export(self, country): 
        ''' Pull the text from the contents of the Book object's
        attributes and write it to the CSV file for the country in
        which the book was published.'''

        file_name = os.path.join(destination_directory, country + ".csv")

        with open(file_name, "a") as by_country_csv:
            print(self.author.encode('UTF-8') + " & " + \
                  self.quality.encode('UTF-8') + " & " + \
                  self.target_title.encode('UTF-8') + " & " + \
                  self.target_language.encode('UTF-8') + " & " + \
                  self.translators.encode('UTF-8') + " & " + \
                  self.published_city.encode('UTF-8') + " & " + \
                  self.publisher.encode('UTF-8') + " & " + \
                  self.published_country.encode('UTF-8') + " & " + \
                  self.year.encode('UTF-8') + " & " + \
                  self.pages.encode('UTF-8') + " & " + \
                  self.edition.encode('UTF-8') + " & " + \
                  self.original_title.encode('UTF-8') + " & " + \
                  self.original_languages.encode('UTF-8'), file=by_country_csv)
        # The with statement closes the file; no explicit close() is needed.

    def __init__(self, book, country):
        ''' Initialize the Book object by feeding it the HTML for its 
        row'''
        self.set_author(book)
        self.set_quality(book)
        self.set_target_title(book)
        self.set_target_language(book)

        self.set_translator_name(book)
        self.set_published_city(book)
        self.set_publisher(book)
        self.set_published_country(book)

        self.set_year(book)
        self.set_pages(book)
        self.set_edition(book)
        self.set_original_title(book)

        self.set_original_language(book)


def get_all_pages(country,base_url):
    ''' Find the total number of results for a country by adding the
    ISO 3166-1 alpha-3 country code to the base URL and reading the
    result count from the first results page. Returns an integer.'''

    base_page = urllib2.urlopen(base_url+country)
    page = BeautifulSoup(base_page, parse_only=only_restable)

    result_number = page.find_all('td',class_="res1",limit=1)
    if not result_number:
        return 0

    str_result_number = str(result_number[0].getText())
    results_total = int(str_result_number.split('/')[1])

    page.decompose()

    return results_total


def build_list(country_code_list, countries):
    '''  Build the list of all the books, and return a list of Book objects
    in case you want to do something with them in something else, ever.'''
    for country in country_code_list:

        print("Processing %s now..." % countries[country])
        results_total = get_all_pages(country, base_url)

        for url in range(results_total):
            if url % 10 == 0 :
                all_books = []  
                target_page = urllib2.urlopen(base_url + country \
                                             +"&fr="+str(url))
                page = BeautifulSoup(target_page, parse_only=only_restable)
                books = page.find_all('td',class_="res2")
                for book in books:
                    all_books.append(Book(book, country))
                page.decompose()

                for title in all_books:
                    title.export(country)    
    return

if __name__ == "__main__":
    build_list(country_code_list,countries)
    print("Completed.")

I guess I'll just list off some of the problems or possible improvements in no particular order:

  1. Follow PEP 8.

    Right now, you've got lots of variables and functions named using camel case, like setAuthor. That's not the conventional style for Python; Python would typically name that set_author (and published_country rather than PublishedCountry, etc.). You can even change the names of some of the things you're calling: for one, BeautifulSoup supports findAll for compatibility, but find_all is recommended.

    Besides naming, PEP 8 also specifies a few other things; for example, you'd want to rewrite this:

     if len(resultNumber) == 0 : return 0 

    as this:

     if len(result_number) == 0: return 0 

    or even, taking into account the fact that empty lists are falsy:

     if not result_number: return 0 
  2. Pass a SoupStrainer to BeautifulSoup.

    The information you're looking for is probably in only part of the document; you don't need to parse the whole thing into a tree. Pass a SoupStrainer as the parse_only argument to BeautifulSoup. This should reduce memory usage by discarding unnecessary parts early, as sketched below.
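
    A minimal sketch (here html is assumed to hold an already-fetched page, and "restable" is the class the code above already targets):

     from bs4 import BeautifulSoup, SoupStrainer

     only_restable = SoupStrainer(class_="restable")
     # Only elements matching the strainer are parsed into the tree;
     # everything outside them is discarded during parsing.
     page = BeautifulSoup(html, parse_only=only_restable)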

  3. decompose the soup when you're done with it.

    Python primarily uses reference counting for memory management, so removing all circular references (as decompose does) lets that primary mechanism free a lot of memory. Python also has a semi-traditional garbage collector to deal with circular references, but reference counting is much faster.
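
    In this scraper, that means copying what you need out of the tree as plain strings before dropping it; a sketch of the per-page pattern:

     page = BeautifulSoup(html, parse_only=only_restable)
     rows = page.find_all('td', class_="res2")
     titles = [row.get_text() for row in rows]  # plain strings, no tree references
     page.decompose()  # break the tree's internal references so it can be freed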

  4. Don't make Book.__init__ write things to disk.

    In most cases, I wouldn't expect just creating an instance of a class to write something to disk. Remove the call to export; let the user call export if they want the book written to disk.

  5. Stop holding on to so much data in memory.

    You're accumulating all this data into a dictionary just to export it afterwards. The obvious way to reduce memory use is to dump the data to disk as soon as possible. Your comment indicates that you're putting it in a dictionary to be flexible; but that doesn't mean you have to collect it all in a list: use a generator, yielding items as you scrape them. Then the user can iterate over it just like a list:

     for book in scrape_books(): book.export() 

    …but with the advantage that at most one book is kept in memory at a time.
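
    A sketch of that shape, reusing the pieces defined above (the scrape_books name and the (country, book) pairs it yields are illustrative, not drop-in code):

     def scrape_books(country_code_list):
         for country in country_code_list:
             results_total = get_all_pages(country, base_url)
             for offset in range(0, results_total, 10):
                 html = urllib2.urlopen(base_url + country + "&fr=" + str(offset))
                 page = BeautifulSoup(html, parse_only=only_restable)
                 for row in page.find_all('td', class_="res2"):
                     yield country, Book(row, country)  # one Book at a time
                 page.decompose()

     for country, book in scrape_books(country_code_list):
         book.export(country)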

  6. Use the functions in os.path rather than munging paths yourself.

    Your code right now is rather fragile when it comes to path names. If I accidentally removed the trailing slash from destinationDirectory, something unintended would happen. Using os.path.join prevents that from happening and deals with cross-platform differences:

     >>> os.path.join("/Users/robbie/Test/", "USA")
     '/Users/robbie/Test/USA'
     >>> os.path.join("/Users/robbie/Test", "USA")  # still works!
     '/Users/robbie/Test/USA'
     >>> # or say we were on Windows:
     >>> os.path.join(r"C:\Documents and Settings\robbie\Test", "USA")
     'C:\\Documents and Settings\\robbie\\Test\\USA'
  7. Abbreviate attrs={"class":...} to class_=...

    BeautifulSoup 4.1.2 introduces searching with class_, which removes the need for the verbose attrs={"class":...}.
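
    For example, the longer form used in set_author above:

     author_first_names = book.find_all('span', attrs={'class': "sn_auth_first_name"})

    can be shortened to:

     author_first_names = book.find_all('span', class_="sn_auth_first_name")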

I imagine there are even more things you can change, but that's quite a few to start with.

What do you want the book list for, in the end? You should export each book at the end of the "for url in range" block (inside it), and do without the all_books list. If you really need a list, define exactly which pieces of information you'll need instead of keeping full Book objects.
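
Concretely, a sketch of what the inside of that block could look like (it keeps at most one Book alive at a time):

     books = page.find_all('td', class_="res2")
     for book in books:
         Book(book, country).export(country)  # export immediately, keep nothing
     page.decompose()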
