Releasing memory in python script

I have a Python script that scrapes some URLs. I have a list of URLs, and for each URL I fetch the HTML and do some logic with it.

I use Python 2.7.6 and Linux Mint 17 Cinnamon 64-bit.

The problem is that my main scraping object, which I instantiate for every URL, is never released from memory even though there are no references to it. Because of this my memory usage just keeps growing, and quickly (since my object is sometimes very big - up to 50 MB).

Simplified, the code looks something like this:

import resource

def scrape_url(url):
    """
    Simple helper method for scraping url
    :param url: url for scraping
    :return: some result
    """
    scraper = Scraper(url)  # instance main Scrape object
    result = scraper.scrape()  # scrape it

    return result

## SCRIPT STARTS HERE
urls = get_urls()  # fetch some list of urls

for url in urls:
    print 'MEMORY USAGE BEFORE SCRAPE: %s (kb)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    result = scrape_url(url)  # call helper method for scraping
    print 'MEMORY USAGE AFTER SCRAPE: %s (kb)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print '-' * 50

My output is something like this:

MEMORY USAGE BEFORE SCRAPE: 75732 (kb)
MEMORY USAGE AFTER SCRAPE: 137392 (kb)
--------------------------------------------------
MEMORY USAGE BEFORE SCRAPE: 137392 (kb)
MEMORY USAGE AFTER SCRAPE: 206748 (kb)
--------------------------------------------------
MEMORY USAGE BEFORE SCRAPE: 206748 (kb)
MEMORY USAGE AFTER SCRAPE: 284348 (kb)
--------------------------------------------------

The Scraper object is big and is not released from memory. I tried:

scraper = None

del scraper

or even calling gc to collect the object with:

gc.collect()

but nothing helped.
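Put together, those attempts looked roughly like this (a sketch using the same Scraper class as above):

import gc

def scrape_url(url):
    scraper = Scraper(url)
    result = scraper.scrape()
    scraper = None  # drop the reference
    del scraper     # and remove the name itself
    gc.collect()    # then force a collection pass
    return result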

When I print the number of references to the scraper object with:

print sys.getrefcount(scraper)

I get 2, which I think means that there are no other references to the object and that it should be cleaned up by gc.
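For what it's worth, sys.getrefcount includes the temporary reference created by passing the object as an argument, so 2 is exactly what a single live reference looks like:

import sys

x = object()
print sys.getrefcount(x)  # 2: the name `x` plus the temporary argument reference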

The Scraper object has lots of subobjects. Is it possible that a reference to one of its subobjects gets left somewhere, so that gc cannot release the main Scraper object? Or is there some other reason why Python doesn't release the memory?
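If that were the case, the standard gc module should be able to show what still points at the object; a rough debugging sketch:

import gc

scraper = Scraper(url)
result = scraper.scrape()
gc.collect()  # clear any collectable cycles first
for referrer in gc.get_referrers(scraper):
    print type(referrer)  # the frames, dicts, lists, ... keeping the object alive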

I found some topics regarding this on SO, and some of the responses say that memory cannot be released unless you spawn/kill child processes, which sounds really strange ( LINK ).
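For reference, the child-process approach from that link would look roughly like this with the standard multiprocessing module (a sketch; the pool settings are assumptions):

from multiprocessing import Pool

pool = Pool(processes=1, maxtasksperchild=1)  # recycle the worker after every task
for url in urls:
    # scrape_url runs in a short-lived child process, so all of its
    # memory is returned to the OS when the worker is recycled
    result = pool.apply(scrape_url, (url,))
pool.close()
pool.join()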

Thanks, Ivan

You are using a list, which has to be in memory at all times. Rewrite your loop to use a generator and scrape lazily. Something along the lines of:

def gen():
    for i in xrange(0, len(urls)):
        yield urls[i]
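In the same spirit, get_urls itself could be turned into a generator so the full list never materializes; a sketch, assuming the URLs come from a file (the question does not say where they come from):

def get_urls():
    with open('urls.txt') as f:  # hypothetical source file
        for line in f:
            yield line.strip()

for url in get_urls():
    result = scrape_url(url)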
