
Why is this Python method leaking memory?

This method iterates over a list of terms from the database, checks whether each term appears in the text passed as a parameter and, if it does, replaces it with a link to the search page that uses the term as its query.

The number of terms is high (about 100000), so the process is slow, but that's fine since it runs as a cron job. However, it makes the script's memory consumption skyrocket and I can't find why:

import re

class SearchedTerm(models.Model):

[...]

@classmethod
def add_search_links_to_text(cls, string, count=3, queryset=None):
    """
        Take a list of all researched terms and search them in the 
        text. If they exist, turn them into links to the search
        page.

        This process is limited to `count` replacements maximum.

        WARNING: because the sites got different URLS schemas, we don't
        provides direct links, but we inject the {% url %} tag 
        so it must be rendered before display. You can use the `eval`
        tag from `libs` for this. Since they got different namespace as
        well, we enter a generic 'namespace' and delegate to the 
        template to change it with the proper one as well.

        If you have a batch process to do, you can pass a query set
        that will be used instead of getting all searched term at
        each calls.
    """

    found = 0

    terms = queryset or cls.on_site.all()

    # to avoid replacing the same searched term twice, keep a set of
    # already linkified content, seeded with the words we are going to
    # insert with the link so they won't match on later passes
    processed = set((u'video', u'streaming', u'title',
                     u'search', u'namespace', u'href',
                     u'url'))

    for term in terms:

        text = term.text.lower()

        # skip small words, and do a quick substring check
        # to avoid all the rest of the matching
        if len(text) < 3 or text not in string:
            continue

        if found and cls._is_processed(text, processed):
            continue

        # match the search word with accent, for any case
        # ensure this is not part of a word by including 
        # two 'non-letter' character on both ends of the word
        pattern = re.compile(ur'([^\w]|^)(%s)([^\w]|$)' % re.escape(text),
                            re.UNICODE|re.IGNORECASE)

        if re.search(pattern, string):
            found += 1

            # create the link string and
            # replace the word in the description
            # use back references (\1, \2, etc.) to preserve the original
            # formatting
            # use raw unicode strings (ur"string" notation) to avoid
            # problems with accents and escaping

            query = '-'.join(term.text.split())
            url = ur'{%% url namespace:static-search "%s" %%}' % query
            replace_with = ur'\1<a title="\2 video streaming" href="%s">\2</a>\3' % url

            string = re.sub(pattern, replace_with, string)

            processed.add(text)

            # stop once the maximum number of replacements is reached
            if found >= count:
                break

    return string

You may also need this code:

class SearchedTerm(models.Model):

[...]

@classmethod
def _is_processed(cls, text, processed):
    """
        Check if the text if part of the already processed string
        we don't use `in` the set, but `in ` each strings of the set
        to avoid subtring matching that will destroy the tags.

        This is mainly an utility function so you probably won't use
        it directly.
    """
    if text in processed:
        return True

    return any(((text in string) for string in processed))
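As a standalone illustration of this guard (with made-up strings), the per-string substring check keeps a short term from re-matching inside a link that was already inserted:

```python
def is_processed(text, processed):
    # Fast path: exact membership in the set.
    if text in processed:
        return True
    # Slow path: substring check against every processed string, so a
    # short term can't re-match inside text we already linkified.
    return any(text in string for string in processed)

processed = {u'href', u'cat video streaming'}
print(is_processed(u'cat', processed))  # -> True ('cat' is inside an entry)
print(is_processed(u'dog', processed))  # -> False
```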

I really have only two objects holding references that could be the suspects here: `terms` and `processed`. But I can't see any reason for them not to be garbage collected.

EDIT:

I guess I should mention that this method is called inside a Django model method itself. I don't know if it's relevant, but here is the code:

class Video(models.Model):

[...]

def update_html_description(self, links=3, queryset=None):
    """
        Take a list of all researched terms and search them in the 
        description. If they exist, turn them into links to the search
        engine. Put the reset into `html_description`.

        This use `add_search_link_to_text` and has therefor, the same 
        limitations.

        It DOESN'T call save().
    """
    queryset = queryset or SearchedTerm.objects.filter(sites__in=self.sites.all())
    text = self.description or self.title
    self.html_description = SearchedTerm.add_search_links_to_text(text, 
                                                                  links, 
                                                                  queryset)

I can imagine that the automatic Python regex caching eats up some memory. But that should happen only once, while the memory consumption grows at every call of `update_html_description`.
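For what it's worth, the regex cache in CPython is bounded, so it cannot grow without limit, and it can be cleared explicitly with `re.purge()`. A quick sketch (this pokes at `re._cache` and `re._MAXCACHE`, which are CPython implementation details, not public API):

```python
import re

# Compile many distinct patterns; re.compile() stores each one in an
# internal cache that is capped at re._MAXCACHE entries in CPython.
for i in range(1000):
    re.compile(r'pattern-%d' % i)

print(len(re._cache) <= re._MAXCACHE)  # -> True: the cache is size-limited

re.purge()              # drop every cached pattern explicitly
print(len(re._cache))   # -> 0
```

So the regex cache alone cannot explain memory that grows on every call.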

The problem is not just that it consumes a lot of memory; the problem is that it doesn't release it: every call takes about 3% of the RAM, eventually filling it up and crashing the script with "cannot allocate memory".

As soon as you call it, the whole queryset is loaded into memory, and that will eat up your memory. You want to fetch chunks of results if the result set is that large; it may mean more hits on the database, but it also means much less memory consumption.
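In Django terms that usually means `QuerySet.iterator()` (or slicing the queryset in chunks) instead of iterating the queryset directly, which caches every row. A framework-free sketch of the chunked-fetch idea, with a hypothetical `fetch_chunk` standing in for a LIMIT/OFFSET query:

```python
def fetch_chunk(offset, size, total=10):
    # Hypothetical stand-in for a LIMIT/OFFSET database query against
    # a table that holds `total` rows.
    return ['term-%d' % i for i in range(offset, min(offset + size, total))]

def iter_terms(chunk_size=4):
    """Yield terms one chunk at a time, so only `chunk_size` rows are
    ever held in memory instead of the whole result set."""
    offset = 0
    while True:
        chunk = fetch_chunk(offset, chunk_size)
        if not chunk:
            return
        for term in chunk:
            yield term
        offset += chunk_size

print(sum(1 for _ in iter_terms()))  # -> 10: all rows seen, 4 at a time
```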

I completely failed to find the cause of the problem, but for now I'm working around it by isolating the infamous snippet in a separate script that I call with `subprocess`. The memory goes up, but of course it goes back to normal after the Python process dies.
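The workaround boils down to running the leaky work in a child process so the operating system reclaims all of its memory when it exits. A minimal sketch (the inline `-c` program stands in for the real script):

```python
import subprocess
import sys

# Run the memory-hungry work in a child interpreter; everything it
# allocates is returned to the OS when the process terminates.
result = subprocess.run(
    [sys.executable, '-c', 'print("processed")'],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # -> processed
```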

Talk about dirty.

But that's all I've got for now.

Make sure you are not running in DEBUG.
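This matters because with `DEBUG = True` Django appends every executed SQL statement to `django.db.connection.queries` for the life of the process, which looks exactly like a leak in a long-running cron job. The settings-side fix, sketched as a fragment:

```python
# settings.py
DEBUG = False  # stops Django from keeping every SQL query in memory

# If DEBUG has to stay on, flush the in-memory query log periodically:
# from django import db
# db.reset_queries()
```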

I guess I should mention that this method is called inside a Django model method itself.

@classmethod

Why? Why is this at "class level"?

Why can't these ordinary methods have ordinary scope rules and, in the normal course of events, get garbage collected?

In other words (in the form of an answer):

Get rid of `@classmethod`.
