
Django Celery task on Heroku causes high memory usage

I have a celery task on Heroku that connects to an external API, retrieves some data, stores it in the database, and repeats this several hundred times. Very quickly (after ~10 loops) Heroku starts warning about high memory usage. Any ideas?

tasks.py

@app.task
def retrieve_details():
    for p in PObj.objects.filter(some_condition=True):
        p.fetch()

models.py

def fetch(self):
    v_data = self.service.getV(**dict(
        Number=self.v.number
    ))
    response = self.map_response(v_data)

    for key in ["some_key","some_other_key",]:
        setattr(self.v, key, response.get(key))

    self.v.save()

Heroku logs

2017-01-01 10:26:25.634
132 <45>1 2017-01-01T10:26:25.457411+00:00 heroku run.5891 - - Error R14 (Memory quota exceeded)

Go to the log: https://api.heroku.com/myapps/xxx@heroku.com/addons/logentries

You are receiving this email because your Logentries alarm "Memory quota exceeded"
has been triggered.

In context:
2017-01-01 10:26:25.568 131 <45>1 2017-01-01T10:26:25.457354+00:00 heroku run.5891 - - Process running mem=595M(116.2%)
2017-01-01 10:26:25.634 132 <45>1 2017-01-01T10:26:25.457411+00:00 heroku run.5891 - - Error R14 (Memory quota exceeded)

You're basically loading a bunch of data into a Python dictionary in memory. This will cause a lot of memory overhead, especially if you are grabbing a lot of objects from the local database.

Do you really need to store all of these objects in a dictionary?

What most people do for things like this is:

  • Retrieve one object at a time from the database.
  • Process that item (perform whatever logic you need).
  • Repeat.

This way, you only end up storing a single object in memory at any given time, thereby greatly reducing your memory footprint.

If I were you, I'd look for ways to move the logic into the database query, or simply process each item individually.
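One straightforward way to process each item individually without caching the whole queryset is Django's QuerySet.iterator(), which streams rows from the database instead of holding every result at once. A minimal sketch of the task from the question rewritten this way (model and method names are taken from the question, not from this answer):

@app.task
def retrieve_details():
    # .iterator() skips the queryset cache, so only a small number of
    # PObj rows are materialized in memory at any one time
    for p in PObj.objects.filter(some_condition=True).iterator():
        p.fetch()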

To expand on the veritable rdegges' thoughts, here are two strategies I have used in the past when working with celery/python to help reduce the memory footprint: (1) kick off subtasks that each process exactly one object, and/or (2) use generators.

  1. kick off subtasks that each process exactly one object:

     @app.task
     def retrieve_details():
         qs = PObj.objects.filter(some_condition=True)
         for p in qs.values_list('id', flat=True):
             do_fetch.delay(p)

     @app.task
     def do_fetch(n_id):
         p = PObj.objects.get(id=n_id)
         p.fetch()

    Now you can tune celery so that it kills off worker processes after processing N PObj tasks, keeping the memory footprint low, using --max-tasks-per-child.
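    For example, a minimal sketch of setting this limit in the Celery app configuration (the app name 'proj' and the limit of 50 are placeholders, not values from the original answer); this is equivalent to passing --max-tasks-per-child=50 on the worker command line:

     from celery import Celery

     app = Celery('proj')  # 'proj' is a placeholder app name
     # Recycle each worker process after it has executed 50 tasks,
     # so memory cannot accumulate indefinitely across tasks.
     app.conf.worker_max_tasks_per_child = 50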

  2. Using generators: you can also try this with a generator so that you can (theoretically) throw away each PObj after you call fetch:

     def ps_of_interest(chunk=10):
         n = chunk
         start = 0
         while n == chunk:
             some_ps = list(PObj.objects.filter(some_condition=True)[start:start + n])
             n = len(some_ps)
             start += chunk
             for p in some_ps:
                 yield p

     @app.task
     def retrieve_details():
         for p in ps_of_interest():
             p.fetch()

For my money, I'd go with option #1.
