
Django Celery task on Heroku causes high memory usage

I have a celery task on Heroku that connects to an external API, retrieves some data, stores it in the database, and repeats this several hundred times. Very quickly (after ~10 loops) Heroku starts warning about high memory usage. Any ideas?

tasks.py

@app.task
def retrieve_details():
    for p in PObj.objects.filter(some_condition=True):
        p.fetch()

models.py

def fetch(self):
    v_data = self.service.getV(**dict(
        Number=self.v.number
    ))
    response = self.map_response(v_data)

    for key in ["some_key","some_other_key",]:
        setattr(self.v, key, response.get(key))

    self.v.save()

Heroku logs

2017-01-01 10:26:25.634
132 <45>1 2017-01-01T10:26:25.457411+00:00 heroku run.5891 - - Error R14 (Memory quota exceeded)

Go to the log: https://api.heroku.com/myapps/xxx@heroku.com/addons/logentries

You are receiving this email because your Logentries alarm "Memory quota exceeded"
has been triggered.

In context:
2017-01-01 10:26:25.568 131 <45>1 2017-01-01T10:26:25.457354+00:00 heroku run.5891 - - Process running mem=595M(116.2%)
2017-01-01 10:26:25.634 132 <45>1 2017-01-01T10:26:25.457411+00:00 heroku run.5891 - - Error R14 (Memory quota exceeded)

You're basically loading a bunch of data into a Python dictionary in memory. This will cause a lot of memory overhead, especially if you are grabbing a lot of objects from the local database.

Do you really need to store all of these objects in a dictionary?

What most people do for things like this is:

  • Retrieve one object at a time from the database.
  • Process that item (perform whatever logic you need).
  • Repeat.

This way, you only end up storing a single object in memory at any given time, thereby greatly reducing your memory footprint.

If I were you, I'd look for ways to move the logic into the database query, or simply process each item individually.
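One straightforward way to process each item individually without caching the whole queryset is Django's QuerySet.iterator(), which streams rows from the database instead of holding every result at once. A minimal sketch of the task from the question rewritten this way (model and method names are taken from the question, not from this answer):

@app.task
def retrieve_details():
    # .iterator() skips the queryset cache, so only a small number of
    # PObj rows are materialized in memory at any one time
    for p in PObj.objects.filter(some_condition=True).iterator():
        p.fetch()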

To expand on the veritable rdegges' thoughts, here are two strategies I have used in the past when working with celery/python to help reduce the memory footprint: (1) kick off subtasks that each process exactly one object, and/or (2) use generators.

  1. kick off subtasks that each process exactly one object:

     @app.task
     def retrieve_details():
         qs = PObj.objects.filter(some_condition=True)
         for p in qs.values_list('id', flat=True):
             do_fetch.delay(p)

     @app.task
     def do_fetch(n_id):
         p = PObj.objects.get(id=n_id)
         p.fetch()

    Now you can tune celery so that it kills off worker processes after processing N PObj tasks, keeping the memory footprint low, using --max-tasks-per-child.
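    For example, a minimal sketch of setting this limit in the Celery app configuration (the app name 'proj' and the limit of 50 are placeholders, not values from the original answer); this is equivalent to passing --max-tasks-per-child=50 on the worker command line:

     from celery import Celery

     app = Celery('proj')  # 'proj' is a placeholder app name
     # Recycle each worker process after it has executed 50 tasks,
     # so memory cannot accumulate indefinitely across tasks.
     app.conf.worker_max_tasks_per_child = 50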

  2. Using generators: you can also try this with a generator so that you can (theoretically) throw away each PObj after you call fetch:

     def ps_of_interest(chunk=10):
         n = chunk
         start = 0
         while n == chunk:
             some_ps = list(PObj.objects.filter(some_condition=True)[start:start + n])
             n = len(some_ps)
             start += chunk
             for p in some_ps:
                 yield p

     @app.task
     def retrieve_details():
         for p in ps_of_interest():
             p.fetch()

For my money, I'd go with option #1.
