
Memory leak in my Google App Engine code

I have the following code that is trying to loop over a large table (~100k rows; ~30GB)

import logging

from google.appengine.ext import deferred
from google.appengine.runtime import DeadlineExceededError

def updateEmailsInLoop(cursor=None, stats=None):
    # A mutable default argument ({}) is shared across calls; use None instead.
    stats = stats or {}
    BATCH_SIZE = 10
    try:
        rawEmails, next_cursor, more = RawEmailModel.query().fetch_page(BATCH_SIZE, start_cursor=cursor)
        for index, rawEmail in enumerate(rawEmails):
            stats = process_stats(rawEmail, stats)
        i = 0
        while more and next_cursor:
            rawEmails, next_cursor, more = RawEmailModel.query().fetch_page(BATCH_SIZE, start_cursor=next_cursor)
            for index, rawEmail in enumerate(rawEmails):
                stats = process_stats(rawEmail, stats)
            i = (i + 1) % 100
            if i == 99:
                logging.info("foobar: Finished 100 more %s", str(stats))
        write_stats(stats)
    except DeadlineExceededError:
        logging.info("foobar: Deadline exceeded")
        # Finish the batch that was interrupted, then hand off via a deferred task.
        for index, rawEmail in enumerate(rawEmails[index:], start=index):
            stats = process_stats(rawEmail, stats)
        if more and next_cursor:
            deferred.defer(updateEmailsInLoop, cursor=next_cursor, stats=stats, _queue="adminStats")

However, I keep getting the following error:

While handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.

...and sometimes....

Exceeded soft private memory limit of 128 MB with 154 MB after servicing 9 requests total

I changed my code so that I only ever pull in 10 entries at a time, so I don't understand why I'm still running out of memory.

There are 3 ways to do this kind of job (iterating over a large set of rows in the datastore):

  1. Process one batch of x entities, then create a task (push queue) using the cursor.
  2. Process one batch of x entities, then respond to the browser with a bit of JavaScript that shows the progress and changes window.location to a link containing the cursor and the current progress. (This is my preferred approach.)
  3. Use MapReduce. (It's harder to code, but it can be applied to 10M-1B rows.)
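Option 1 boils down to: fetch one page with a cursor, process it, then enqueue a fresh task that starts from the returned cursor, so no single request ever holds more than one batch. A minimal sketch of that control flow, using a plain in-memory paginator in place of `fetch_page` and a direct call in place of `deferred.defer` (both stand-ins are assumptions for illustration):

```python
def fetch_page(rows, batch_size, cursor):
    """Stand-in for RawEmailModel.query().fetch_page(): returns
    (batch, next_cursor, more), with the cursor modeled as an offset."""
    batch = rows[cursor:cursor + batch_size]
    next_cursor = cursor + len(batch)
    return batch, next_cursor, next_cursor < len(rows)

def process_batch(rows, cursor=0, stats=None, batch_size=100):
    """One 'task': handle a single page, then hand off via the cursor.
    On App Engine the recursive call would be deferred.defer(...)."""
    stats = stats or {"count": 0}
    batch, next_cursor, more = fetch_page(rows, batch_size, cursor)
    for row in batch:
        stats["count"] += 1          # i.e. process_stats(row, stats)
    if more:
        return process_batch(rows, next_cursor, stats, batch_size)  # next task
    return stats

stats = process_batch(list(range(1050)), batch_size=100)
print(stats["count"])  # 1050
```

Because each hop carries only the cursor and the accumulated stats, the memory footprint per request stays bounded by one batch.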

For most of my apps where I needed this, x is usually between 100 and 500. Here is the code I use to iterate over 1.5M-2M rows to generate reports or update data in my db. For reports, I save an entity containing the information I need in CSV format; at the end, I read all of those entities, merge them, and delete them. (I've used this to generate 1.5M rows of Excel data.) It's Java, but it should be easy to translate to Python:

 resp.getWriter().println("<html><head>");
 resp.getWriter().println(
     "<script type='text/javascript'>function f(){window.location.href='/do/convert/"
     + this.getClass().getSimpleName() + "?cursor=" + cursorString + "&count=" + count + "';}</script>");
 resp.getWriter().println("</head><body onload='f()'>");
 resp.getWriter().println(
     "<a href='/do/convert/" + this.getClass().getSimpleName() + "?cursor=" + cursorString
     + "&count=" + count + "'>Next page -->" + cursorString + "</a>");
 resp.getWriter().println("</body></html>");
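A rough Python translation of the snippet above, as a function that builds the self-redirecting progress page; the handler name and the `/do/convert/` route are carried over from the Java version (in a real App Engine handler you would write this string to the response instead of returning it):

```python
def progress_page(handler_name, cursor_string, count):
    """Build the self-redirecting progress page: body onload jumps to
    the next page via the cursor, with a manual link as a fallback."""
    next_url = "/do/convert/%s?cursor=%s&count=%d" % (handler_name, cursor_string, count)
    return (
        "<html><head>"
        "<script type='text/javascript'>"
        "function f(){{window.location.href='{u}';}}"
        "</script>"
        "</head><body onload='f()'>"
        "<a href='{u}'>Next page --&gt; {c}</a>"
        "</body></html>"
    ).format(u=next_url, c=cursor_string)
```

Each request renders one such page, so the browser drives the iteration and you can watch the cursor advance in the URL bar.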

If your "progress" is big and messy, save it in entities (one or more, depending on what you are doing). If you go with the task version, I recommend either using task names or making your tasks idempotent (especially if you're counting things). If you're counting, I recommend saving entities that contain the keys of the entities you are counting, and counting those at the end.
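The key-saving trick is what makes a retried task harmless: if the same batch runs twice and records the same keys again, counting distinct keys at the end still gives the right answer, whereas a plain incremented counter would be bumped twice. A small sketch, with an in-memory set standing in for the saved counter entities (the storage itself is an assumption here):

```python
seen_keys = set()  # stands in for the saved "counter" entities

def count_batch(entity_keys):
    """Record the keys processed by one task.
    Re-running the same batch (a task retry) is a no-op."""
    seen_keys.update(entity_keys)

def final_count():
    """Count distinct keys at the end instead of trusting a counter."""
    return len(seen_keys)

count_batch(["k1", "k2", "k3"])
count_batch(["k2", "k3", "k4"])   # a retry overlapping the previous batch
print(final_count())  # 4
```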
