
Memory leak in my Google App Engine code

I have the following code that is trying to loop over a large table (~100k rows; ~30GB)

import logging

from google.appengine.ext import deferred
from google.appengine.runtime import DeadlineExceededError

def updateEmailsInLoop(cursor=None, stats=None):
    # A mutable default argument ({}) is shared across calls; use None instead.
    stats = stats or {}
    BATCH_SIZE = 10
    try:
        rawEmails, next_cursor, more = RawEmailModel.query().fetch_page(BATCH_SIZE, start_cursor=cursor)
        for index, rawEmail in enumerate(rawEmails):
            stats = process_stats(rawEmail, stats)
        i = 0
        while more and next_cursor:
            rawEmails, next_cursor, more = RawEmailModel.query().fetch_page(BATCH_SIZE, start_cursor=next_cursor)
            for index, rawEmail in enumerate(rawEmails):
                stats = process_stats(rawEmail, stats)
            i = (i + 1) % 100
            if i == 99:
                logging.info("foobar: Finished 100 more %s", str(stats))
        write_stats(stats)
    except DeadlineExceededError:
        logging.info("foobar: Deadline exceeded")
        # Finish the batch that was interrupted, then hand off via a deferred task.
        for index, rawEmail in enumerate(rawEmails[index:], start=index):
            stats = process_stats(rawEmail, stats)
        if more and next_cursor:
            deferred.defer(updateEmailsInLoop, cursor=next_cursor, stats=stats, _queue="adminStats")

However, I keep getting the following error:

While handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.

...and sometimes....

Exceeded soft private memory limit of 128 MB with 154 MB after servicing 9 requests total

I changed my code so that I only ever pull in 10 entries at a time, so I don't understand why I'm still running out of memory.

There are 3 ways to do this kind of job (iterating over a large set of rows in the datastore):

  1. Process one batch of x entities, then create a task (push queue) using the cursor.
  2. Process one batch of x entities, then respond to the browser with a bit of JavaScript that shows the progress and changes window.location to a link containing the cursor and the current progress. (This is my preferred approach.)
  3. Use MapReduce. (It's harder to code, but it can be applied to 10M-1B rows.)
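Option 1 boils down to: fetch one page with a cursor, process it, then enqueue a fresh task that starts from the returned cursor, so no single request ever holds more than one batch. A minimal sketch of that control flow, using a plain in-memory paginator in place of `fetch_page` and a direct call in place of `deferred.defer` (both stand-ins are assumptions for illustration):

```python
def fetch_page(rows, batch_size, cursor):
    """Stand-in for RawEmailModel.query().fetch_page(): returns
    (batch, next_cursor, more), with the cursor modeled as an offset."""
    batch = rows[cursor:cursor + batch_size]
    next_cursor = cursor + len(batch)
    return batch, next_cursor, next_cursor < len(rows)

def process_batch(rows, cursor=0, stats=None, batch_size=100):
    """One 'task': handle a single page, then hand off via the cursor.
    On App Engine the recursive call would be deferred.defer(...)."""
    stats = stats or {"count": 0}
    batch, next_cursor, more = fetch_page(rows, batch_size, cursor)
    for row in batch:
        stats["count"] += 1          # i.e. process_stats(row, stats)
    if more:
        return process_batch(rows, next_cursor, stats, batch_size)  # next task
    return stats

stats = process_batch(list(range(1050)), batch_size=100)
print(stats["count"])  # 1050
```

Because each hop carries only the cursor and the accumulated stats, the memory footprint per request stays bounded by one batch.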

For most of my apps where I needed this, x is usually between 100 and 500. Here is the code I use to iterate over 1.5M-2M rows to generate reports or update data in my db. For reports, I save an entity containing the information I need in CSV format; at the end, I read all of those entities, merge them, and delete them. (I've used this to generate 1.5M rows of Excel data.) It's Java, but it should be easy to translate to Python:

 resp.getWriter().println("<html><head>");
 resp.getWriter().println(
     "<script type='text/javascript'>function f(){window.location.href='/do/convert/"
     + this.getClass().getSimpleName() + "?cursor=" + cursorString + "&count=" + count + "';}</script>");
 resp.getWriter().println("</head><body onload='f()'>");
 resp.getWriter().println(
     "<a href='/do/convert/" + this.getClass().getSimpleName() + "?cursor=" + cursorString
     + "&count=" + count + "'>Next page -->" + cursorString + "</a>");
 resp.getWriter().println("</body></html>");
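A rough Python translation of the snippet above, as a function that builds the self-redirecting progress page; the handler name and the `/do/convert/` route are carried over from the Java version (in a real App Engine handler you would write this string to the response instead of returning it):

```python
def progress_page(handler_name, cursor_string, count):
    """Build the self-redirecting progress page: body onload jumps to
    the next page via the cursor, with a manual link as a fallback."""
    next_url = "/do/convert/%s?cursor=%s&count=%d" % (handler_name, cursor_string, count)
    return (
        "<html><head>"
        "<script type='text/javascript'>"
        "function f(){{window.location.href='{u}';}}"
        "</script>"
        "</head><body onload='f()'>"
        "<a href='{u}'>Next page --&gt; {c}</a>"
        "</body></html>"
    ).format(u=next_url, c=cursor_string)
```

Each request renders one such page, so the browser drives the iteration and you can watch the cursor advance in the URL bar.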

If your "progress" is big and messy, save it in entities (one or more, depending on what you are doing). If you go with the task version, I recommend either using task names or making your tasks idempotent (especially if you're counting things). If you're counting, I recommend saving entities that contain the keys of the entities you are counting, and counting those at the end.
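The key-saving trick is what makes a retried task harmless: if the same batch runs twice and records the same keys again, counting distinct keys at the end still gives the right answer, whereas a plain incremented counter would be bumped twice. A small sketch, with an in-memory set standing in for the saved counter entities (the storage itself is an assumption here):

```python
seen_keys = set()  # stands in for the saved "counter" entities

def count_batch(entity_keys):
    """Record the keys processed by one task.
    Re-running the same batch (a task retry) is a no-op."""
    seen_keys.update(entity_keys)

def final_count():
    """Count distinct keys at the end instead of trusting a counter."""
    return len(seen_keys)

count_batch(["k1", "k2", "k3"])
count_batch(["k2", "k3", "k4"])   # a retry overlapping the previous batch
print(final_count())  # 4
```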
