
Memory leak in Google App Engine / Datastore / Flask / Python app

I have built a simple news aggregator site, in which the memory usage of all my App Engine instances keeps growing until it reaches the limit, at which point the instance is shut down.

I have started to eliminate everything from my app to arrive at a minimal reproducible version. This is what I have now:


from flask import Flask
from google.cloud import datastore

app = Flask(__name__)

datastore_client = datastore.Client()

@app.route('/')
def root():
    # fetch all 'source' entities, ordered by list_sequence
    query = datastore_client.query(kind='source')
    query.order = ['list_sequence']
    sources = query.fetch()

    for source in sources:
        pass

    return 'OK'
    

Stats show a typical saw-tooth pattern: at instance startup, memory usage goes to 190-210 MB, then upon some requests, but NOT ALL requests, it increases by 20-30 MB. (This, by the way, roughly corresponds to the estimated size of the query results, although I cannot be sure this is relevant info.) This keeps happening until usage exceeds 512 MB, at which point the instance is shut down. It usually happens around the 50th-100th request to "/". No other requests are made to anything else in the meantime.

Now, if I eliminate the for loop, so that only the query remains, the problem goes away: memory usage stays flat at 190 MB, with no increase even after 100+ requests.

Calling gc.collect() at the end does not help. I have also tried comparing tracemalloc stats at the beginning and end of the function, but I have not found anything useful.
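For reference, this is roughly the tracemalloc comparison I tried (a sketch only; the /debug route name is just for illustration):

import tracemalloc

tracemalloc.start()

@app.route('/debug')
def debug_route():
    before = tracemalloc.take_snapshot()
    query = datastore_client.query(kind='source')
    query.order = ['list_sequence']
    for source in query.fetch():
        pass
    after = tracemalloc.take_snapshot()
    # show the ten largest allocation deltas, grouped by source line
    for stat in after.compare_to(before, 'lineno')[:10]:
        print(stat)
    return 'OK'

Note that tracemalloc only tracks allocations made through Python's allocator, so memory held by C extensions (such as gRPC's native code) would not show up here, which may be why this turned up nothing.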

Has anyone experienced anything similar, please? Any ideas what might go wrong here? What additional tests / investigations can you recommend? Is this possibly a Google App Engine / Datastore issue I have no control of?

Thank you.

[Screenshot: instance memory usage graph showing the saw-tooth pattern]


query.fetch() returns an iterator, not an actual list of the results:

https://googleapis.dev/python/datastore/latest/queries.html#google.cloud.datastore.query.Query.fetch

Looking at the source code, this iterator has code for fetching the next pages of the query, so your for loop forces it to fetch all the pages of the results. In fact, I don't think it actually fetches anything until you start iterating. That would be why removing your for loop makes a difference.
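To make the laziness concrete, here is a sketch of consuming the results page by page (the .pages and next_page_token attributes come from the underlying google-api-core page iterator; treat the details as an assumption):

query = datastore_client.query(kind='source')
query.order = ['list_sequence']

results = query.fetch()  # no RPC has happened yet; this is a lazy iterator

# iterating drives the pagination: each new page is another RPC,
# and its entities are materialized in memory as they arrive
for page in results.pages:
    entities = list(page)
    print(len(entities), results.next_page_token)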

Unfortunately, beyond that I'm not sure, since as you dig through the source code you pretty quickly run into the gRPC stubs, and it's unclear whether the problem is in there.

There is this question that is similar to yours, where the asker found a memory leak involving instantiating datastore.Client(): How should I investigate a memory leak when using Google Cloud Datastore Python libraries?

That was ultimately linked to an issue in gRPC where gRPC leaks if it doesn't get closed: https://github.com/grpc/grpc/issues/22123
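The gist of that issue is that a gRPC channel holds on to resources (sockets, buffers) until it is explicitly closed. A generic sketch, not specific to the Datastore client's internals:

import grpc

# a channel that is created but never closed can leak; closing it
# explicitly (or using it as a context manager) releases its resources
with grpc.insecure_channel('localhost:50051') as channel:
    pass  # build stubs on this channel and make calls here
# channel.close() is invoked automatically when the with-block exits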

Hopefully this points you in the right direction.

@Alex in the other answer did some pretty good research, so I will follow up with this recommendation: try using the NDB library. All calls with this library have to be wrapped in a context manager, which should guarantee cleanup after it closes. That could help fix your problem:

from google.cloud import ndb

class MyModel(ndb.Model):
    my_column = ndb.StringProperty()

ndb_client = ndb.Client(**init_client)  # init_client: your client settings (project, credentials, ...)

with ndb_client.context():
    query = MyModel.query().order(MyModel.my_column)
    sources = query.fetch()
    for source in sources:
        pass

# if you try to query Datastore outside the context manager, it will raise an error
query = MyModel.query().order(MyModel.my_column)
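Applied to the route from the question, a minimal sketch could look like this (the Source model mirroring the kind='source' / list_sequence fields is my assumption):

from flask import Flask
from google.cloud import ndb

app = Flask(__name__)
ndb_client = ndb.Client()

class Source(ndb.Model):
    list_sequence = ndb.IntegerProperty()

    @classmethod
    def _get_kind(cls):
        return 'source'  # match the lowercase kind used in the question

@app.route('/')
def root():
    # open a fresh NDB context per request; it is cleaned up on exit
    with ndb_client.context():
        sources = Source.query().order(Source.list_sequence).fetch()
        for source in sources:
            pass
    return 'OK'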
