
Datastore NDB best practices when querying and extracting thousands of rows

I'm using the High Replication Datastore, along with ndb. I have a kind with over 27,000 entities, which isn't that much. Supposedly the datastore is efficient at querying and extracting large amounts of data, but whenever I query over that kind, queries take a long time to finish (I've even gotten DeadlineExceededErrors).

I have a model where I store keywords and URLs I want to index in Google:

class Keywords(ndb.Model):
    keyword = ndb.StringProperty(indexed=True)
    url = ndb.StringProperty(indexed=True)
    number_articles = ndb.IntegerProperty(indexed=True)
    # Some other attributes... All attributes are indexed

My current use cases are to build my sitemap, and to fetch my top 20 keywords to link from my home page.

When I fetch many entities, I usually do:

Keywords.query().fetch() # For the sitemap, as I want all of the urls
Keywords.query(Keywords.number_articles > 5).fetch() # For the homepage, I want to link to keywords with more than 5 articles

Is there a better way to extract data?

I've tried indexing the data into the Search API, and I've seen huge speed gains. Even though this works, I don't think it's ideal to replicate data from the Datastore into the Search API with basically the same fields.

Thanks in advance!

Datastore query speed depends on the number of results returned, not on the number of records stored. You say:

to build my Sitemap, and to fetch my top 20 keywords

If that's the case, add limit=20 to both fetches. If you're going to iterate over the results rather than build a list, use run() instead, as per the docs:

https://developers.google.com/appengine/docs/python/datastore/queryclass#Query_fetch
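In ndb, for example, the limit is just the first argument to fetch(); a minimal sketch (the descending sort is an assumption about what "top 20" means here):

    # Only pull the 20 keywords the home page actually needs, instead
    # of materializing the whole kind. With an inequality filter and a
    # sort on the same property, the built-in index suffices.
    top_keywords = (Keywords.query(Keywords.number_articles > 5)
                    .order(-Keywords.number_articles)
                    .fetch(20))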

I would split this functionality.

For the home page you can use your second query, but add the limit=20 parameter, as advised by Bruyere. Such a request should run very fast if you have the right index.

The sitemap is a bigger issue. Usually, to process a large number of entities, you use MapReduce. It's probably a good idea, but only if you don't get too many requests to the sitemap. It can also be the only solution if you update Keywords entities often and want the sitemap to be as up to date as possible.
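Wiring up the full MapReduce library is beyond a short sketch, but the same idea, processing the kind in bounded chunks so no single RPC has to return everything, can be shown with plain ndb query cursors (the batch size and function name are illustrative):

    def iter_all_keywords(batch_size=500):
        # Walk the whole kind page by page with a cursor, so each
        # datastore RPC returns at most batch_size entities.
        cursor = None
        more = True
        while more:
            page, cursor, more = Keywords.query().fetch_page(
                batch_size, start_cursor=cursor)
            for keyword in page:
                yield keyword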

Another option is to generate the sitemap in a task, save it as a blob, and serve that blob in the request. That is really quick. If your updates to the Keywords entities are not very frequent, you can run this task after every update. If you have many updates, you can schedule the task to run periodically in cron. Since you've had success with the Search API, this is probably the best option for you.
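A minimal sketch of that approach, assuming webapp2 handlers and a hypothetical SitemapBlob singleton entity standing in for the "blob" (a single TextProperty tops out at 1 MB, so a larger sitemap would need to be split into a sitemap index or stored in Blobstore/Cloud Storage):

    import webapp2
    from google.appengine.ext import ndb

    SITEMAP_ID = 'sitemap'  # hypothetical singleton entity id

    class SitemapBlob(ndb.Model):
        # Pre-rendered sitemap XML; unindexed, up to 1 MB per entity.
        xml = ndb.TextProperty()

    class RebuildSitemap(webapp2.RequestHandler):
        """Task/cron target: regenerate the sitemap. Push tasks get a
        10-minute deadline, so iterating 27k+ entities is fine here
        even though it isn't in a user-facing request."""
        def post(self):
            # Projection query: reads url straight out of the index
            # instead of loading full Keywords entities.
            qry = Keywords.query(projection=[Keywords.url])
            entries = ''.join('<url><loc>%s</loc></url>' % k.url
                              for k in qry)
            xml = ('<?xml version="1.0" encoding="UTF-8"?>'
                   '<urlset xmlns='
                   '"http://www.sitemaps.org/schemas/sitemap/0.9">'
                   + entries + '</urlset>')
            SitemapBlob(id=SITEMAP_ID, xml=xml).put()
        get = post  # cron issues GET requests

    class ServeSitemap(webapp2.RequestHandler):
        """User-facing /sitemap.xml: a single get() of the stored blob."""
        def get(self):
            blob = ndb.Key(SitemapBlob, SITEMAP_ID).get()
            self.response.headers['Content-Type'] = 'application/xml'
            self.response.write(blob.xml if blob else '')

Triggering it is then a matter of calling taskqueue.add(url='/tasks/rebuild_sitemap') after an update, or adding a cron entry for the same URL (the URL and schedule are placeholders).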

Generally speaking, I don't think it's a good idea to use the datastore to retrieve large amounts of data. I recommend looking at least at the Datastore comparison with traditional databases. It's designed to handle large databases, but not necessarily large result sets. I would say that the datastore is designed to handle large numbers of small requests.
