Overcoming 30 sub-query limit for google datastore

Google Datastore started off looking so good and has become so frustrating, but maybe that's just because I'm used to relational databases. I'm pretty new to Datastore and NoSQL in general, and I've done a ton of research but can't seem to find a solution to this problem.

Assume I have a User class that looks like this:

@Entity
class User {
  @Id
  Long id;
  String firstName, lastName;
  List<Key<User>> friends;  // keys of this user's friends

  List<Key<User>> getFriends() { return friends; }
}

I have another class that models events that users have performed, like so:

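@Entity
class Event {
  @Id Long id;
  @Index Key<User> user;      // filtered on in the query, so it must be indexed
  @Index Date eventTime;      // sorted on in the query, so it must be indexed
  List<Key<User>> receivers;
}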

Now I'm trying to query for events that my friends have created. In a relational database I would write:

select * from Event where user in (select friends from User where id = ?)

Taking that as a starting point, I tried:

// Key<User> userKey = ...
User user = ofy().load().key(userKey).now();
List<Key<User>> friends = user.getFriends();
List<Event> events = ofy().load().type(Event.class)
    .filter("user in", friends)
    .order("-eventTime")
    .list();

But I have read about the limit of 30 sub-queries for an 'in' filter, which makes this unsustainable, since eventually someone will have more than 30 friends. On top of that, using an 'in' clause means you cannot get a cursor to continue loading events. I've done a lot of research and tried many options, but have yet to find a good way to approach this problem, except to say "why Google, why."

Things I've considered:

  • Add an extra field to Event that is a copy of the user's friend list, and use a single equality filter on that multi-valued property (MVP) to find events. This is extremely wasteful, since there may be a very large number of events.
  • Split the event query into batches of 30 friends at a time, find some way to continue retrieval from a synthetic cursor based on time, and merge the results. The problem is that there are far too many edge cases, and it makes reading events very difficult.
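To make the second option concrete, here is a minimal sketch of just the batching and merging steps, with no datastore calls. The class `FriendBatches`, its method names, and the constant `MAX_IN_VALUES` are hypothetical, and the value 30 is taken from the limit described above. The synthetic cursor is deliberately left out, since that is where the edge cases live.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class FriendBatches {
    // Assumed limit, taken from the question: at most 30 values per 'in' filter.
    static final int MAX_IN_VALUES = 30;

    /** Splits the list into consecutive chunks of at most `size` elements. */
    static <T> List<List<T>> partition(List<T> items, int size) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += size) {
            int end = Math.min(i + size, items.size());
            chunks.add(new ArrayList<>(items.subList(i, end)));
        }
        return chunks;
    }

    /** Combines per-batch results into one list, newest first, keeping at most `limit`. */
    static <E> List<E> mergeNewestFirst(List<List<E>> perBatch, Comparator<E> byTime, int limit) {
        List<E> all = new ArrayList<>();
        for (List<E> batch : perBatch) {
            all.addAll(batch);
        }
        all.sort(byTime.reversed());
        return new ArrayList<>(all.subList(0, Math.min(limit, all.size())));
    }
}
```

Each chunk returned by `partition` would feed one 'in' query, so a user with 65 friends needs three queries per page. Every one of those queries has to fetch a full page of results before the merge can decide which events are actually the newest.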

I would really appreciate any input you could offer, since I am 100% out of ideas.

TL;DR ~ GAE has limit on how many items an in-clause can handle and fml.

You come from a relational database background, so the concept of denormalization is probably a bit painful - I know it was for me.

Right now you have a single table that contains all events from all users. This approach works well in relational databases but is a nightmare in the datastore for the reasons you named.

So to solve this concrete problem you could restructure your data as follows:

  • Every user has two timelines: one for their own posts and one for their friends' posts. (There could be a third timeline for public content.)
  • When a new event is published, it is written to the timeline of the user who created it and to the timelines of all receiving users. (You may want to store references to those third-party timeline entries in the author's timeline, so you know what to delete when the user decides to delete an event.)
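The write path described in these two points can be sketched with in-memory maps standing in for datastore entities. Everything here (`TimelineFanOut`, `Entry`, `publish`, `friendsFeed`) is a hypothetical name chosen for illustration, not part of Objectify or the datastore API. In a real application each entry would be its own entity, keyed or parented by the owning user.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TimelineFanOut {
    /** One row in a timeline; in the datastore this would be its own entity. */
    static final class Entry {
        final long eventId;
        final long authorId;
        final long eventTimeMillis;

        Entry(long eventId, long authorId, long eventTimeMillis) {
            this.eventId = eventId;
            this.authorId = authorId;
            this.eventTimeMillis = eventTimeMillis;
        }
    }

    // userId -> entries; in-memory stand-ins for the two per-user timelines.
    final Map<Long, List<Entry>> ownTimeline = new HashMap<>();
    final Map<Long, List<Entry>> friendsTimeline = new HashMap<>();

    /** Writes the event to the author's own timeline and to every receiver's friends timeline. */
    void publish(long eventId, long authorId, long eventTimeMillis, List<Long> receiverIds) {
        Entry entry = new Entry(eventId, authorId, eventTimeMillis);
        ownTimeline.computeIfAbsent(authorId, k -> new ArrayList<>()).add(entry);
        for (long receiverId : receiverIds) {
            friendsTimeline.computeIfAbsent(receiverId, k -> new ArrayList<>()).add(entry);
        }
    }

    /** Reads one user's friends timeline, newest first; a single lookup, no 'in' filter. */
    List<Entry> friendsFeed(long userId) {
        List<Entry> feed = new ArrayList<>(friendsTimeline.getOrDefault(userId, new ArrayList<Entry>()));
        feed.sort((a, b) -> Long.compare(b.eventTimeMillis, a.eventTimeMillis));
        return feed;
    }
}
```

The cost has moved from the read to the write: `publish` does one write per receiver, while `friendsFeed` only ever touches one user's data, however many friends that user has.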

Now every user has access to complete timelines: their own, and the one built from other users' events. These timelines are easy to query, and you will not need sub-selects at all.

There are downsides to this approach:

  1. Write cost is higher. You have to write to many more timelines than before. You will probably have to do this in a task queue to have enough time to write to all of them.
  2. You use a lot more storage, but storage is really cheap. I'm guessing the storage will be cheaper than running expensive queries in the long run.

What you get in return for this denormalization is lightning-fast responses from simple queries. All that remains is to merge the responses from the different timelines in the UI (you can do it on the server side, but I would do it in the UI).
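That final merge is small wherever it runs. As a rough sketch, assuming each timeline query already returns its results newest first, and reducing each event to its timestamp in epoch milliseconds for brevity (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

final class TimelineMerge {
    /**
     * Merges two lists of event timestamps (epoch millis), each already sorted
     * newest first, into one newest-first page of at most `limit` items.
     */
    static List<Long> mergeNewestFirst(List<Long> own, List<Long> friends, int limit) {
        List<Long> page = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (page.size() < limit && (i < own.size() || j < friends.size())) {
            // Take from `own` when `friends` is exhausted or `own` has the newer head.
            boolean takeOwn = j >= friends.size()
                    || (i < own.size() && own.get(i) >= friends.get(j));
            page.add(takeOwn ? own.get(i++) : friends.get(j++));
        }
        return page;
    }
}
```

Because both inputs are already ordered, the merge is a single linear pass and can stop as soon as the page is full.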
