How to index this data for Elasticsearch

I'm essentially trying to mimic joins in ES, and I know this is not a great use case for ES, but at the same time, what I'm trying to accomplish doesn't seem out of the ordinary for a search feature. I've read through the ES docs and blog posts, drawn diagrams, and of course tested different scenarios locally, but I'm still having a hard time wrapping my head around how to index this data for ES. This is my first ES project and my first real interaction with a NoSQL-style environment.

Imagine a social recipe site (for simplicity)...

Users can post original recipes. Other users can "like" a recipe (once each), "save" it (multiple saves to different categories), and "cook" it (a recipe can be cooked multiple times).

Users can search recipes and filter them on simple flags such as "this recipe has been cooked at least once", as well as on whether or not they've liked, saved, and/or cooked the recipe. Additionally, when you view a user's profile, you can search the recipes that they have liked, saved, and cooked, as well as the recipes that user has created.

The current setup, which works, but does not seem to be scalable, is that a Recipe is indexed with its various yes/no flags, as well as one field each for liked_by_users, saved_by_users, and cooked_by_users. These fields hold an array of user_ids who have taken any of those actions on a recipe. Then, when I want to filter, I pass the user_id (or user_ids if you want to see what any of your friends have cooked, for example) and filter results on whether or not the id shows up in the relevant array(s). However, if there can ultimately be millions of these interactions, it doesn't seem like storing and searching this way is great. I could also store the recipe ids on the User but in the end, it seems like I would end up with a similar problem and I would have the added hassle of needing to query those ids from the User first.
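
Concretely, the filter looks something like this (a rough sketch; the index name recipes and the user ids are just for illustration, the liked_by_users field is the array described above):

GET /recipes/_search
{
  "query": {
    "bool": {
      "filter": [
        { "terms": { "liked_by_users": [123, 456, 789] } }
      ]
    }
  }
}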

What I have been trying and/or thinking about:

  1. Denormalize everything. I think this is the preferred ES way, but I'm worried that it means a huge amount of duplicated data (recipe titles, content, categories, etc. all need to be searchable), some of which changes frequently. For example, if a user likes a recipe, the like count of that recipe is updated so that results can be sorted by like count.

I believe this would require creating a copy of the recipe for every user who has interacted with it and then storing the interactions there: a flag for liked, an array of data for the categories it has been saved to, and an array of data for the times it has been cooked. I believe I would still need to pass in an array of user_ids to filter on if someone was filtering by anything their friends had cooked, but I don't think users will have millions of friends; more likely under 200. Is that still too many ids to pass in? Is storing that much data TOO much data? Also, the fact that some of these fields may be frequently updated makes this sound extra awful.

  2. Nesting recipes under a User also doesn't sound correct, since the entire parent document needs to be reindexed whenever anything in it is updated.

  3. In the docs, parent/child sounded like an option of last resort, and it also doesn't sound quite right for this use case.

  4. I've thought about pulling the ids to filter on from MySQL (i.e. the recipe ids the user has interacted with) and passing those to ES; see the sketch after this list. However, first, MySQL can only concatenate so many ids (and similarly, I'm unsure whether it would be wise to build them into a string in code if they're too long for MySQL), and second, I'm not sure this is an efficient way to filter the ES results (too much data).
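
The query for option 4 would be something along these lines, with the id list coming back from MySQL first (again just a sketch; the field name title is illustrative):

GET /recipes/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "pasta" } }
      ],
      "filter": [
        { "ids": { "values": ["recipe-id-1", "recipe-id-2", "..."] } }
      ]
    }
  }
}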

I've experimented with some other things, such as indexing the relationships between users and recipes separately, but everything just seems to come down to crazy town.

I also don't have a good sense of how much is too much for ES. Reading through the docs, there are mentions of "this isn't a good idea if you have many XYZ", but I don't know what "many" means in these cases. The only concrete figure I found was about updating user names in denormalized blog posts, where updating "a few thousand" documents would take less than a second. Are there any rules of thumb I can use to estimate how big is too big for things like data stored in a field, data passed in to filter by, or docs to update?

This is quite tricky to implement on Elasticsearch, as the entities (users, recipes, categories, ...) are linked together in various ways, and updating this data at high throughput without race conditions isn't trivial.

Are categories shared among users? I mean when a recipe is saved to a category (like tagging), is this information visible to everybody? If so, this structure should get you started.

Sounds like you should have two types of documents: recipes, and cooking actions (one document per user/recipe pair).

Recipe structure:

{
  "_id": "rga9gia0934gau90",        // could be auto-generated by ES
  "created_by": 123,                // user id
  "contents": "Pour x grams of sugar...",
  "ingredients": ["sugar", "..."],
  "tags": ["unhealthy", "sweet", "..."]
}
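
A possible mapping for the recipe type could look like this (just a sketch in current, typeless mapping syntax; the index name recipes and the exact field types are assumptions):

PUT /recipes
{
  "mappings": {
    "properties": {
      "created_by":  { "type": "keyword" },
      "contents":    { "type": "text" },
      "ingredients": { "type": "keyword" },
      "tags":        { "type": "keyword" }
    }
  }
}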

Cooking dates structure:

{
  "_id": "123-rga9gia0934gau90",    // generated as {user_id}-{recipe_id}
  "user_id": 123,
  "recipe_id": "rga9gia0934gau90",
  "cooked_at_dates": ["2017-01-02", "2017-01-07"],
  "cooked_n_times": 2
}
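
Recording another cooking event then touches only that one document. A sketch using the Update API with an upsert (the index name cooking_actions is an assumption, and the URL format varies slightly between ES versions):

POST /cooking_actions/_update/123-rga9gia0934gau90
{
  "script": {
    "source": "ctx._source.cooked_at_dates.add(params.date); ctx._source.cooked_n_times += 1",
    "params": { "date": "2017-01-09" }
  },
  "upsert": {
    "user_id": 123,
    "recipe_id": "rga9gia0934gau90",
    "cooked_at_dates": ["2017-01-09"],
    "cooked_n_times": 1
  }
}

If the document doesn't exist yet, the upsert body is indexed as-is; otherwise the script appends the date and increments the counter.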

This way most updates are local to a single document. However, some queries, such as "sweet recipes user X has not cooked yet", require two ES queries: the first to get the recipe ids of all recipes the user has cooked, and the second to find all sweet recipes which don't have any of the listed ids. This wouldn't scale to tens of thousands of cooked recipes per user, but should work fine for hundreds or thousands.
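
A sketch of those two queries, using the index names assumed above:

GET /cooking_actions/_search
{
  "query": { "term": { "user_id": 123 } },
  "_source": ["recipe_id"],
  "size": 1000
}

Then, with the recipe_id values collected from the first response:

GET /recipes/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "tags": "sweet" } }
      ],
      "must_not": [
        { "ids": { "values": ["rga9gia0934gau90", "..."] } }
      ]
    }
  }
}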
