
What is the best approach for database queries that return results similar to Twitter's feed of tweets by people you follow?

My website lets users submit posts and subscribe to posts by other people. The homepage of the site displays the most recent posts by the people the user follows. There is no limit to the number of people a user can follow. Some users are following thousands of other users. Some users have made more than 15,000 posts.

The posts database table is organized like this (a few irrelevant columns are omitted for clarity):

id
author_id
post_content
date_time

I have 2 working solutions, but I'm not sure if either is the best approach:

Solution 1:

  1. Get the list of author_ids a user is following.
  2. Query the table for posts that match any of the author_ids:

      SELECT id FROM posts WHERE author_id IN (12, 34, 56, 78, 90, ...) ORDER BY date_time DESC LIMIT 100;
  3. Cache the result for N minutes.

This works, but crawls when users are following thousands of people.

Solution 2:

  1. Get the list of author_ids a user is following.
  2. For each author id, get the cached feed of just their post ids. (This feed is used on an author's page.)
  3. Merge all the post ids from all of these authors into one giant array and sort them in descending order (which happens to work because each post gets an auto-incremented id).
  4. Cache and return the most recent 100 post ids.

This works, but sometimes crawls when thousands of user feeds are returned and merged into an array with 100,000+ items. It feels like overkill when all I care about is the most recent 100 items. Additionally, not all user feeds will be in cache. Some old users may no longer use the site but are still followed by new users, so the old user's feed has to be freshly queried (and then cached).

Are these the optimal solutions? If not, what is?

What about (untested, but you get the idea):

SELECT posts.id FROM posts
INNER JOIN followers ON posts.author_id = followers.user_id
WHERE followers.followed_by_user_id = INSERT_USER_ID_HERE
ORDER BY posts.date_time DESC
LIMIT 100;

or

SELECT id FROM posts
WHERE author_id IN (
  SELECT user_id FROM followers 
  WHERE followed_by_user_id = INSERT_USER_ID_HERE
)
ORDER BY date_time DESC
LIMIT 100;

Note: to clarify, the table followers contains two columns, user_id and followed_by_user_id. If a row contains the values (user_id: 7, followed_by_user_id: 42), it means that user 42 follows user 7.
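
To make the join concrete, here is a minimal sketch that runs the first query against an in-memory SQLite database from Python. The table and column names follow the queries above; the sample rows, the composite primary key on followers, and the helper name feed_post_ids are assumptions made only for illustration, and a production database may behave differently.

```python
import sqlite3

# Hypothetical in-memory database; column names follow the queries in the post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        author_id INTEGER NOT NULL,
        post_content TEXT,
        date_time TEXT NOT NULL
    );
    CREATE TABLE followers (
        user_id INTEGER NOT NULL,
        followed_by_user_id INTEGER NOT NULL,
        PRIMARY KEY (followed_by_user_id, user_id)  -- assumed key, not from the post
    );
""")

# Sample data: user 42 follows users 7 and 9; user 1 follows user 8.
conn.executemany(
    "INSERT INTO followers (user_id, followed_by_user_id) VALUES (?, ?)",
    [(7, 42), (9, 42), (8, 1)],
)
conn.executemany(
    "INSERT INTO posts (author_id, post_content, date_time) VALUES (?, ?, ?)",
    [
        (7, "first", "2024-01-01 10:00:00"),
        (8, "not followed by 42", "2024-01-01 11:00:00"),
        (9, "second", "2024-01-01 12:00:00"),
        (7, "third", "2024-01-01 13:00:00"),
    ],
)


def feed_post_ids(conn, user_id, limit=100):
    """Return the newest post ids written by authors that user_id follows."""
    rows = conn.execute(
        """
        SELECT posts.id FROM posts
        INNER JOIN followers ON posts.author_id = followers.user_id
        WHERE followers.followed_by_user_id = ?
        ORDER BY posts.date_time DESC
        LIMIT ?
        """,
        (user_id, limit),
    ).fetchall()
    return [row[0] for row in rows]


print(feed_post_ids(conn, 42))  # [4, 3, 1]
```

Passing the user id as a bound parameter replaces the INSERT_USER_ID_HERE placeholder and avoids building the SQL string by hand.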

An optimization for your Solution 2 which avoids merging and sorting all the post ids:

  1. Create an array to hold the result, copy the first author's top-100 post ids into it, and sort by id.
  2. For each remaining author:
    1. Check whether the minimum id in the result array is greater than the maximum id of the author's posts.
    2. If yes, and the result array already holds 100 ids, skip that author, since all of their posts are older than the posts in your result array.
    3. If no, merge the author's top-100 posts with your result array, sort, and retain only the top 100 posts.
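
The steps above can be sketched in Python roughly as follows. This is only an illustration: the function name merge_feeds is made up, and it assumes each author's cached feed is a list of post ids already sorted newest first, so the first element is that author's maximum id.

```python
def merge_feeds(author_feeds, limit=100):
    """Merge per-author post id lists into the newest `limit` ids.

    Assumes every feed is sorted in descending order (newest id first).
    """
    result = []  # kept sorted in descending order, at most `limit` long
    for feed in author_feeds:
        if not feed:
            continue  # author has no posts
        # Skip only when the result is already full and this author's
        # newest post is older than the oldest post currently kept.
        if len(result) >= limit and result[-1] > feed[0]:
            continue
        # Otherwise merge the author's top posts and trim back to `limit`.
        result = sorted(result + feed[:limit], reverse=True)[:limit]
    return result


feeds = [[50, 40, 30], [20, 10], [60, 35], []]
print(merge_feeds(feeds, limit=3))  # [60, 50, 40]
```

With this approach the working array never grows beyond 200 ids, instead of holding every followed author's posts at once.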

Also, you could maintain an array with the maximum post id of every author. Before fetching the top-100 posts of an author, you could check this array. This will avoid fetching/caching the posts of inactive users.
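
A rough sketch of that check, again in Python. The names build_feed, max_post_id, and fetch_top_posts are hypothetical; max_post_id is assumed to be a dict from author id to that author's newest post id, and fetch_top_posts stands in for whatever cache or database lookup returns an author's ids newest first. Visiting authors in order of newest post is an extra assumption, not part of the suggestion above; it allows the loop to stop early instead of only skipping.

```python
def build_feed(followed_author_ids, max_post_id, fetch_top_posts, limit=100):
    """Build the newest `limit` post ids, fetching feeds only when needed.

    max_post_id: dict of author id -> newest post id (authors without
    posts are simply absent). fetch_top_posts(author_id) returns that
    author's post ids sorted newest first.
    """
    result = []
    # Authors with no known posts never need to be fetched.
    candidates = [a for a in followed_author_ids if a in max_post_id]
    # Added assumption: visit the most recently active authors first.
    candidates.sort(key=lambda a: max_post_id[a], reverse=True)
    for author_id in candidates:
        # Once the result is full and this author's newest post is older
        # than everything kept, all remaining authors are older too.
        if len(result) >= limit and max_post_id[author_id] < result[-1]:
            break
        posts = fetch_top_posts(author_id)[:limit]
        result = sorted(result + posts, reverse=True)[:limit]
    return result


# Tiny usage example with an in-memory stand-in for the cache.
feeds = {1: [50, 40, 30], 2: [20, 10], 3: [60, 35]}
newest = {author: ids[0] for author, ids in feeds.items()}
print(build_feed([1, 2, 3, 4], newest, lambda a: feeds[a], limit=3))  # [60, 50, 40]
```

In the example, author 2's feed is never fetched, because their newest post (20) is older than the oldest id kept in the full result.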


For Solution 1, ordering by id will be a bit faster than ordering by date_time. Since each post gets an auto-incremented id, the resulting order is the same.
