Performance of search query bottlenecked 98% by mutual friends despite caching

Question

On my social networking website, which is similar to Facebook, search speed is bottlenecked about 98% by one part: I want to rank the results by the number of mutual friends the searching user has with each result (we can assume all results are users).

My friends table has 3 columns -

user_id (person who sends the request)
friend_id (person who receives the request)
pending (boolean to indicate if the request was accepted or not)

user_id and friend_id are both foreign keys that reference users.id

Finding the friend_ids of a user is simple. It looks like this:

def friends
  Friend.where(
    '(user_id = :id OR friend_id = :id) AND pending = false',
     id: self.id
  ).pluck(:user_id, :friend_id)
   .flatten
   .uniq
   .reject { |id| id == self.id }
end

So, after getting the results that match the search query, ranking them by mutual friends requires the following steps -

Get the user_ids of all the searching user's friends - Set(A). The friends method above does this.
Loop over each of the ids in the search results -
- Get the user_ids of all the friends of |id| - Set(B). Again, done by the friends method.
- Find the length of the intersection of Set(A) and Set(B).
Order the results in descending order of intersection length.
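
The steps above can be sketched in plain Ruby as follows. This is only an illustration: rank_by_mutual_friends and the friends_by_user hash are hypothetical names, and the friend ids are assumed to have been loaded already.

```ruby
require 'set'

# Rank search results by how many friends each one shares with the searcher.
# searcher_friend_ids: ids of the searching user's friends (Set A).
# friends_by_user: hypothetical Hash of result user id => Array of friend ids.
def rank_by_mutual_friends(searcher_friend_ids, friends_by_user)
  set_a = searcher_friend_ids.to_set
  counts = friends_by_user.map do |user_id, friend_ids|
    # Size of the intersection of Set A and this result's friends (Set B).
    [user_id, (set_a & friend_ids.to_set).size]
  end
  # Highest mutual-friend count first; ties are broken by user id.
  counts.sort_by { |user_id, count| [-count, user_id] }
end
```

The method returns an array of [user_id, mutual_count] pairs, so the caller can reorder the already-loaded search results without touching the database again.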

The most expensive operation here is obviously getting the friend_ids of hundreds of users, so I cached the friend_ids of all users to speed it up. The difference in performance was amazing, but I'm curious whether it can be improved further.

I'm wondering if there is an efficient way to get the friend_ids of all the desired users in a single query. Something like -

SELECT user_id, [array of friend_ids of the user with id = user_id]
FROM friends
....

Can someone help me write a fast SQL or ActiveRecord query for this?

That way I can store the user_ids of all the search results and their corresponding friend_ids in a hash or some other fast data structure, and then perform the same ranking operation described above. Since I won't be hitting the cache for thousands of users and their friend_ids, I think it will speed up the process significantly.
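
A minimal sketch of the hash described here, assuming the accepted friendships for the search results have already been fetched as [user_id, friend_id] rows. The helper name friend_ids_by_user is hypothetical, and the ActiveRecord call in the comment is an untested assumption about how the rows might be loaded.

```ruby
require 'set'

# Build { user_id => Set of friend ids } for the wanted users from
# [user_id, friend_id] rows of accepted friendships.
# In Rails the rows might come from one query such as (assumption, untested):
#   Friend.where(pending: false)
#         .where('user_id IN (:ids) OR friend_id IN (:ids)', ids: wanted_ids)
#         .pluck(:user_id, :friend_id)
def friend_ids_by_user(rows, wanted_ids)
  wanted = wanted_ids.to_set
  map = {}
  rows.each do |user_id, friend_id|
    # Each row describes the friendship in both directions.
    (map[user_id] ||= Set.new) << friend_id if wanted.include?(user_id)
    (map[friend_id] ||= Set.new) << user_id if wanted.include?(friend_id)
  end
  map
end
```

Only users that appear in at least one row get a key, so a search result with no accepted friends is simply absent from the hash.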

Answer 1

Caching your friends table in RAM is not a viable approach if you expect your site to grow to large numbers of users, but I'm sure it does great for a smallish number of users.

It is to your advantage to get the most work you can out of the database with as few calls as possible. It is inefficient to issue large numbers of queries, as the overhead per query is comparatively large. Moreover, databases are built for the kind of task you're trying to perform. I think you are doing far too much work on the Ruby side, and you ought to let the database do the kind of work it does best.

You did not give many details, so I decided to start by defining a minimal model DB:

create table users (
  user_id int not null primary key,
  nick varchar(32)
  );

create table friends (
  user_id int not null,
  friend_id int not null,
  pending bool,
  primary key (user_id, friend_id),
  foreign key (user_id) references users(user_id),
  foreign key (friend_id) references users(user_id),
  check (user_id < friend_id)
  );

The check constraint on friends avoids the same pair of users being listed in the table in both orders, and of course the PK prevents the same pair from being enrolled multiple times in the same order. The PK also automatically has a unique index associated with it.
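
With that check constraint in place, application code has to put the smaller id first before inserting a row. A minimal sketch, assuming integer ids; canonical_pair is a hypothetical helper name, not part of any library:

```ruby
# Order a pair of user ids so it satisfies check (user_id < friend_id).
def canonical_pair(a, b)
  raise ArgumentError, 'a user cannot befriend themselves' if a == b
  a < b ? [a, b] : [b, a]
end

user_id, friend_id = canonical_pair(42, 7)
```

Ordering the pair this way no longer records which user sent the request, so if that matters it would need to be stored in a separate column.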

Since I suppose the 'is a friend of' relation is supposed to be logically symmetric, it is convenient to define a view that presents that symmetry:

create view friends_symmetric (user_id, friend_id) as (
  select user_id, friend_id from friends where not pending
  union all
  select friend_id, user_id from friends where not pending
  );

(If friendship is not symmetric then you can drop the check constraint and the view, and use table friends in place of friends_symmetric in what follows.)

As a model query whose results you want to rank, then, I take this:

select * from users where nick like 'Sat%';

The objective is to return result rows in descending order of the number of friends each hit has in common with User1, the user on whose behalf the query is run. You might do that like so:

(Update: modified this query to filter out duplicate results.)

select *
from (
    select
      u.*,
      count(mutual.shared_friend_id) over (partition by u.user_id) as num_shared,
      row_number() over (partition by u.user_id) as copy_num
    from 
      users u
      left join (
          select
            f1.friend_id as shared_friend_id,
            f2.friend_id as friend_id
          from friends_symmetric f1
            join friends_symmetric f2
              on f1.friend_id = f2.user_id
          where f1.user_id = ?
            and f2.friend_id != f1.user_id
        ) mutual
        on u.user_id = mutual.friend_id
    where u.nick like 'Sat%'
  ) all_rows
where copy_num = 1
order by num_shared desc

where the ? is a placeholder for a parameter containing the ID of User1.

Edited to add:

I have structured this query with window functions instead of an aggregate query with the idea that such a structure will be easier for the query planner to optimize. Nevertheless, the inline view "mutual" could instead be structured as an aggregate query that computes the number of shared friends that the searching user has with every user that shares at least one friend, and that would permit one level of inline view to be avoided. If performance of the provided query is or becomes inadequate, then it would be worthwhile to test that variant.
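
For reference, the aggregate variant might look something like the sketch below, kept as a SQL string in Ruby so it can be passed to the database from the application. It assumes the users table and friends_symmetric view defined above, it has not been benchmarked, and the constant name and the second placeholder for the nick pattern are choices made for this illustration.

```ruby
# Aggregate variant of the ranking query, kept as a SQL string.
# Assumes the users table and friends_symmetric view defined earlier.
MUTUAL_FRIENDS_AGGREGATE_SQL = <<~SQL
  select u.*, coalesce(mutual.num_shared, 0) as num_shared
  from users u
    left join (
        -- one row per user who shares at least one friend with the searcher
        select f2.friend_id, count(*) as num_shared
        from friends_symmetric f1
          join friends_symmetric f2
            on f1.friend_id = f2.user_id
        where f1.user_id = ?
          and f2.friend_id != f1.user_id
        group by f2.friend_id
      ) mutual
      on u.user_id = mutual.friend_id
  where u.nick like ?
  order by num_shared desc
SQL

# Possible use from Rails (assumption: a User model mapped to this table):
#   User.find_by_sql([MUTUAL_FRIENDS_AGGREGATE_SQL, searcher_id, 'Sat%'])
```

Because the grouping happens inside the inline view, each user joins to at most one row of mutual, so the copy_num filter from the window-function version is not needed.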

There are other ways to approach the problem of performing the sorting in the DB, some of which may perform better, and there may be ways to improve the performance of each by tweaking the database (adding indexes or constraints, modifying table definitions, computing db statistics, ...).

I cannot predict whether that query will outperform what you're doing now, but I assure you that it scales better, and it is easier to maintain.

Answer 2

Assuming that you want a relation of the User model whose primary key is id, you should be able to join onto a subquery that calculates the number of mutual friends:

class User < ActiveRecord::Base
  def other_users_ordered_by_mutual_friends
    self.class.select("users.*, COALESCE(f.friends_count, 0) AS friends_count").joins("LEFT OUTER JOIN (
      SELECT all_friends.user_id, COUNT(DISTINCT all_friends.friend_id) AS friends_count FROM (
        SELECT f1.user_id, f1.friend_id FROM friends f1 WHERE f1.pending = false
        UNION ALL
        SELECT f2.friend_id AS user_id, f2.user_id AS friend_id FROM friends f2 WHERE f2.pending = false
      ) all_friends INNER JOIN (
        SELECT DISTINCT f1.friend_id AS user_id FROM friends f1 WHERE f1.user_id = #{id} AND f1.pending = false
        UNION ALL
        SELECT DISTINCT f2.user_id FROM friends f2 WHERE f2.friend_id = #{id} AND f2.pending = false
      ) user_friends ON user_friends.user_id = all_friends.friend_id GROUP BY all_friends.user_id
    ) f ON f.user_id = users.id").where.not(id: id).order("friends_count DESC")
  end
end

The subquery selects all user IDs with associated friends and inner joins that to another select with all of the current user's friends' IDs. Since it groups by the user_id and selects the count, we get the number of mutual friends for each user_id. I have not tested this since I don't have any sample data, but it should work.

Since this returns a scope, you can chain other scopes/conditions to the relation:

current_user.other_users_ordered_by_mutual_friends.where(attribute1: value1).reorder(:attribute2)

The select scope as written will also give you access to the field friends_count on instances within the relation:

<%- current_user.other_users_ordered_by_mutual_friends.each do |user| -%>
  <p>User <%= user.id -%> has <%= user.friends_count -%> mutual friends.</p>
<%- end -%>

Answer 3

John had a great idea with the friends_symmetric view. With two filtered indexes (one on (friend_id, user_id) and the other on (user_id, friend_id)), it should work great. However, the query can be a bit simpler:

WITH user_friends AS (
  SELECT user_id, array_agg(friend_id) AS friends
  FROM friends_symmetric
  WHERE user_id = :user_id -- id of our user
  GROUP BY user_id
)
SELECT u.*
       ,array_agg(f.friend_id) AS shared_friends -- aggregated ids of friends in case they are needed for something
       ,count(*) AS shared_count
FROM user_friends AS uf
    JOIN friends_symmetric AS f
        ON f.user_id = ANY(uf.friends) AND f.friend_id = ANY(uf.friends)
    JOIN users AS u
        ON u.user_id = f.user_id
WHERE u.nick LIKE 'Sat%' -- nickname of our user's friend
GROUP BY u.user_id
ORDER BY shared_count DESC -- rank by number of shared friends


3 answers

solution1
1 2015-09-09 18:27:09

solution2
0 2015-09-09 18:47:08

solution3
0 2015-09-10 21:08:33
