What is the fastest ArangoDB friends-of-friends query (with count)

I'm trying to use ArangoDB to get a list of friends-of-friends. Not just a basic friends-of-friends list: I also want to know how many friends the user and each friend-of-a-friend have in common, and to sort the result by that count. After several attempts at (re)writing the best-performing AQL query, this is what I ended up with:

LET friends = (
  FOR f IN GRAPH_NEIGHBORS('graph', @user, {"direction": "any", "includeData": true, "edgeExamples": { name: "FRIENDS_WITH"}})
  RETURN f._id
)

LET foafs = (FOR friend IN friends
  FOR foaf in GRAPH_NEIGHBORS('graph', friend, {"direction": "any", "includeData": true, "edgeExamples": { name: "FRIENDS_WITH"}})
    FILTER foaf._id != @user AND foaf._id NOT IN friends
    COLLECT foaf_result = foaf WITH COUNT INTO common_friend_count
    RETURN {
      user: foaf_result,
      common_friend_count: common_friend_count
    }
)
FOR foaf IN foafs
  SORT foaf.common_friend_count DESC
  RETURN foaf

Unfortunately, performance is not as good as I would have liked. Compared to the Neo4j version of the same query (and data), AQL seems quite a bit slower (5-10x).

What I'd like to know is: how can I improve the query to make it perform better?

I am one of the core developers of ArangoDB and tried to optimize your query. As I do not have your dataset, I can only talk about my test dataset, and I would be happy to hear whether you can validate my results.

First of all, I am running ArangoDB 2.7, but in this particular case I do not expect a major performance difference compared to 2.6.

On my dataset I could execute your query as it is in ~7 sec. First fix: in your friends statement you use includeData: true but only return the _id. With includeData: false, GRAPH_NEIGHBORS directly returns the _id, and we can also get rid of the subquery here:

LET friends = GRAPH_NEIGHBORS('graph',
                              @user,
                              {"direction": "any",
                               "edgeExamples": {
                                   name: "FRIENDS_WITH"
                               }})

This got it down to ~1.1 sec on my machine, so I expect that this will be close to the performance of Neo4j.

Why does this have such a high impact? Internally we first find the _id value without actually loading the document's JSON. Your query does not need any of that data, so we can safely skip opening the documents.

But now for the real improvement

Your query goes the "logical" way: it first gets the user's neighbors, then finds their neighbors, counts how often each foaf is found, and sorts the result. This has to build up the complete foaf network in memory and sort it as a whole.

You can also do it in a different way:

1. Find all friends of the user (only _ids)
2. Find all foaf (complete documents)
3. For each foaf, find all foaf_friends (only _ids)
4. Find the intersection of friends and foaf_friends and COUNT them

The query looks like this:

LET fids = GRAPH_NEIGHBORS("graph",
                           @user,
                           {
                             "direction":"any",
                             "edgeExamples": {
                               "name": "FRIENDS_WITH"
                              }
                           }
                          )
FOR foaf IN GRAPH_NEIGHBORS("graph",
                            @user,
                            {
                              "minDepth": 2,
                              "maxDepth": 2,
                              "direction": "any",
                              "includeData": true,
                              "edgeExamples": {
                                "name": "FRIENDS_WITH"
                              }
                            }
                           )
  LET commonIds = GRAPH_NEIGHBORS("graph",
                                  foaf._id, {
                                    "direction": "any",
                                    "edgeExamples": {
                                      "name": "FRIENDS_WITH"
                                     }
                                  }
                                 )
  LET common_friend_count = LENGTH(INTERSECTION(fids, commonIds))
  SORT common_friend_count DESC
  RETURN {user: foaf, common_friend_count: common_friend_count}

In my test graph this executed in ~0.024 sec.

So this gave me an execution time that is faster by a factor of 250, and I would expect it to be faster than your current query in Neo4j. As I do not have your dataset I cannot verify that, so it would be good if you could try it and tell me.
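
To make the difference between the two strategies concrete, here is a minimal sketch in plain Python rather than AQL. It does not use any ArangoDB API: the toy edge list, the user ids, and the function names are all assumptions made up for illustration, and the model excludes the user and their direct friends explicitly, as the FILTER in the original query does. It only shows that counting hits per foaf and intersecting friend sets produce the same ranking.

```python
from collections import Counter

# Hypothetical toy data: undirected FRIENDS_WITH edges as pairs of user ids.
EDGES = [
    ("alice", "bob"), ("alice", "carol"), ("alice", "dave"),
    ("bob", "erin"), ("carol", "erin"), ("dave", "erin"),
    ("bob", "frank"), ("carol", "frank"),
    ("bob", "carol"),
]


def build_adjacency(edges):
    """Map each user to the set of direct friends (direction "any")."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def foaf_by_counting(adjacency, user):
    """Original strategy: expand every friend and count hits per foaf."""
    friends = adjacency.get(user, set())
    counts = Counter()
    for friend in friends:
        for foaf in adjacency.get(friend, set()):
            if foaf != user and foaf not in friends:
                counts[foaf] += 1
    # Highest count first, ties broken by id so the order is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def foaf_by_intersection(adjacency, user):
    """Rewritten strategy: per foaf, intersect its friends with the user's."""
    friends = adjacency.get(user, set())
    foafs = set()
    for friend in friends:
        foafs |= adjacency.get(friend, set())
    foafs -= friends | {user}  # drop the user and their direct friends
    result = [(foaf, len(friends & adjacency[foaf])) for foaf in foafs]
    return sorted(result, key=lambda kv: (-kv[1], kv[0]))
```

Calling either function with build_adjacency(EDGES) and "alice" yields erin with 3 common friends followed by frank with 2. The Python model says nothing about the timings reported above; those depend on how the database loads documents and edges.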

One last thing

The edgeExamples: {name: "FRIENDS_WITH"} option has the same effect as includeData: in this case we have to find the real edge and look into it. This could be avoided if you stored your edges in separate collections based on their name and then removed the edgeExamples as well. This will further increase performance (especially if there are a lot of edges).
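
The idea of one edge collection per edge name can be sketched in plain Python as well. Again this is only a model under stated assumptions: the edge documents, the LIKES edge type, and the function names are hypothetical, and no ArangoDB call is involved. The first lookup has to open every edge and compare its name attribute; the second works on a collection where every edge already qualifies.

```python
# Hypothetical mixed edge collection: every edge carries a "name" attribute.
MIXED_EDGES = [
    {"_from": "users/alice", "_to": "users/bob", "name": "FRIENDS_WITH"},
    {"_from": "users/alice", "_to": "posts/1", "name": "LIKES"},
    {"_from": "users/carol", "_to": "users/alice", "name": "FRIENDS_WITH"},
    {"_from": "users/bob", "_to": "posts/1", "name": "LIKES"},
]


def neighbors_with_filter(edges, vertex, name):
    """Inspect each edge document to check its name (the edgeExamples case)."""
    found = set()
    for edge in edges:
        if edge["name"] != name:  # requires looking into the edge document
            continue
        if edge["_from"] == vertex:
            found.add(edge["_to"])
        elif edge["_to"] == vertex:
            found.add(edge["_from"])
    return found


def split_by_name(edges):
    """Group edge endpoints into one collection per edge name."""
    collections = {}
    for edge in edges:
        pair = (edge["_from"], edge["_to"])
        collections.setdefault(edge["name"], []).append(pair)
    return collections


def neighbors_in_collection(pairs, vertex):
    """No attribute check needed: every edge in the collection qualifies."""
    found = set()
    for source, target in pairs:
        if source == vertex:
            found.add(target)
        elif target == vertex:
            found.add(source)
    return found
```

With split_by_name(MIXED_EDGES)["FRIENDS_WITH"] as input, neighbors_in_collection returns the same neighbors for "users/alice" as neighbors_with_filter does on the mixed list, without reading a name attribute.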

Future

Stay tuned for our next release: we are currently adding more functionality to AQL, which will make your case much easier to query and should give another performance boost.
