ArangoDB facet calculation / aggregation slow?

I wonder why the following facet calculation is so slow:

FOR q IN LRQ  
    COLLECT profile = q.LongRunningQuery.Profile INTO profiles 
RETURN { "Profile" : profile, "Count" : LENGTH(profiles)} 

It takes about 30 seconds, although there are only 5,000 documents in the database and only 30 distinct facets in the result.

The field LongRunningQuery.Profile is indexed with a hash index and with a skiplist index. (I also tried different combinations of them.)

Does anybody have a hint as to what might be going wrong? Is it possible that the query does not benefit from the indexes? (The 5,000 records are about 1 GB in size, so I assume the hash index is not used and a full collection scan happens instead?)

Interestingly, the following alternative takes only 2 seconds:

FOR q IN SKIPLIST(LRQ, { "LongRunningQuery.Profile": [ [ '>',  ''  ] ] })[*].LongRunningQuery.Profile
    COLLECT profile = q INTO profiles
RETURN { "Profile" : profile, "Count" : LENGTH(profiles) } 

But it still needs 2 seconds for such a small number of records. Here it looks like the skiplist index is used, but it may not be the ideal index type.


Update 2014-11-27:

arangosh [_system]> stmt._query
    FOR q IN LRQ COLLECT profile = q.LongRunningQuery.Profile INTO profiles RETURN {
     "Profile" : profile, "Count" : LENGTH(profiles)}

arangosh [_system]> db.LRQ.ensureHashIndex("LongRunningQuery.Profile");
{
  "id" : "LRQ/296017913256",
  "type" : "hash",
  "unique" : false,
  "fields" : [
    "LongRunningQuery.Profile"
  ],
  "isNewlyCreated" : false,
  "error" : false,
  "code" : 200
}

The query took about 32 seconds and returned 31 short results.

Execution plan:

    {
        "plan": {
            "nodes": [
                {
                    "type": "SingletonNode",
                    "dependencies": [],
                    "id": 1,
                    "estimatedCost": 1,
                    "estimatedNrItems": 1
                },
                {
                    "type": "EnumerateCollectionNode",
                    "dependencies": [
                        1
                    ],
                    "id": 2,
                    "estimatedCost": 5311,
                    "estimatedNrItems": 5310,
                    "database": "_system",
                    "collection": "LRQ",
                    "outVariable": {
                        "id": 0,
                        "name": "q"
                    }
                },
                {
                    "type": "CalculationNode",
                    "dependencies": [
                        2
                    ],
                    "id": 3,
                    "estimatedCost": 10621,
                    "estimatedNrItems": 5310,
                    "expression": {
                        "type": "attribute access",
                        "name": "Profile",
                        "subNodes": [
                            {
                                "type": "attribute access",
                                "name": "LongRunningQuery",
                                "subNodes": [
                                    {
                                        "type": "reference",
                                        "name": "q",
                                        "id": 0
                                    }
                                ]
                            }
                        ]
                    },
                    "outVariable": {
                        "id": 3,
                        "name": "3"
                    },
                    "canThrow": false
                },
                {
                    "type": "SortNode",
                    "dependencies": [
                        3
                    ],
                    "id": 4,
                    "estimatedCost": 56166.713176593075,
                    "estimatedNrItems": 5310,
                    "elements": [
                        {
                            "inVariable": {
                                "id": 3,
                                "name": "3"
                            },
                            "ascending": true
                        }
                    ],
                    "stable": true
                },
                {
                    "type": "AggregateNode",
                    "dependencies": [
                        4
                    ],
                    "id": 5,
                    "estimatedCost": 61476.713176593075,
                    "estimatedNrItems": 5310,
                    "aggregates": [
                        {
                            "outVariable": {
                                "id": 1,
                                "name": "profile"
                            },
                            "inVariable": {
                                "id": 3,
                                "name": "3"
                            }
                        }
                    ],
                    "outVariable": {
                        "id": 2,
                        "name": "profiles"
                    }
                },
                {
                    "type": "CalculationNode",
                    "dependencies": [
                        5
                    ],
                    "id": 6,
                    "estimatedCost": 66786.71317659307,
                    "estimatedNrItems": 5310,
                    "expression": {
                        "type": "array",
                        "subNodes": [
                            {
                                "type": "array element",
                                "name": "Profile",
                                "subNodes": [
                                    {
                                        "type": "reference",
                                        "name": "profile",
                                        "id": 1
                                    }
                                ]
                            },
                            {
                                "type": "array element",
                                "name": "Count",
                                "subNodes": [
                                    {
                                        "type": "function call",
                                        "name": "LENGTH",
                                        "subNodes": [
                                            {
                                                "type": "list",
                                                "subNodes": [
                                                    {
                                                        "type": "reference",
                                                        "name": "profiles",
                                                        "id": 2
                                                    }
                                                ]
                                            }
                                        ]
                                    }
                                ]
                            }
                        ]
                    },
                    "outVariable": {
                        "id": 4,
                        "name": "4"
                    },
                    "canThrow": false
                },
                {
                    "type": "ReturnNode",
                    "dependencies": [
                        6
                    ],
                    "id": 7,
                    "estimatedCost": 72096.71317659307,
                    "estimatedNrItems": 5310,
                    "inVariable": {
                        "id": 4,
                        "name": "4"
                    }
                }
            ],
            "rules": [],
            "collections": [
                {
                    "name": "LRQ",
                    "type": "read"
                }
            ],
            "variables": [
                {
                    "id": 0,
                    "name": "q"
                },
                {
                    "id": 1,
                    "name": "profile"
                },
                {
                    "id": 4,
                    "name": "4"
                },
                {
                    "id": 2,
                    "name": "profiles"
                },
                {
                    "id": 3,
                    "name": "3"
                }
            ],
            "estimatedCost": 72096.71317659307,
            "estimatedNrItems": 5310
        },
        "warnings": []
    }

Update 2014-12-05:

Here are additional measurements (understood, thanks). Here's the output:

Execution of

    AQL_EXECUTE('FOR q IN LRQ FILTER q.LongRunningQuery.Profile == "Admin" LIMIT 1 RETURN q.LongRunningQuery.Profile', {}, { profile : true }).profile

-->

    { "initializing" : 0, "parsing" : 0, "optimizing ast" : 15.364980936050415, "instanciating plan" : 0, "optimizing plan" : 0, "executing" : 0 }

Execution of

    AQL_EXECUTE('FOR q IN LRQ COLLECT profile = q.LongRunningQuery.Profile INTO profiles RETURN { "Profile" : profile, "Count" : LENGTH(profiles)}', {}, { profile : true }).profile

-->

    { "initializing" : 0, "parsing" : 0, "optimizing ast" : 0, "instanciating plan" : 0, "optimizing plan" : 0, "executing" : 77.88313102722168 }

Update 2014-12-19:

Since 2.3.2 the execution plan for the query

arangosh [_system]> stmt2 = db._createStatement('FOR q IN LRQ COLLECT profile = q.LongRunningQuery.Profile INTO profiles RETURN { "Profile" : profile, "Count" : LENGTH(profiles)} ')

looks like this:

arangosh [_system]> stmt2.explain()
{
  "plan" : {
    "nodes" : [
      {
        "type" : "SingletonNode",
        "dependencies" : [ ],
        "id" : 1,
        "estimatedCost" : 1,
        "estimatedNrItems" : 1
      },
      {
        "type" : "IndexRangeNode",
        "dependencies" : [
          1
        ],
        "id" : 8,
        "estimatedCost" : 5311,
        "estimatedNrItems" : 5310,
        "database" : "_system",
        "collection" : "LRQ",
        "outVariable" : {
          "id" : 0,
          "name" : "q"
        },
        "ranges" : [
          [ ]
        ],
        "index" : {
          "type" : "skiplist",
          "id" : "530975525379",
          "unique" : false,
          "fields" : [
            "LongRunningQuery.Profile"
          ]
        },
        "reverse" : false
      },
      {
        "type" : "CalculationNode",
        "dependencies" : [
          8
        ],
        "id" : 3,
        "estimatedCost" : 10621,
        "estimatedNrItems" : 5310,
        "expression" : {
          "type" : "attribute access",
          "name" : "Profile",
          "subNodes" : [
            {
              "type" : "attribute access",
              "name" : "LongRunningQuery",
              "subNodes" : [
                {
                  "type" : "reference",
                  "name" : "q",
                  "id" : 0
                }
              ]
            }
          ]
        },
        "outVariable" : {
          "id" : 3,
          "name" : "3"
        },
        "canThrow" : false
      },
      {
        "type" : "AggregateNode",
        "dependencies" : [
          3
        ],
        "id" : 5,
        "estimatedCost" : 15931,
        "estimatedNrItems" : 5310,
        "aggregates" : [
          {
            "outVariable" : {
              "id" : 1,
              "name" : "profile"
            },
            "inVariable" : {
              "id" : 3,
              "name" : "3"
            }
          }
        ],
        "outVariable" : {
          "id" : 2,
          "name" : "profiles"
        }
      },
      {
        "type" : "CalculationNode",
        "dependencies" : [
          5
        ],
        "id" : 6,
        "estimatedCost" : 21241,
        "estimatedNrItems" : 5310,
        "expression" : {
          "type" : "array",
          "subNodes" : [
            {
              "type" : "array element",
              "name" : "Profile",
              "subNodes" : [
                {
                  "type" : "reference",
                  "name" : "profile",
                  "id" : 1
                }
              ]
            },
            {
              "type" : "array element",
              "name" : "Count",
              "subNodes" : [
                {
                  "type" : "function call",
                  "name" : "LENGTH",
                  "subNodes" : [
                    {
                      "type" : "list",
                      "subNodes" : [
                        {
                          "type" : "reference",
                          "name" : "profiles",
                          "id" : 2
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        },
        "outVariable" : {
          "id" : 4,
          "name" : "4"
        },
        "canThrow" : false
      },
      {
        "type" : "ReturnNode",
        "dependencies" : [
          6
        ],
        "id" : 7,
        "estimatedCost" : 26551,
        "estimatedNrItems" : 5310,
        "inVariable" : {
          "id" : 4,
          "name" : "4"
        }
      }
    ],
    "rules" : [
      "use-index-for-sort"
    ],
    "collections" : [
      {
        "name" : "LRQ",
        "type" : "read"
      }
    ],
    "variables" : [
      {
        "id" : 0,
        "name" : "q"
      },
      {
        "id" : 1,
        "name" : "profile"
      },
      {
        "id" : 4,
        "name" : "4"
      },
      {
        "id" : 2,
        "name" : "profiles"
      },
      {
        "id" : 3,
        "name" : "3"
      }
    ],
    "estimatedCost" : 26551,
    "estimatedNrItems" : 5310
  },
  "warnings" : [ ],
  "stats" : {
    "rulesExecuted" : 25,
    "rulesSkipped" : 0,
    "plansCreated" : 1
  }
}

Hm, looking at the explain output there is a SortNode, although your query doesn't specify a sort? The COLLECT probably keeps the optimizer from using your index (otherwise you would see an IndexRangeNode instead of an EnumerateCollectionNode).

If you pass { profile : true } as the options parameter of the query (the 4th parameter of db._query()), it will output the time spent in each phase. Can you re-run your query with that and show us the result?

The COLLECT statement requires sorted input. Therefore, a SORT statement is added to the execution plan automatically, even if the original query string does not contain an explicit SORT statement.

This is why a SortNode appeared in the plan. The SortNode will be optimized away if there is a skiplist index on the sort attribute (in this case LongRunningQuery.Profile). So adding a skiplist index on the attribute will speed up the query because the (expensive) sort step can be skipped.
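The need for sorted input can be illustrated with a minimal sketch of sort-based grouping in plain JavaScript. This is not ArangoDB's implementation; `groupSorted` and the sample documents are made up for illustration. The function walks the rows once and starts a new group whenever the key changes, which only yields one group per key if equal keys are adjacent.

```javascript
// Hypothetical helper, not ArangoDB code: single-pass grouping that assumes
// rows with equal keys are adjacent (i.e. the input is sorted by key).
function groupSorted(rows, keyOf) {
  var groups = [];
  var current = null;
  rows.forEach(function (row) {
    var key = keyOf(row);
    if (current === null || current.key !== key) {
      // The key changed, so the previous group is complete; start a new one.
      current = { key: key, members: [] };
      groups.push(current);
    }
    current.members.push(row);
  });
  return groups;
}

// Made-up sample documents shaped like those in the LRQ collection.
var docs = [
  { LongRunningQuery: { Profile: "Admin" } },
  { LongRunningQuery: { Profile: "User" } },
  { LongRunningQuery: { Profile: "Admin" } }
];

function profileOf(doc) {
  return doc.LongRunningQuery.Profile;
}

// Without sorting, "Admin" is split into two separate groups.
var unsortedGroups = groupSorted(docs, profileOf);

// With sorted input (what a sort step or an ordered index provides),
// each profile ends up in exactly one group.
var sortedDocs = docs.slice().sort(function (a, b) {
  var ka = profileOf(a);
  var kb = profileOf(b);
  return ka < kb ? -1 : (ka > kb ? 1 : 0);
});
var sortedGroups = groupSorted(sortedDocs, profileOf);

console.log(unsortedGroups.length, sortedGroups.length);
```

An index that already stores the attribute in sorted order can deliver the rows in the required order directly, which is why the separate sort step becomes unnecessary.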

If you have set up such a skiplist index and run the query, it should be faster than with only a hash index. In fact, the original query should have ignored the hash index, since a hash index does not keep its entries in sorted order.

If you have set up the skiplist index and explain the query, you should also see that there is no SortNode anymore.
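One way to check this without reading the whole plan by eye is to list the node types from the explain output. The following is a small sketch; `planNodeTypes` and `hasSortNode` are hypothetical helpers that rely only on the `plan.nodes[].type` structure visible in the explain outputs posted in the question, and the two sample plans are trimmed-down copies of those outputs.

```javascript
// Hypothetical helpers: they only read the "plan.nodes" array of an
// explain result, as shown in the outputs posted in the question.
function planNodeTypes(explainResult) {
  return explainResult.plan.nodes.map(function (node) {
    return node.type;
  });
}

function hasSortNode(explainResult) {
  return planNodeTypes(explainResult).indexOf("SortNode") !== -1;
}

// Trimmed-down copy of the first plan from the question (full scan plus sort).
var planWithSort = { plan: { nodes: [
  { type: "SingletonNode" },
  { type: "EnumerateCollectionNode" },
  { type: "CalculationNode" },
  { type: "SortNode" },
  { type: "AggregateNode" },
  { type: "CalculationNode" },
  { type: "ReturnNode" }
] } };

// Trimmed-down copy of the 2.3.2 plan from the question (skiplist index used).
var planWithIndex = { plan: { nodes: [
  { type: "SingletonNode" },
  { type: "IndexRangeNode" },
  { type: "CalculationNode" },
  { type: "AggregateNode" },
  { type: "CalculationNode" },
  { type: "ReturnNode" }
] } };

console.log(hasSortNode(planWithSort));
console.log(hasSortNode(planWithIndex));

// In arangosh, assuming the 2.x shell API, the helper could be applied like this:
//   db.LRQ.ensureSkiplist("LongRunningQuery.Profile");
//   hasSortNode(db._createStatement(queryString).explain());
```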

Starting with ArangoDB 2.4 (currently in the devel stage), there is a new, more efficient syntax for just counting facets:

FOR q IN LRQ  
  COLLECT profile = q.LongRunningQuery.Profile WITH COUNT INTO numProfiles
  RETURN { "Profile" : profile, "Count" : numProfiles } 

This should speed up the query even more.
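A plausible reason for the gain, which is an assumption here and not something the answer states, is that a count-only COLLECT needs to keep just one counter per group instead of every member document. With roughly 1 GB spread over about 5,000 documents, each document averages around 200 KB, so not copying them into groups could matter a lot. The sketch below uses a hypothetical `countSorted` helper and made-up sample values; it is not ArangoDB code.

```javascript
// Hypothetical helper, not ArangoDB code: counts rows per key in one pass
// over key-sorted input and keeps only a counter per group, no members.
function countSorted(sortedKeys) {
  var counts = [];
  var current = null;
  sortedKeys.forEach(function (key) {
    if (current === null || current.Profile !== key) {
      // New key encountered: open a new group with a zero counter.
      current = { Profile: key, Count: 0 };
      counts.push(current);
    }
    current.Count += 1;
  });
  return counts;
}

// Made-up, already sorted profile values.
var facetCounts = countSorted(["Admin", "Admin", "Guest", "User", "User", "User"]);

console.log(JSON.stringify(facetCounts));
```

The result has the same shape as the `{ "Profile" : ..., "Count" : ... }` objects returned by the query, but no document is retained once its key has been read.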
