How to improve an update query in ArangoDB

Question

I have a collection which holds more than 15 million documents. Out of those 15 million documents I update 20k records every hour. But the update query takes a long time to finish (around 30 minutes).

Document:

{ "inst" : "instance1", "dt": "2015-12-12T00:00:000Z", "count": 10}

I have an array which holds the 20k instances to be updated.

My query looks like this:

For h in hourly filter h.dt == DATE_ISO8601(1450116000000)
   For i in instArr
      filter i.inst == h.inst
      update h with {"inst":i.inst, "dt":i.dt, "count":i.count} in hourly

Is there any optimized way of doing this? I have a hash index on inst and a skiplist index on dt.

Update

I could not put all 20k inst values into the query manually, so the following is the execution plan for just 2 inst values:

FOR r in hourly
  FILTER r.dt == DATE_ISO8601(1450116000000)
  FOR i IN [{"inst":"0e649fa22bcc5200d7c40f3505da153b", "dt":"2015-12-14T18:00:00.000Z"}, {}]
    FILTER i.inst == r.inst
    UPDATE r with {"inst":i.inst, "dt": i.dt, "max":i.max, "min":i.min, "sum":i.sum, "avg":i.avg, "samples":i.samples} in hourly OPTIONS { ignoreErrors: true }
    RETURN NEW.inst

Execution plan:
 Id   NodeType             Est.     Comment
  1   SingletonNode        1        * ROOT
  5   CalculationNode      1          - LET #6 = [ { "inst" : "0e649fa22bcc5200d7c40f3505da153b", "dt" : "2015-12-14T18:00:00.000Z" }, { } ]   /* json expression */   /* const assignment */
 13   IndexRangeNode       103067     - FOR r IN hourly   /* skiplist index scan */
  6   EnumerateListNode    206134       - FOR i IN #6   /* list iteration */
  7   CalculationNode      206134         - LET #8 = i.`inst` == r.`inst`   /* simple expression */   /* collections used: r : hourly */
  8   FilterNode           206134         - FILTER #8
  9   CalculationNode      206134         - LET #10 = { "inst" : i.`inst`, "dt" : i.`dt`, "max" : i.`max`, "min" : i.`min`, "sum" : i.`sum`, "avg" : i.`avg`, "samples" : i.`samples` }   /* simple expression */
 10   UpdateNode           206134         - UPDATE r WITH #10 IN hourly
 11   CalculationNode      206134         - LET #12 = $NEW.`inst`   /* attribute expression */
 12   ReturnNode           206134         - RETURN #12

Indexes used:
 Id   Type       Collection   Unique   Sparse   Selectivity Est.   Fields   Ranges
 13   skiplist   hourly       false    false    n/a                `dt`     [ `dt` == "2015-12-14T18:00:00.000Z" ]

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   move-calculations-up-2
  4   move-filters-up-2
  5   remove-data-modification-out-variables
  6   use-index-range
  7   remove-filter-covered-by-index

Write query options:
 Option                   Value
 ignoreErrors             true
 waitForSync              false
 nullMeansRemove          false
 mergeObjects             true
 ignoreDocumentNotFound   false
 readCompleteInput        true

Answer 1

I assume the selection part (not the update part) will be the bottleneck in this query.

The query seems problematic because for each document matching the first filter ( h.dt == DATE_ISO8601(...) ), there will be an iteration over the 20,000 values in the instArr array. If the instArr values are unique, then only one value from it will match. Additionally, no index will be used for the inner loop, as the index selection has already happened in the outer loop.

Instead of looping over all values in instArr, it will be better to turn the accompanying == comparison into an IN comparison. That would already work if instArr were an array of instance names, but it seems to be an array of instance objects (consisting of at least the attributes inst and count). In order to use the instance names in an IN comparison, it would be better to have a dedicated array of instance names, and a translation table for the count and dt values.

The following is an example of generating these with JavaScript:

// Build the array of instance names and the translation table
var instArr = [ ], trans = { };
for (var i = 0; i < 20000; ++i) {
  var instance = "instance" + i;
  var count = Math.floor(Math.random() * 10);
  var dt = (new Date(Date.now() - Math.floor(Math.random() * 10000))).toISOString();
  instArr.push(instance);            // name used in the IN comparison
  trans[instance] = [ count, dt ];   // lookup: name -> [ count, dt ]
}

instArr would then look like this:

[ "instance0", "instance1", "instance2", ... ]

and trans:

{ 
  "instance0" : [ 4, "2015-12-16T21:24:45.106Z" ], 
  "instance1" : [ 0, "2015-12-16T21:24:39.881Z" ],
  "instance2" : [ 2, "2015-12-16T21:25:47.915Z" ],
  ...
}
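
In the question, the input is already an array of instance objects, so the two structures could be derived from it instead of being generated randomly. The following is only a minimal sketch under that assumption: the name instObjects, the helper toBindVars and the sample values are hypothetical and do not appear in the question.

```javascript
// Hypothetical input: an array of instance objects as described in the question.
var instObjects = [
  { inst: "instance1", dt: "2015-12-14T18:00:00.000Z", count: 10 },
  { inst: "instance2", dt: "2015-12-14T18:00:00.000Z", count: 3 }
];

// Split the objects into an array of names and a name -> [ count, dt ] table.
function toBindVars(objects) {
  var instArr = [ ], trans = { };
  objects.forEach(function (o) {
    instArr.push(o.inst);
    trans[o.inst] = [ o.count, o.dt ];
  });
  return { instArr: instArr, trans: trans };
}

var bindVars = toBindVars(instObjects);
```

The resulting object has the attributes instArr and trans, so it could be passed as the bind variables for a query that uses @instArr and @trans.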

These data can then be injected into the query using bind variables (named like the variables above):

FOR h IN hourly 
  FILTER h.dt == DATE_ISO8601(1450116000000) 
  FILTER h.inst IN @instArr 
  RETURN @trans[h.inst]

Note that ArangoDB 2.5 does not yet support the @trans[h.inst] syntax. In that version, you will need to write:

LET trans = @trans
FOR h IN hourly 
  FILTER h.dt == DATE_ISO8601(1450116000000) 
  FILTER h.inst IN @instArr 
  RETURN trans[h.inst]

Additionally, 2.5 has a problem with longer IN lists: IN-list performance decreases quadratically with the length of the IN list. So in this version, it will make sense to limit the length of instArr to at most 2,000 values. That may require issuing multiple queries with smaller IN lists instead of just one with a big IN list.
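
Splitting the names into smaller IN lists could look roughly like the sketch below. The helper chunk and the variable names are hypothetical, and the batch size of 2,000 is just the limit suggested above for 2.5.

```javascript
// Split an array into consecutive slices of at most `size` elements.
function chunk(arr, size) {
  var out = [ ];
  for (var i = 0; i < arr.length; i += size) {
    out.push(arr.slice(i, i + size));
  }
  return out;
}

// Hypothetical data: 20,000 instance names.
var names = [ ];
for (var n = 0; n < 20000; ++n) {
  names.push("instance" + n);
}

// 10 batches of 2,000 names each; every batch would be bound
// to @instArr in a separate execution of the query.
var batches = chunk(names, 2000);
```

The translation table can stay as it is for every batch, because it is only accessed by instance name.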

The better alternative would be to use ArangoDB 2.6, 2.7 or 2.8, which do not have that problem, and thus do not require the workaround. Apart from that, you can get away with the slightly shorter version of the query in the newer ArangoDB versions.

Also note that in all of the above examples I used a RETURN ... instead of the UPDATE statement from the original query. This is because all my tests revealed that the selection part of the query is the major problem, at least with the data I had generated. A final note on the original version of the UPDATE: updating each document's inst value with i.inst seems redundant, because i.inst == h.inst, so the value won't change.
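
To turn the selection back into an update, the RETURN could be replaced by an UPDATE that reads its values from the translation table. The sketch below is an assumption and was not part of the tests described above: it relies on the @trans[h.inst] syntax of the newer versions, leaves out inst because it does not change, and uses small illustrative bind values.

```javascript
// AQL text for the UPDATE variant (an untested sketch).
var query = [
  'FOR h IN hourly',
  '  FILTER h.dt == DATE_ISO8601(1450116000000)',
  '  FILTER h.inst IN @instArr',
  '  LET t = @trans[h.inst]',
  '  UPDATE h WITH { "count": t[0], "dt": t[1] } IN hourly'
].join("\n");

// Illustrative bind values in the same shape as instArr and trans above.
var bindVars = {
  instArr: [ "instance0", "instance1" ],
  trans: {
    "instance0": [ 4, "2015-12-16T21:24:45.106Z" ],
    "instance1": [ 0, "2015-12-16T21:24:39.881Z" ]
  }
};

// In arangosh, this could then be executed with:
//   db._query(query, bindVars);
```

Each entry of trans holds [ count, dt ], so t[0] is the new count and t[1] the new dt value.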

Answer 1 (4, accepted, 2015-12-16 21:29:05)
