How to improve Update query in ArangoDB

I have a collection that holds more than 15 million documents. Out of those 15 million documents I update 20k records every hour, but the update query takes a long time to finish (around 30 minutes).

Document:

{ "inst" : "instance1", "dt": "2015-12-12T00:00:000Z", "count": 10}

I have an array which holds 20k instances to be updated.

My Query looks like this:

FOR h IN hourly
  FILTER h.dt == DATE_ISO8601(1450116000000)
  FOR i IN instArr
    FILTER i.inst == h.inst
    UPDATE h WITH { "inst": i.inst, "dt": i.dt, "count": i.count } IN hourly

Is there an optimized way of doing this? I have a hash index on inst and a skiplist index on dt.
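
For reference, a minimal arangosh sketch of how such indexes can be created with the 2.x API (the collection name hourly is taken from the query above; adjust to your own setup):

// hash index for the equality lookup on inst, skiplist for the dt filter
db.hourly.ensureHashIndex("inst");
db.hourly.ensureSkiplist("dt");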

Update

I could not use all 20k inst values in the query manually, so the following is the execution plan for just 2 inst values:

FOR r IN hourly
  FILTER r.dt == DATE_ISO8601(1450116000000)
  FOR i IN [ { "inst": "0e649fa22bcc5200d7c40f3505da153b", "dt": "2015-12-14T18:00:00.000Z" }, {} ]
    FILTER i.inst == r.inst
    UPDATE r WITH { "inst": i.inst, "dt": i.dt, "max": i.max, "min": i.min, "sum": i.sum, "avg": i.avg, "samples": i.samples }
    IN hourly OPTIONS { ignoreErrors: true }
    RETURN NEW.inst

Execution plan:
 Id   NodeType              Est.   Comment
  1   SingletonNode            1   * ROOT
  5   CalculationNode          1     - LET #6 = [ { "inst" : "0e649fa22bcc5200d7c40f3505da153b", "dt" : "2015-12-14T18:00:00.000Z" }, { } ]   /* json expression */   /* const assignment */
 13   IndexRangeNode      103067     - FOR r IN hourly   /* skiplist index scan */
  6   EnumerateListNode   206134       - FOR i IN #6   /* list iteration */
  7   CalculationNode     206134         - LET #8 = i.`inst` == r.`inst`   /* simple expression */   /* collections used: r : hourly */
  8   FilterNode          206134         - FILTER #8
  9   CalculationNode     206134         - LET #10 = { "inst" : i.`inst`, "dt" : i.`dt`, "max" : i.`max`, "min" : i.`min`, "sum" : i.`sum`, "avg" : i.`avg`, "samples" : i.`samples` }   /* simple expression */
 10   UpdateNode          206134         - UPDATE r WITH #10 IN hourly
 11   CalculationNode     206134         - LET #12 = $NEW.`inst`   /* attribute expression */
 12   ReturnNode          206134         - RETURN #12

Indexes used:
 Id   Type       Collection   Unique   Sparse   Selectivity Est.   Fields   Ranges
 13   skiplist   hourly       false    false                 n/a   `dt`     [ `dt` == "2015-12-14T18:00:00.000Z" ]

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   move-calculations-up-2
  4   move-filters-up-2
  5   remove-data-modification-out-variables
  6   use-index-range
  7   remove-filter-covered-by-index

Write query options:
 Option                   Value
 ignoreErrors             true
 waitForSync              false
 nullMeansRemove          false
 mergeObjects             true
 ignoreDocumentNotFound   false
 readCompleteInput        true 
  
 

Accepted answer

I assume the selection part (not the update part) will be the bottleneck in this query.

The query seems problematic because for each document matching the first filter (h.dt == DATE_ISO8601(...)), there will be an iteration over the 20,000 values in the instArr array. If instArr values are unique, then only one value from it will match. Additionally, no index will be used for the inner loop, as the index selection has happened in the outer loop already.

Instead of looping over all values in instArr, it will be better to turn the accompanying == comparison into an IN comparison. That would already work if instArr were an array of instance names, but it seems to be an array of instance objects (consisting of at least the attributes inst and count). In order to use the instance names in an IN comparison, it would be better to have a dedicated array of instance names, plus a translation table for the count and dt values.

Following is an example for generating these with JavaScript:

// build an array of instance names plus a translation table that
// maps each name to its [ count, dt ] values
var instArr = [ ], trans = { };
for (var i = 0; i < 20000; ++i) {
  var instance = "instance" + i;
  var count = Math.floor(Math.random() * 10);
  var dt = (new Date(Date.now() - Math.floor(Math.random() * 10000))).toISOString();
  instArr.push(instance);
  trans[instance] = [ count, dt ];
}

instArr would then look like this:

[ "instance0", "instance1", "instance2", ... ]

and trans:

{ "instance0" : [ 4, "2015-12-16T21:24:45.106Z" ], "instance1" : [ 0, "2015-12-16T21:24:39.881Z" ], "instance2" : [ 2, "2015-12-16T21:25:47.915Z" ], ... }

These data can then be injected into the query using bind variables (named like the variables above):

FOR h IN hourly
  FILTER h.dt == DATE_ISO8601(1450116000000)
  FILTER h.inst IN @instArr
  RETURN @trans[h.inst]
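
As a minimal sketch (not part of the original answer), this is how the query could be issued from arangosh, passing instArr and trans as bind variables to db._query (using the @trans[h.inst] syntax, which requires 2.6 or later, see below):

// run the selection with both arrays passed as bind variables
var result = db._query(
  "FOR h IN hourly " +
  "  FILTER h.dt == DATE_ISO8601(1450116000000) " +
  "  FILTER h.inst IN @instArr " +
  "  RETURN @trans[h.inst]",
  { instArr: instArr, trans: trans }
).toArray();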

Note that ArangoDB 2.5 does not yet support the @trans[h.inst] syntax. In that version, you will need to write:

LET trans = @trans
FOR h IN hourly
  FILTER h.dt == DATE_ISO8601(1450116000000)
  FILTER h.inst IN @instArr
  RETURN trans[h.inst]

Additionally, 2.5 has a problem with longer IN lists. IN-list performance decreases quadratically with the length of the IN list. So in this version, it will make sense to limit the length of instArr to at most 2,000 values. That may require issuing multiple queries with smaller IN lists instead of just one with a big IN list.
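
A possible sketch of that workaround, assuming the instArr and trans variables built above: split instArr into chunks of at most 2,000 names and issue one query per chunk from arangosh (using the 2.5-compatible LET trans = @trans form):

// split the 20k instance names into IN lists of at most 2,000 entries
var chunkSize = 2000;
for (var start = 0; start < instArr.length; start += chunkSize) {
  var chunk = instArr.slice(start, start + chunkSize);
  db._query(
    "LET trans = @trans " +
    "FOR h IN hourly " +
    "  FILTER h.dt == DATE_ISO8601(1450116000000) " +
    "  FILTER h.inst IN @instArr " +
    "  RETURN trans[h.inst]",
    { instArr: chunk, trans: trans }  // trans is passed whole; only the IN list shrinks
  );
}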

The better alternative would be to use ArangoDB 2.6, 2.7 or 2.8, which do not have that problem, and thus do not require the workaround. Apart from that, you can get away with the slightly shorter version of the query in the newer ArangoDB versions.

Also note that in all of the above examples I used a RETURN ... instead of the UPDATE statement from the original query. This is because all my tests revealed that the selection part of the query is the major problem, at least with the data I had generated. A final note on the original version of the UPDATE: updating each document's inst value with i.inst seems redundant, because i.inst == h.inst, so the value won't change.
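
For completeness, here is a hypothetical UPDATE variant of the optimized query, run from arangosh via db._query with bind variables. It assumes the [ count, dt ] layout of trans shown above and leaves inst out of the update document, since it would not change:

// hypothetical UPDATE variant; trans entries are assumed to be [ count, dt ]
db._query(
  "LET trans = @trans " +
  "FOR h IN hourly " +
  "  FILTER h.dt == DATE_ISO8601(1450116000000) " +
  "  FILTER h.inst IN @instArr " +
  "  LET t = trans[h.inst] " +
  "  UPDATE h WITH { count: t[0], dt: t[1] } IN hourly OPTIONS { ignoreErrors: true }",
  { instArr: instArr, trans: trans }
);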
