
Elasticsearch aggregation that compares document field to bucket key for multiple buckets

I have an index of orders. Each document in the index contains the date the order was completed. I am trying to build an aggregation that gives me the historical work in progress (WIP) for a date histogram aggregation. The WIP is calculated by comparing the completed date with each date in the date histogram: if the completed date is greater than the current bucket date, the order is considered in progress and should be included in that bucket.
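To make the intended logic concrete: for a given histogram day, an order is in progress when it was created on or before that day but completed after it (the created-date condition is implied by the aggregation field and the reduce script later in the post). A minimal Python sketch of that calculation, with illustrative dates and shortened field names:

```python
from datetime import date, timedelta

# Illustrative orders: each has a created date and a completed date.
orders = [
    {"created": date(2015, 10, 1), "completed": date(2015, 10, 5)},
    {"created": date(2015, 10, 2), "completed": date(2015, 10, 3)},
    {"created": date(2015, 10, 4), "completed": date(2015, 10, 8)},
]

def wip_on(day, orders):
    """Count orders started on or before `day` but not yet completed."""
    return sum(1 for o in orders if o["created"] <= day < o["completed"])

# WIP for each day of the histogram range.
start, end = date(2015, 10, 1), date(2015, 10, 8)
histogram = {}
day = start
while day <= end:
    histogram[day.isoformat()] = wip_on(day, orders)
    day += timedelta(days=1)
```

The key difference from an ordinary date_histogram is that one order contributes to every bucket it spans, not just the bucket it was created in.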

From my research, the best I can determine is that a date_histogram using a value script would give me the results I need. However, I can't figure out how to structure the script.

Currently my query looks like this:

{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "wip": {
            "date_histogram": {
                "field": "com_ord_created_ddate",
                "script": "doc['com_ord_completed_ddate'] > _value",
                "interval": "day",
                "format": "yyyy-MM-dd"
            }
        }
    }
}

This query returns the following exception:

{
  "error": {
    "root_cause": [
      {
        "type": "script_exception",
        "reason": "failed to run inline script [doc['com_ord_completed_ddate'] > _value] using lang [groovy]"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "orders",
        "node": "eYYqpuNSQ0KOt04JEztTDg",
        "reason": {
          "type": "script_exception",
          "reason": "failed to run inline script [doc['com_ord_completed_ddate'] > _value] using lang [groovy]",
          "caused_by": {
            "type": "groovy_runtime_exception",
            "reason": "Cannot compare org.elasticsearch.index.fielddata.ScriptDocValues$Longs with value '[1445864743000]' and java.lang.Long with value '1,445,618,646,000'"
          }
        }
      }
    ]
  },
  "status": 500
}

I know the script is written poorly. I have not been able to find documentation that clearly outlines the variables available inside the script scope. I got _value from a Qbox tutorial here: https://qbox.io/blog/elasticsearch-scripting-aggregations, but it doesn't say enough about what _value is or what other variables are available to operate on.

Can anyone point me to clear documentation on inline value scripting in aggregations, or help me by providing a script that would get the results I need?

UPDATE: I was able to get the first part of my question working using this:

{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "wip": {
            "date_histogram": {
                "field": "com_ord_created_ddate",
                "script": "if(_value < doc['com_ord_completed_ddate'].value) {_value} else {0}",
                "interval": "day",
                "format": "yyyy-MM-dd"
            }
        }
    }
}

However, the script is limited to comparing documents already aggregated into the bucket. I need to compare all documents in the result set against every bucket. Any thoughts?

Please try the following query:

{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "wip": {
            "date_histogram": {
                "field": "com_ord_created_ddate",
                "script": "doc['com_ord_completed_ddate'].value > _value",
                "interval": "day",
                "format": "yyyy-MM-dd"
            }
        }
    }
}

Note the .value after doc['com_ord_completed_ddate']. Without it, doc['com_ord_completed_ddate'] is a ScriptDocValues object (a list of longs), which cannot be compared directly to the Long in _value — that is exactly the comparison failure reported in the exception above.

Documentation for script aggregation

I was really close to getting what I needed with a scripted_metric aggregation that looked like this:

"aggs": {
    "wip": {
        "scripted_metric": {
            "init_script": "_agg['created_dates'] = []; _agg['documents'] = []",
            "map_script": "_agg.created_dates.add(doc['com_ord_created_ddate'].value); document = [:]; document.created_date = doc['com_ord_created_ddate']; document.completed_date = doc['com_ord_completed_ddate']; _agg.documents.add(document); return _agg;",
            "combine_script": "_agg.created_dates.unique()",
            "reduce_script": "results = []; wip = [:];for (agg in _aggs) { for (d in agg.created_dates) {wip.key = d; wip.doc_count = 0; for(o in agg.documents) { if (d < o.completed_date && d >= o.created_date) { wip.doc_count++ } }; results.add(wip); }; }; return results;"
        }
    }
}

Unfortunately, the script is poorly optimized and the query was taking 10-15 seconds, which for ES is an eternity and did not meet the project's performance requirements.

In the end I was able to get the results I needed by making multiple ES queries. I first ran the date_histogram aggregation above without the script parameter. Then, server-side, I looped over each date returned by the aggregation and performed another query to get the historical WIP for that specific day. It's not pretty, but surprisingly it was quicker than the scripted_metric. Total load time for the end user is about 2-3 seconds, which is an acceptable time frame for my use case.
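The post doesn't show the per-day follow-up query; one plausible shape for it, sketched as a Python query builder (field names from the question; expressing "created on or before the day, completed after it" as bool/range filters is an assumption):

```python
def wip_query(day):
    """Build a hypothetical per-day follow-up query: orders created on or
    before `day` and completed after it (field names from the question)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"range": {"com_ord_created_ddate": {"lte": day}}},
                    {"range": {"com_ord_completed_ddate": {"gt": day}}},
                ]
            }
        },
        "size": 0,  # only the hit count is needed, not the documents
    }

# One query per bucket date returned by the plain date_histogram.
bucket_dates = ["2015-10-01", "2015-10-02", "2015-10-03"]
queries = [wip_query(d) for d in bucket_dates]
```

Each body would then be sent to the search endpoint and the total hit count recorded as that day's WIP.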
