
Grouping ElasticSearch documents on xml tag value (in a string field)

I have these kinds of documents in my ElasticSearch index:

{
    "took" : 31,
    "timed_out" : false,
    "_shards" : {
        "total" : 68,
        "successful" : 68,
        "failed" : 0
    },
    "hits" : {
        "total" : 9103,
        "max_score" : 8.823501,
        "hits" : [{
                "_index" : "ESB",
                "_type" : "MDOrderFO",
                "_id" : "AVaxDzEGBclOg4W8YiW1",
                "_score" : 8.823501,
                "_source" : {
                    "message" : "<root><flux>MyFlux</flux><requestId>123</requestId><timeStamp>2016-26-08T09:37:17</timeStamp><step>1</step><status>ok</status><body><xml><myobject><field1>value1</field1></myobject></xml></body></root>",
                    "timestamp" : "2016-08-22T07:02:57.085Z",
                    "logger_name" : "MDOrderFOToFO"
                }
            }, {
                "_index" : "ESB",
                "_type" : "MDOrderFO",
                "_id" : "AVaxDzEGBclOg4W8YiW1",
                "_score" : 8.823501,
                "_source" : {
                    "message" : "<root><flux>MyFlux</flux><requestId>123</requestId><timeStamp>2016-26-08T09:37:17</timeStamp><step>2</step><status>ok</status><body><xml><myobject><field1>value1</field1></myobject></xml></body></root>",
                    "timestamp" : "2016-08-22T07:02:57.085Z",
                    "logger_name" : "MDOrderFOToFO"
                }
            }, {
                "_index" : "ESB",
                "_type" : "MDOrderFO",
                "_id" : "AVaxDzEGBclOg4W8YiW1",
                "_score" : 8.823501,
                "_source" : {
                    "message" : "<root><flux>MyFlux</flux><requestId>123</requestId><timeStamp>2016-26-08T09:37:18</timeStamp><step>3</step><status>ok</status><body><xml><myobject><field1>value1</field1></myobject></xml></body></root>",
                    "timestamp" : "2016-08-22T07:02:57.085Z",
                    "logger_name" : "MDOrderFOToFO"
                }
            }, {
                "_index" : "ESB",
                "_type" : "MDOrderFO",
                "_id" : "AVaxDzEGBclOg4W8YiW1",
                "_score" : 8.823501,
                "_source" : {
                    "message" : "<root><flux>MyFlux</flux><requestId>123</requestId><timeStamp>2016-26-08T09:37:26</timeStamp><step>1</step><status>ok</status><body><xml><myobject><field1>value1</field1></myobject></xml></body></root>",
                    "timestamp" : "2016-08-22T07:02:57.085Z",
                    "logger_name" : "MDOrderFOToFO"
                }
            }, {
                "_index" : "ESB",
                "_type" : "MDOrderFO",
                "_id" : "AVaxDzEGBclOg4W8YiW1",
                "_score" : 8.823501,
                "_source" : {
                    "message" : "<root><flux>MyFlux</flux><requestId>456</requestId><timeStamp>2016-26-08T09:37:27</timeStamp><step>2</step><status>ok</status><body><xml><myobject><field1>value1</field1></myobject></xml></body></root>",
                    "timestamp" : "2016-08-22T07:02:57.085Z",
                    "logger_name" : "MDOrderFOToFO"
                }
            }, {
                "_index" : "ESB",
                "_type" : "MDOrderFO",
                "_id" : "AVaxDzEGBclOg4W8YiW1",
                "_score" : 8.823501,
                "_source" : {
                    "message" : "<root><flux>MyFlux</flux><requestId>456</requestId><timeStamp>2016-26-08T09:37:27</timeStamp><step>3</step><status>ok</status><body><xml><myobject><field1>value1</field1></myobject></xml></body></root>",
                    "timestamp" : "2016-08-22T07:02:57.085Z",
                    "logger_name" : "MDOrderFOToFO"
                }
            }, {
                "_index" : "ESB",
                "_type" : "MDOrderFO",
                "_id" : "AVaxDzEGBclOg4W8YiW1",
                "_score" : 8.823501,
                "_source" : {
                    "message" : "<root><flux>MyFlux</flux><requestId>456</requestId><timeStamp>2016-26-08T09:37:17</timeStamp><step>2</step><status>ok</status><body><xml><myobject><field1>value1</field1></myobject></xml></body></root>",
                    "timestamp" : "2016-08-22T07:02:57.085Z",
                    "logger_name" : "MDOrderFOToFO"
                }
            }
        ]
    }
}

Here is the XML format of the message field:

<root>
    <flux>MyFlux</flux>
    <requestId>123</requestId>
    <timeStamp>2016-26-08T09:37:17</timeStamp>
    <step>2</step>
    <status>ok</status>
    <body><xml><myobject><field1>value1</field1></myobject></xml></body>
</root>

I'd like to build a query that groups my documents on the requestId value (which is inside the XML content of the message field). I expect this kind of answer:

{
    "took" : 31,
    "timed_out" : false,
    "_shards" : {
        "total" : 68,
        "successful" : 68,
        "failed" : 0
    },
    "hits" : {
        "total" : 9103,
        "max_score" : 8.823501,
        "hits" : [...],
        "aggregations" : {
            "myaggs" : {
                "doc_count_error_upper_bound" : 0,
                "sum_other_doc_count" : 0,
                "buckets" : [{
                        "key" : "123",
                        "documents" : [{
                                "_index" : "ESB",
                                "_type" : "MDOrderFO",
                                "_id" : "AVaxDzEGBclOg4W8YiW1",
                                "_score" : 8.823501,
                                "_source" : {
                                    "message" : "<root><flux>MyFlux</flux><requestId>123</requestId><timeStamp>2016-26-08T09:37:17</timeStamp><step>1</step><status>ok</status><body><xml><myobject><field1>value1</field1></myobject></xml></body></root>",
                                    "timestamp" : "2016-08-22T07:02:57.085Z",
                                    "logger_name" : "MDOrderFOToFO"
                                }
                            }, {
                                "_index" : "ESB",
                                "_type" : "MDOrderFO",
                                "_id" : "AVaxDzEGBclOg4W8YiW1",
                                "_score" : 8.823501,
                                "_source" : {
                                    "message" : "<root><flux>MyFlux</flux><requestId>123</requestId><timeStamp>2016-26-08T09:37:17</timeStamp><step>2</step><status>ok</status><body><xml><myobject><field1>value1</field1></myobject></xml></body></root>",
                                    "timestamp" : "2016-08-22T07:02:57.085Z",
                                    "logger_name" : "MDOrderFOToFO"
                                }
                            }, {
                                "_index" : "ESB",
                                "_type" : "MDOrderFO",
                                "_id" : "AVaxDzEGBclOg4W8YiW1",
                                "_score" : 8.823501,
                                "_source" : {
                                    "message" : "<root><flux>MyFlux</flux><requestId>123</requestId><timeStamp>2016-26-08T09:37:18</timeStamp><step>3</step><status>ok</status><body><xml><myobject><field1>value1</field1></myobject></xml></body></root>",
                                    "timestamp" : "2016-08-22T07:02:57.085Z",
                                    "logger_name" : "MDOrderFOToFO"
                                }
                            }, {
                                "_index" : "ESB",
                                "_type" : "MDOrderFO",
                                "_id" : "AVaxDzEGBclOg4W8YiW1",
                                "_score" : 8.823501,
                                "_source" : {
                                    "message" : "<root><flux>MyFlux</flux><requestId>123</requestId><timeStamp>2016-26-08T09:37:26</timeStamp><step>1</step><status>ok</status><body><xml><myobject><field1>value1</field1></myobject></xml></body></root>",
                                    "timestamp" : "2016-08-22T07:02:57.085Z",
                                    "logger_name" : "MDOrderFOToFO"
                                }
                            }
                        ]
                    }, {
                        "key" : "456",
                        "documents" : [{
                                "_index" : "ESB",
                                "_type" : "MDOrderFO",
                                "_id" : "AVaxDzEGBclOg4W8YiW1",
                                "_score" : 8.823501,
                                "_source" : {
                                    "message" : "<root><flux>MyFlux</flux><requestId>456</requestId><timeStamp>2016-26-08T09:37:27</timeStamp><step>2</step><status>ok</status><body><xml><myobject><field1>value1</field1></myobject></xml></body></root>",
                                    "timestamp" : "2016-08-22T07:02:57.085Z",
                                    "logger_name" : "MDOrderFOToFO"
                                }
                            }, {
                                "_index" : "ESB",
                                "_type" : "MDOrderFO",
                                "_id" : "AVaxDzEGBclOg4W8YiW1",
                                "_score" : 8.823501,
                                "_source" : {
                                    "message" : "<root><flux>MyFlux</flux><requestId>456</requestId><timeStamp>2016-26-08T09:37:27</timeStamp><step>3</step><status>ok</status><body><xml><myobject><field1>value1</field1></myobject></xml></body></root>",
                                    "timestamp" : "2016-08-22T07:02:57.085Z",
                                    "logger_name" : "MDOrderFOToFO"
                                }
                            }, {
                                "_index" : "ESB",
                                "_type" : "MDOrderFO",
                                "_id" : "AVaxDzEGBclOg4W8YiW1",
                                "_score" : 8.823501,
                                "_source" : {
                                    "message" : "<root><flux>MyFlux</flux><requestId>456</requestId><timeStamp>2016-26-08T09:37:17</timeStamp><step>2</step><status>ok</status><body><xml><myobject><field1>value1</field1></myobject></xml></body></root>",
                                    "timestamp" : "2016-08-22T07:02:57.085Z",
                                    "logger_name" : "MDOrderFOToFO"
                                }
                            }
                        ]
                    }
                ]
            }
        }
    }
}

I'm very new to ElasticSearch and I've spent a week on this… at this point, I don't even know whether it's possible.

I really hope you'll be able to help me. Thank you in advance.

And of course, as a French speaker, sorry for my English.

EDIT
- Unfortunately, I can't edit the mapping. I don't have access to the part of the process that saves logs into ES.
- Actually, the formats I gave are quite simplified compared to reality; plenty of other technical information is logged, both at the mapping level and in the XML content. The context: the BUS application that pushes logs into ES has 3 steps (1: receiving, 2: routing, 3: sending). It logs information about the state of a request (ok, fail) and the object transiting through that request. The purpose of the application I'm working on is to display business information, over a date range, about all the requests that have transited through the BUS application.
So in my query, I want to:
1. Aggregate my logs by requestId (each group should contain 1 log at the receiving step, 0 or 1 log at the routing step, and 0 or 1 log at the sending step)
2. Filter the resulting groups on the date of the receiving-step log
3. Take the first 10 groups, ordered by date descending
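For reference, the three steps above map fairly directly onto a `terms` aggregation with a `top_hits` sub-aggregation, but only against a hypothetical mapping where `requestId` and the receiving-step date (called `received_at` here, an invented name) are indexed as their own fields; a sketch of that query body, built as a Python dict:

```python
import json

# Sketch only: assumes a hypothetical mapping where requestId and a
# receiving-step date field ("received_at", an invented name) are
# indexed as their own fields, not the current message-blob mapping.
query_body = {
    "size": 0,
    "query": {
        # step 2 (approximately): restrict logs to the wanted date range
        "range": {"received_at": {"gte": "2016-08-01", "lte": "2016-08-31"}}
    },
    "aggs": {
        "by_request": {                      # step 1: one bucket per requestId
            "terms": {
                "field": "requestId",
                "size": 10,                  # step 3: first 10 groups,
                "order": {"latest": "desc"}  # ordered by date descending
            },
            "aggs": {
                "latest": {"max": {"field": "received_at"}},
                "documents": {"top_hits": {"size": 3}}  # the logs per group
            }
        }
    }
}

print(json.dumps(query_body, indent=2))
```

This is only a sketch of the shape of such a query, not something the current index can run.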

One way to achieve that is by modifying your database schema. Since your XML schema is fixed, you can store each XML node in a separate field in Elastic instead of storing the entire XML in a single field. For example, flux, requestId, timeStamp, etc. would each be mapped to a separate field (possibly with the same name) in Elastic.

I am not 100% sure what you want to achieve here, so I'll try to point out some things you could consider and/or try:

The way you store data in your ES index is not very query-friendly, no matter what you are trying to achieve. I would suggest breaking up your XML documents and storing each attribute in a separate field, like this:

"_source" : {
    "flux": "My Flux",
    "requestId": 123,
    "xml_timeStamp": "2016-26-08T09:37:17",
    "step": 1,
    "status": "ok",
    "field1": "value1",
    "timestamp" : "2016-08-22T07:02:57.085Z",
    "logger_name" : "MDOrderFOToFO"
}

Storing your data this way would mean you'd only need a simple value_count aggregation.

In order to achieve this, you'd probably need a method that breaks up your XML documents and maps them onto this new ElasticSearch mapping.
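A minimal sketch of such a method in Python, using the standard library's `xml.etree.ElementTree` to flatten one `message` value into separate fields (the output field names mirror the suggested `_source` above and are illustrative):

```python
import xml.etree.ElementTree as ET

def flatten_message(message: str) -> dict:
    """Parse the XML logged in the `message` field into a flat dict,
    suitable for indexing each node as its own Elasticsearch field."""
    root = ET.fromstring(message)
    return {
        "flux": root.findtext("flux"),
        "requestId": root.findtext("requestId"),
        "xml_timeStamp": root.findtext("timeStamp"),
        "step": int(root.findtext("step")),
        "status": root.findtext("status"),
        "field1": root.findtext("body/xml/myobject/field1"),
    }

sample = ("<root><flux>MyFlux</flux><requestId>123</requestId>"
          "<timeStamp>2016-26-08T09:37:17</timeStamp><step>1</step>"
          "<status>ok</status><body><xml><myobject>"
          "<field1>value1</field1></myobject></xml></body></root>")
print(flatten_message(sample))
```

A real ingest pipeline would call this once per log line before indexing, instead of indexing the raw XML string.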

In that case, your aggregation query would look something like this:

{
    "aggs" : {
        "myaggs" : {
            "value_count" : { "field" : "requestId" }
        }
    }
}

If it's impossible for you to update your index mapping, I would suggest looking into regex filtering and including that in an aggregation query.
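If neither the mapping nor server-side filtering works out, a fallback (my own suggestion, distinct from the regex filter mentioned above) is to pull the matching hits and group them client-side with the same kind of regular expression; a minimal Python sketch:

```python
import re
from collections import defaultdict

# Regex over the raw XML string stored in the `message` field.
REQUEST_ID_RE = re.compile(r"<requestId>(.*?)</requestId>")

def group_hits_by_request_id(hits: list) -> dict:
    """Group raw Elasticsearch hits by the requestId embedded in the
    XML of their `message` field; unmatched hits go under None."""
    groups = defaultdict(list)
    for hit in hits:
        match = REQUEST_ID_RE.search(hit["_source"]["message"])
        groups[match.group(1) if match else None].append(hit)
    return dict(groups)

# Toy hits, trimmed to the fields this sketch actually reads.
hits = [
    {"_source": {"message": "<root><requestId>123</requestId></root>"}},
    {"_source": {"message": "<root><requestId>456</requestId></root>"}},
    {"_source": {"message": "<root><requestId>123</requestId></root>"}},
]
groups = group_hits_by_request_id(hits)
print({k: len(v) for k, v in groups.items()})  # → {'123': 2, '456': 1}
```

This only makes sense for result sets small enough to page through; it does not scale to the full index.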

Either way, those aggregations won't return the documents inside each bucket. There is no good use case for returning all documents from Elasticsearch, or from any other type of database; it would be a very memory-intensive operation, and slow as well.

If you'd like your documents to be returned ordered by requestId, then consider changing the index mapping to the one I suggested above, and use sort to return your data.
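Against such a remapped index, the sorted request might look like this sketch (the field name assumes the remapped `_source` suggested earlier in this answer):

```python
import json

# Sketch: assumes requestId exists as its own sortable field,
# as in the remapped _source suggested earlier in this answer.
query_body = {
    "query": {"match_all": {}},
    "sort": [{"requestId": {"order": "asc"}}]
}
print(json.dumps(query_body, indent=2))
```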

Let me know if this helps :)
