How to sort paginated logs by @timestamp with Elasticsearch?

My goal is to sort the millions of logs I receive out of Elasticsearch by timestamp.

Example logs:

{"realIp": "192.168.0.2", "@timestamp": "2020-12-06T02:00:09.000Z"}
{"realIp": "192.168.0.2", "@timestamp": "2020-12-06T02:01:09.000Z"}
{"realIp": "192.168.0.2", "@timestamp": "2020-12-06T02:02:09.000Z"}
{"realIp": "192.168.0.2", "@timestamp": "2020-12-06T02:04:09.000Z"}

Unfortunately, I am not able to get all the logs sorted out of Elastic. It seems like I have to do it by myself.

Approaches I have tried to get the data sorted out of Elastic:

# `client` is an existing Elasticsearch client instance
es = Search(index="somelogs-*").using(client).params(preserve_order=True)
for hit in es.scan():
    print(hit['@timestamp'])

Another approach:

notifications = (es
    .query("range", **{
        "@timestamp": {
            'gte': 'now-48h',
            'lt' : 'now'
        }
    })
    .sort("@timestamp")
    .scan()
)

So I am looking for a way to sort these logs myself or directly through Elasticsearch. Currently, I am saving all the data in a local 'logs.json', and it seems to me I have to iterate over it and sort it myself.
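
For reference, doing that sort myself over the local file would look roughly like this (a sketch, assuming 'logs.json' holds one JSON object per line, like the example logs above):

import json

# Load the dumped logs (one JSON object per line) and sort them locally
with open('logs.json') as f:
    logs = [json.loads(line) for line in f]

# ISO 8601 timestamps in the same timezone sort correctly as plain strings
logs.sort(key=lambda log: log['@timestamp'])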

You should definitely let Elasticsearch do the sorting, then return the data to you already sorted.

The problem is that you are using .scan(). It uses Elasticsearch's scan/scroll API, which unfortunately only applies the sorting params on each page/slice, not on the entire search result. This is noted in the elasticsearch-dsl docs on Pagination:

Pagination

...
If you want to access all the documents matched by your query you can use the scan method which uses the scan/scroll elasticsearch API:

 for hit in s.scan():
     print(hit.title)

Note that in this case the results won't be sorted.

(emphasis mine)

Using pagination is definitely an option, especially when you have "millions of logs" as you said. There is a search_after pagination API:

Search after

You can use the search_after parameter to retrieve the next page of hits using a set of sort values from the previous page.
...
To get the first page of results, submit a search request with a sort argument.
...
The search response includes an array of sort values for each hit.
...
To get the next page of results, rerun the previous search using the last hit's sort values as the search_after argument. ... The search's query and sort arguments must remain unchanged. If provided, the from argument must be 0 (default) or -1.
...
You can repeat this process to get additional pages of results.

(omitted the raw JSON requests since I'll show a sample in Python below)

Here's a sample of how to do it with elasticsearch-dsl for Python. Note that I'm limiting the fields and the number of results to make it easier to test. The important parts here are the sort and the extra(search_after=).

from elasticsearch_dsl import Search

# `client` is an existing Elasticsearch client instance
search = Search(using=client, index='some-index')

# The main query
search = search.extra(size=100)
search = search.query('range', **{'@timestamp': {'gte': '2020-12-29T09:00', 'lt': '2020-12-29T09:59'}})
search = search.source(fields=('@timestamp', ))
search = search.sort({
    '@timestamp': {
        'order': 'desc'
    },
})

# Store all the results (it would be better to wrap all this in a generator to be performant)
hits = []

# Get the 1st page
results = search.execute()
hits.extend(results.hits)
total = results.hits.total
print(f'Expecting {total}')

# Get the next pages
# Real use-case condition should be "until total" or "until no more results.hits"
while len(hits) < 1000:  
    print(f'Now have {len(hits)}')
    last_hit_sort_id = hits[-1].meta.sort[0]
    search = search.extra(search_after=[last_hit_sort_id])
    results = search.execute()
    hits.extend(results.hits)

with open('results.txt', 'w') as out:
    for hit in hits:
        out.write(f'{hit["@timestamp"]}\n')

That would lead to already sorted data:

# 1st 10 lines
2020-12-29T09:58:57.749Z
2020-12-29T09:58:55.736Z
2020-12-29T09:58:53.627Z
2020-12-29T09:58:52.738Z
2020-12-29T09:58:47.221Z
2020-12-29T09:58:45.676Z
2020-12-29T09:58:44.523Z
2020-12-29T09:58:43.541Z
2020-12-29T09:58:40.116Z
2020-12-29T09:58:38.206Z
...
# 250-260
2020-12-29T09:50:31.117Z
2020-12-29T09:50:27.754Z
2020-12-29T09:50:25.738Z
2020-12-29T09:50:23.601Z
2020-12-29T09:50:17.736Z
2020-12-29T09:50:15.753Z
2020-12-29T09:50:14.491Z
2020-12-29T09:50:13.555Z
2020-12-29T09:50:07.721Z
2020-12-29T09:50:05.744Z
2020-12-29T09:50:03.630Z 
...
# 675-685
2020-12-29T09:43:30.609Z
2020-12-29T09:43:30.608Z
2020-12-29T09:43:30.602Z
2020-12-29T09:43:30.570Z
2020-12-29T09:43:30.568Z
2020-12-29T09:43:30.529Z
2020-12-29T09:43:30.475Z
2020-12-29T09:43:30.474Z
2020-12-29T09:43:30.468Z
2020-12-29T09:43:30.418Z
2020-12-29T09:43:30.417Z
...
# 840-850
2020-12-29T09:43:27.953Z
2020-12-29T09:43:27.929Z
2020-12-29T09:43:27.927Z
2020-12-29T09:43:27.920Z
2020-12-29T09:43:27.897Z
2020-12-29T09:43:27.895Z
2020-12-29T09:43:27.886Z
2020-12-29T09:43:27.861Z
2020-12-29T09:43:27.860Z
2020-12-29T09:43:27.853Z
2020-12-29T09:43:27.828Z
...
# Last 3
2020-12-29T09:43:25.878Z
2020-12-29T09:43:25.876Z
2020-12-29T09:43:25.869Z 
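
As the comment in the sample hints, for real use it would be better to wrap the paging in a generator rather than collecting everything into a list. A rough sketch of that idea, assuming a `search` object built like the one above (the helper name and page size are just illustrative):

def iter_sorted_hits(search, page_size=100):
    """Yield hits in sort order, paging with search_after (illustrative helper)."""
    search = search.extra(size=page_size)
    results = search.execute()
    while results.hits:
        for hit in results.hits:
            yield hit
        # Feed the last hit's sort values into the next request
        search = search.extra(search_after=list(results.hits[-1].meta.sort))
        results = search.execute()

# Stream the timestamps without holding every page in memory
for hit in iter_sorted_hits(search):
    print(hit['@timestamp'])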

There are some considerations on using search_after, as discussed in the API docs:

  • Use a Point In Time or PIT parameter
    • If a refresh occurs between these requests, the order of your results may change, causing inconsistent results across pages. To prevent this, you can create a point in time (PIT) to preserve the current index state over your searches.

    • You need to first make a POST request to get a PIT ID
    • Then add an extra 'pit': {'id': xxxx, 'keep_alive': '5m'} parameter to every request (see the sketch after this list)
    • Make sure to use the PIT ID from the last response
  • Use a tiebreaker
    • We recommend you include a tiebreaker field in your sort. This tiebreaker field should contain a unique value for each document. If you don't include a tiebreaker field, your paged results could miss or duplicate hits.

    • This would depend on your Document schema

      # Add some ID as a tiebreaker to the `sort` call
      search = search.sort(
          {'@timestamp': {'order': 'desc'}},
          {'some.id': {'order': 'desc'}}
      )
      # Include both the sort ID and the some.id in `search_after`
      last_hit_sort_id, last_hit_route_id = hits[-1].meta.sort
      search = search.extra(search_after=[last_hit_sort_id, last_hit_route_id])
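
For the PIT part, here is a minimal sketch of the flow, assuming Elasticsearch and elasticsearch-py 7.10+ and the same client and 'some-index' as in the sample (the index name, keep-alive and size are just the sample's values):

# Open a point in time (PIT) to pin the index state for the whole paged search
pit = client.open_point_in_time(index='some-index', keep_alive='5m')
pit_id = pit['id']

# A PIT request carries the index itself, so the Search is built without one
search = Search(using=client)
search = search.sort({'@timestamp': {'order': 'desc'}})
search = search.extra(size=100, pit={'id': pit_id, 'keep_alive': '5m'})

results = search.execute()
# Reuse the (possibly updated) PIT id returned with each response,
# and page with search_after exactly as shown above
pit_id = results.to_dict().get('pit_id', pit_id)

# Close the PIT once all pages have been read
client.close_point_in_time(body={'id': pit_id})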

Thank you Gino Mempin. It works!

But I also figured out that a simple change does the same job.

By adding .params(preserve_order=True) to a query that also calls .sort("@timestamp"), Elasticsearch will return all the data already sorted:

es = Search(index="somelog-*").using(client)
notifications = (es
    .query("range", **{
        "@timestamp": {
            'gte': 'now-48h',
            'lt' : 'now'
        }
    })
    .sort("@timestamp")
    .params(preserve_order=True)
    .scan()
)
