
How to retrieve all the records from Elasticsearch in Rails

There is an upper limit on the number of documents you can get from Elasticsearch in a single search (10,000). We can use "scroll" to retrieve all the records. Does anyone know how to implement this in code?

There is this scroll method:

https://github.com/elastic/elasticsearch-ruby/blob/4608fd144277941003de71a0cdc24bd39f17a012/elasticsearch-api/lib/elasticsearch/api/actions/scroll.rb

But I don't know how to use it. Could you explain how to use it?

I have tried "scan", but it is no longer supported in Elasticsearch.

# Open the "view" of the index
response = client.search index: 'test', search_type: 'scan', scroll: '5m', size: 10

# Call `scroll` until results are empty
while response = client.scroll(scroll_id: response['_scroll_id'], scroll: '5m') and not 
   response['hits']['hits'].empty? do
      puts response['hits']['hits'].map { |r| r['_source']['title'] }
end

Your code should work, but as you mentioned, the scan value for the search_type parameter is not necessary. I just ran this locally with some test data and it worked:

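# scroll.rb
require 'elasticsearch'

client = Elasticsearch::Client.new

# Open the scroll context; the first page of hits comes back with this response
response = client.search(index: 'articles', scroll: '10m')

while response['hits']['hits'].size.positive?
  # Print the current page before asking for the next one
  puts(response['hits']['hits'].map { |r| r['_source']['title'] })
  # Always pass the most recent scroll ID, since it can change between calls
  response = client.scroll(scroll: '5m', body: { scroll_id: response['_scroll_id'] })
end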

Output:

$ ruby scroll.rb                                                                                         
Title 297                                                                                                
Title 298                                                                                                
Title 299                                                                                                
Title 300
...

You can fiddle around with the value for the scroll parameter, but something like this should work for you too.
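The loop above can also be wrapped in a small helper. This is only a sketch, not part of the original answer: the method name each_scrolled_hit and its keyword arguments are hypothetical, and it assumes the client responds to search, scroll and clear_scroll the way Elasticsearch::Client does. It yields every hit, including the first page, and clears the scroll context when it is done.

```ruby
# Hypothetical helper: yields every hit of an index through the scroll API.
# `client` is assumed to be an Elasticsearch::Client (or a compatible object).
def each_scrolled_hit(client, index:, keep_alive: '5m', size: 1000)
  # The initial search opens the scroll context and returns the first page.
  response = client.search(index: index, scroll: keep_alive, size: size)
  scroll_id = response['_scroll_id']

  begin
    until response['hits']['hits'].empty?
      response['hits']['hits'].each { |hit| yield hit }
      # Ask for the next page and remember the latest scroll ID.
      response = client.scroll(scroll: keep_alive, body: { scroll_id: scroll_id })
      scroll_id = response['_scroll_id']
    end
  ensure
    # Release the server-side scroll context instead of waiting for it to expire.
    client.clear_scroll(body: { scroll_id: scroll_id }) if scroll_id
  end
end

# Usage (assumes the elasticsearch gem and a reachable cluster):
#   require 'elasticsearch'
#   client = Elasticsearch::Client.new
#   each_scrolled_hit(client, index: 'articles') { |hit| puts hit['_source']['title'] }
```

The size of 1000 is just an illustrative default; a larger page size means fewer round trips but bigger responses.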

A paragraph from the official Elastic docs:

We no longer recommend using the scroll API for deep pagination. If you need to preserve the index state while paging through more than 10,000 hits, use the search_after parameter with a point in time (PIT).

Scroll Official Doc Link

I recommend using pagination.

The limit on the number of hits exists for performance reasons. You can use pagination instead, which is much faster.

With pagination, you can either set a starting offset with the from key, or use the search_after key together with sort and a PIT (point in time, which prevents inconsistent results between pages). You can also set the size key to 10 for a faster query time.

Pagination Official Doc Link
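As a rough illustration of the from/size variant in Ruby: the sketch below assumes a client that behaves like Elasticsearch::Client, and the method name each_page and the constant MAX_RESULT_WINDOW are made up for this example. It stops at 10,000 (the default index.max_result_window), because from + size cannot go past that window; anything deeper needs search_after.

```ruby
# Default value of index.max_result_window; from + size may not exceed it.
MAX_RESULT_WINDOW = 10_000

# Hypothetical helper: yields one array of hits per page using from/size.
# `client` is assumed to be an Elasticsearch::Client (or a compatible object).
def each_page(client, index:, page_size: 10)
  from = 0
  while from + page_size <= MAX_RESULT_WINDOW
    response = client.search(index: index, body: { from: from, size: page_size })
    hits = response['hits']['hits']
    break if hits.empty?

    yield hits
    from += page_size
  end
end

# Usage (assumes the elasticsearch gem and a reachable cluster):
#   require 'elasticsearch'
#   client = Elasticsearch::Client.new
#   each_page(client, index: 'test') { |hits| puts hits.size }
```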

To create a PIT ID:

POST /test/_pit?keep_alive=1m

To start paginating (a search that uses a PIT must not name an index in the path, because the PIT already determines it):

GET /_search
{
  "size": 10,
  "query": {
    "match" : {
      "user.id" : "elkbee"
    }
  },
  "pit": {
    "id":  "46ToAwMDaWR5BXV1aWQyKwZub2RlXzMAAAAAAAAAACoBYwADaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR5BXV1aWQyKgZub2RlXzIAAAAAAAAAAAwBYgACBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==", 
    "keep_alive": "1m"
  },
  "sort": [ 
    {"@timestamp": {"order": "asc", "format": "strict_date_optional_time_nanos", "numeric_type" : "date_nanos" }}
  ]
}

To get the rest of the data: each hit in the response contains a sort key; put the values from the last hit into search_after. As before, the request does not name an index because the PIT already determines it.

GET /_search
{
  "size": 10,
  "query": {
    "match" : {
      "user.id" : "elkbee"
    }
  },
  "pit": {
    "id":  "46ToAwMDaWR5BXV1aWQyKwZub2RlXzMAAAAAAAAAACoBYwADaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQADaWR5BXV1aWQyKgZub2RlXzIAAAAAAAAAAAwBYgACBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==", 
    "keep_alive": "1m"
  },
  "sort": [
    {"@timestamp": {"order": "asc", "format": "strict_date_optional_time_nanos"}}
  ],
  "search_after": [                                
    "2021-05-20T05:30:04.832Z", #you can find this value from sort key in response 
    4294967298
  ],
  "track_total_hits": false                        
}
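The same flow can be driven from Ruby. This is a minimal sketch under a few assumptions: the method name each_hit_with_pit and its keyword arguments are hypothetical, the client is expected to behave like a recent Elasticsearch::Client (one that provides open_point_in_time and close_point_in_time), and the documents are assumed to have an @timestamp field as in the requests above.

```ruby
# Hypothetical helper: yields every hit using search_after with a point in time.
# `client` is assumed to be an Elasticsearch::Client (or a compatible object).
def each_hit_with_pit(client, index:, query: { match_all: {} }, keep_alive: '1m', size: 10)
  pit_id = client.open_point_in_time(index: index, keep_alive: keep_alive)['id']
  search_after = nil

  begin
    loop do
      body = {
        size: size,
        query: query,
        pit: { id: pit_id, keep_alive: keep_alive },
        sort: [{ '@timestamp' => { order: 'asc' } }],
        track_total_hits: false
      }
      # Only later pages carry search_after; the first request omits it.
      body[:search_after] = search_after if search_after

      # No index here: the PIT already determines which indices are searched.
      response = client.search(body: body)
      hits = response['hits']['hits']
      break if hits.empty?

      hits.each { |hit| yield hit }
      # The PIT ID can change between requests, so keep the latest one.
      pit_id = response['pit_id'] || pit_id
      # The sort values of the last hit mark where the next page starts.
      search_after = hits.last['sort']
    end
  ensure
    client.close_point_in_time(body: { id: pit_id })
  end
end

# Usage (assumes the elasticsearch gem and a reachable cluster):
#   require 'elasticsearch'
#   client = Elasticsearch::Client.new
#   each_hit_with_pit(client, index: 'test') { |hit| puts hit['_id'] }
```

Closing the PIT in the ensure block keeps the server from holding the snapshot open until keep_alive expires.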
