
Elasticsearch scan and scroll - add to new index

Elasticsearch and command-line programming newbie question.

I have Elasticsearch set up locally on my computer and want to use the scan and scroll API to pull documents from a server running a different version of Elasticsearch and add them to my index. I am having trouble figuring out how to do this with the Elasticsearch bulk API.

Right now, in my testing phase, I am just pulling a few documents from the server with the following code (which works):

   http MY-OLD-ES.com:9200/INDEX/TYPE/_search?size=1000 \
     | jq -c '.hits.hits[]' \
     | while read x; do
         id="$(echo "$x" | jq -r ._id)"
         index="$(echo "$x" | jq -r ._index)"
         type="$(echo "$x" | jq -r ._type)"
         doc="$(echo "$x" | jq ._source)"
         http put "localhost:9200/junk-$index/$type/$id" <<<"$doc"
       done

Any tips on how scan and scroll works? (Noob here and a bit confused.) So far I know I can scroll and get a scroll ID, but I'm unclear what to do with the scroll ID. If I call

http get 'http://MY-OLD-ES.com:9200/my_index/_search?scroll=1m&search_type=scan&size=10'

I'll receive a scroll ID. Can this be piped in and parsed the same way? Additionally, I believe I'll need a while loop to tell it to keep requesting. How exactly should I go about this?

Thanks!

The scan and scroll documentation explains it pretty clearly. After you get the scroll_id (a long base64-encoded string), you pass it in as the body of the request. With curl, the request would look something like this:

curl -XGET 'http://MY-OLD-ES.com:9200/_search/scroll?scroll=1m' -d '
c2Nhbjs1OzExODpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExOTpRNV9aY1VyUVM4U0 
NMd2pjWlJ3YWlBOzExNjpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzExNzpRNV9aY1Vy
UVM4U0NMd2pjWlJ3YWlBOzEyMDpRNV9aY1VyUVM4U0NMd2pjWlJ3YWlBOzE7dG90YW
xfaGl0czoxOw==
'

Notice that while the first request, which opens the scroll, was to /my_index/_search , the second request, which reads the data, was to /_search/scroll . Each time you call that, passing the ?scroll=1m query string, it refreshes the timeout before the scroll is automatically closed.
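Since you asked about piping: the scroll_id can be pulled out of the initial response with jq instead of copied by hand. A minimal sketch, assuming jq is available and the older search_type=scan syntax your cluster supports:

   scroll_id=$(curl -s 'http://MY-OLD-ES.com:9200/my_index/_search?scroll=1m&search_type=scan&size=10' \
     | jq -r ._scroll_id)

   curl -XGET 'http://MY-OLD-ES.com:9200/_search/scroll?scroll=1m' -d "$scroll_id"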

There are two more things to be aware of:

  1. The size you pass when opening the scroll applies to each shard, so each request will return size multiplied by the number of shards in your index.
  2. Each request to /_search/scroll returns a new scroll_id , which you must pass on the next call to get the next batch of results. You can't just keep calling with the same scroll_id .

It is complete when no hits are returned in the scroll request, as in the sketch below.
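Putting those pieces together, a reindexing loop along the following lines could work. This is a rough, untested sketch: the index name, batch size, and the junk- prefix on the local index are carried over from the examples above, and the local cluster is assumed to be reachable at localhost:9200.

   #!/bin/bash
   OLD_ES='http://MY-OLD-ES.com:9200'
   NEW_ES='http://localhost:9200'

   # Open the scroll. With search_type=scan the first response contains no hits,
   # only the _scroll_id used to fetch the first real batch.
   scroll_id=$(curl -s "$OLD_ES/my_index/_search?scroll=1m&search_type=scan&size=10" \
     | jq -r ._scroll_id)

   while true; do
     response=$(curl -s "$OLD_ES/_search/scroll?scroll=1m" -d "$scroll_id")

     # Each response carries a new scroll_id that must be used on the next call.
     scroll_id=$(echo "$response" | jq -r ._scroll_id)

     # Done once a scroll request returns no hits.
     hits=$(echo "$response" | jq '.hits.hits | length')
     [ "$hits" -eq 0 ] && break

     # Emit an action/metadata line followed by the document source for each hit,
     # which is the newline-delimited format the _bulk endpoint expects.
     echo "$response" \
       | jq -c '.hits.hits[]
                | {index: {_index: ("junk-" + ._index), _type: ._type, _id: ._id}}, ._source' \
       | curl -s -XPOST "$NEW_ES/_bulk" \
              -H 'Content-Type: application/x-ndjson' --data-binary @- > /dev/null
   done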
