
Elasticsearch: how to check the status of a bulk indexing request?

I am bulk indexing documents containing country shapes (files here) into Elasticsearch, based on the cshapes dataset.

The geoshapes have a lot of points in "geometry":{"type":"MultiPolygon", and the bulk request takes a long time to complete (and sometimes does not complete, which is a separate and already reported problem).

Since the client times out (I use the official ES node.js client), I would like a way to check the status of the bulk request without having to use enormous timeout values.

What I would like is a status such as active/running, completed or aborted. I guess that just querying for a single doc from the batch would not tell me whether the request has been aborted.

Is this possible?

Elasticsearch doesn't provide a way to check the status of an ongoing Bulk request (documentation reference here).

First, check that your request succeeds with a smaller input, so you know there is no problem with the way you are making the request. Second, try dividing the data into smaller chunks and calling the Bulk API on them in parallel.
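The chunking advice above could be sketched roughly as follows. This is only an illustration, not the client's API: `chunk`, `bulkInChunks` and the `send` callback are hypothetical names, and `send` stands in for whatever call you use to submit one chunk to the Bulk API (it is assumed to resolve to true when the chunk was indexed without errors).

```typescript
// Split an array into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

type ChunkStatus = "completed" | "failed";

// Hypothetical callback: submits one chunk and resolves to true on success.
type SendBulk<T> = (docs: T[]) => Promise<boolean>;

// Send all chunks with at most `concurrency` requests in flight at once.
// Resolves to one status per chunk, in chunk order.
async function bulkInChunks<T>(
  docs: T[],
  chunkSize: number,
  concurrency: number,
  send: SendBulk<T>
): Promise<ChunkStatus[]> {
  const chunks = chunk(docs, chunkSize);
  const statuses: ChunkStatus[] = chunks.map((): ChunkStatus => "failed");
  let next = 0;

  // Each worker repeatedly claims the next unsent chunk until none remain.
  async function worker(): Promise<void> {
    while (next < chunks.length) {
      const index = next++;
      try {
        statuses[index] = (await send(chunks[index])) ? "completed" : "failed";
      } catch {
        // A rejected promise (for example a client timeout) counts as failed.
        statuses[index] = "failed";
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, chunks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return statuses;
}
```

Because each chunk is small, a moderate per-request timeout should be enough, and the per-chunk statuses give you a coarse client-side view of what completed and what did not. Note that a timed-out chunk is only known to have failed from the client's point of view.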

You can also try a higher request_timeout value, but I guess that is something you don't want to do.

I'm not sure if this is exactly what you're looking for, but it may be helpful. Whenever I'm curious about what my cluster is doing, I check out the tasks API.

The tasks API shows you all of the tasks that are currently running on your cluster. It will give you information about individual tasks, such as the task ID, start time, and running time. Here's the command:

curl -XGET 'http://localhost:9200/_tasks?group_by=parents' | python -m json.tool
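If you fetch that output from Node.js instead, a small helper can pick out the bulk tasks. The sketch below assumes the response shape returned with group_by=parents (a top-level "tasks" object keyed by task ID) and assumes that bulk tasks carry an action name containing "bulk", such as indices:data/write/bulk. The function name `runningBulkTasks` and the sample data are made up for illustration.

```typescript
// Subset of the per-task fields used here; real responses contain more.
interface TaskInfo {
  node: string;
  id: number;
  action: string;
  start_time_in_millis: number;
  running_time_in_nanos: number;
}

// Assumed shape of GET _tasks?group_by=parents.
interface TasksResponse {
  tasks: { [taskId: string]: TaskInfo };
}

interface RunningBulk {
  taskId: string;
  action: string;
  runningSeconds: number;
}

// Return the top-level tasks whose action name mentions "bulk".
function runningBulkTasks(response: TasksResponse): RunningBulk[] {
  return Object.keys(response.tasks)
    .filter((taskId) => response.tasks[taskId].action.indexOf("bulk") !== -1)
    .map((taskId) => ({
      taskId,
      action: response.tasks[taskId].action,
      runningSeconds: response.tasks[taskId].running_time_in_nanos / 1e9,
    }));
}

// Illustrative sample only, not real cluster output.
const sampleResponse: TasksResponse = {
  tasks: {
    "nodeA:101": {
      node: "nodeA",
      id: 101,
      action: "indices:data/write/bulk",
      start_time_in_millis: 1500000000000,
      running_time_in_nanos: 12500000000,
    },
    "nodeA:102": {
      node: "nodeA",
      id: 102,
      action: "cluster:monitor/tasks/lists",
      start_time_in_millis: 1500000012000,
      running_time_in_nanos: 400000,
    },
  },
};

console.log(runningBulkTasks(sampleResponse));
```

Since the tasks API lists only running tasks, a bulk task that no longer appears has either finished or been aborted. This output alone cannot tell you which.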

A side note on why your requests might take a lot of time (unless you are simply indexing too many documents in a single bulk run): if you have configured your own precision for geo shapes, also make sure you are configuring distance_error_pct. Otherwise no error is assumed, resulting in documents with a lot of terms that take a long time to index.
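As a rough illustration, a mapping that sets both parameters might look like the object below. This assumes the older prefix-tree based geo_shape mapping, where precision and distance_error_pct are field parameters. The values "1km" and 0.025 are placeholders rather than recommendations, and `geoShapeMapping` is a hypothetical helper name.

```typescript
// Build a mapping for the "geometry" field with an explicit error tolerance.
// Assumption: prefix-tree geo_shape parameters, as in older Elasticsearch versions.
function geoShapeMapping(precision: string, distanceErrorPct: number) {
  return {
    properties: {
      geometry: {
        type: "geo_shape",
        precision: precision,
        // Set explicitly so that configuring precision does not leave the
        // error tolerance at zero.
        distance_error_pct: distanceErrorPct,
      },
    },
  };
}

// Placeholder values for illustration only.
const exampleMapping = geoShapeMapping("1km", 0.025);
console.log(JSON.stringify(exampleMapping, null, 2));
```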
