
Deleted documents when using Elasticsearch API from Python

I'm relatively new to Elasticsearch and am having trouble determining why the number of records in a Python dataframe differs from the document count of the Elasticsearch index.

I first create the index by running the following command: As you can see, there are 62932 records.

I am using the following code to create the index in Elasticsearch: Python code

When I check the index in Kibana under Management/Index Management, there are only 62630 documents. According to the stats window, there is a deleted count of 302. I don't know what this means.

Below is the output from the STATS window:

{
  "_shards": { "total": 2, "successful": 1, "failed": 0 },
  "stats": {
    "uuid": "egOx_6EwTFysBr0WkJyR1Q",
    "primaries": {
      "docs": { "count": 62630, "deleted": 302 },
      "store": { "size_in_bytes": 4433722 },
      "indexing": { "index_total": 62932, "index_time_in_millis": 3235, "index_current": 0, "index_failed": 0, "delete_total": 0, "delete_time_in_millis": 0, "delete_current": 0, "noop_update_total": 0, "is_throttled": false, "throttle_time_in_millis": 0 },
      "get": { "total": 0, "time_in_millis": 0, "exists_total": 0, "exists_time_in_millis": 0, "missing_total": 0, "missing_time_in_millis": 0, "current": 0 },
      "search": { "open_contexts": 0, "query_total": 140, "query_time_in_millis": 1178, "query_current": 0, "fetch_total": 140, "fetch_time_in_millis": 1233, "fetch_current": 0, "scroll_total": 1, "scroll_time_in_millis": 6262, "scroll_current": 0, "suggest_total": 0, "suggest_time_in_millis": 0, "suggest_current": 0 },
      "merges": { "current": 0, "current_docs": 0, "current_size_in_bytes": 0, "total": 2, "total_time_in_millis": 417, "total_docs": 62932, "total_size_in_bytes": 4882755, "total_stopped_time_in_millis": 0, "total_throttled_time_in_millis": 0, "total_auto_throttle_in_bytes": 20971520 },
      "refresh": { "total": 26, "total_time_in_millis": 597, "external_total": 24, "external_total_time_in_millis": 632, "listeners": 0 },
      "flush": { "total": 1, "periodic": 0, "total_time_in_millis": 10 },
      "warmer": { "current": 0, "total": 23, "total_time_in_millis": 0 },
      "query_cache": { "memory_size_in_bytes": 17338, "total_count": 283, "hit_count": 267, "miss_count": 16, "cache_size": 4, "cache_count": 4, "evictions": 0 },
      "fielddata": { "memory_size_in_bytes": 0, "evictions": 0 },
      "completion": { "size_in_bytes": 0 },
      "segments": { "count": 2, "memory_in_bytes": 22729, "terms_memory_in_bytes": 17585, "stored_fields_memory_in_bytes": 2024, "term_vectors_memory_in_bytes": 0, "norms_memory_in_bytes": 512, "points_memory_in_bytes": 2112, "doc_values_memory_in_bytes": 496, "index_writer_memory_in_bytes": 0, "version_map_memory_in_bytes": 0, "fixed_bit_set_memory_in_bytes": 0, "max_unsafe_auto_id_timestamp": -1, "file_sizes": {} },
      "translog": { "operations": 62932, "size_in_bytes": 17585006, "uncommitted_operations": 0, "uncommitted_size_in_bytes": 55, "earliest_last_modified_age": 0 },
      "request_cache": { "memory_size_in_bytes": 0, "evictions": 0, "hit_count": 0, "miss_count": 0 },
      "recovery": { "current_as_source": 0, "current_as_target": 0, "throttle_time_in_millis": 0 }
    },
    "total": {
      "docs": { "count": 62630, "deleted": 302 },
      "store": { "size_in_bytes": 4433722 },
      "indexing": { "index_total": 62932, "index_time_in_millis": 3235, "index_current": 0, "index_failed": 0, "delete_total": 0, "delete_time_in_millis": 0, "delete_current": 0, "noop_update_total": 0, "is_throttled": false, "throttle_time_in_millis": 0 },
      "get": { "total": 0, "time_in_millis": 0, "exists_total": 0, "exists_time_in_millis": 0, "missing_total": 0, "missing_time_in_millis": 0, "current": 0 },
      "search": { "open_contexts": 0, "query_total": 140, "query_time_in_millis": 1178, "query_current": 0, "fetch_total": 140, "fetch_time_in_millis": 1233, "fetch_current": 0, "scroll_total": 1, "scroll_time_in_millis": 6262, "scroll_current": 0, "suggest_total": 0, "suggest_time_in_millis": 0, "suggest_current": 0 },
      "merges": { "current": 0, "current_docs": 0, "current_size_in_bytes": 0, "total": 2, "total_time_in_millis": 417, "total_docs": 62932, "total_size_in_bytes": 4882755, "total_stopped_time_in_millis": 0, "total_throttled_time_in_millis": 0, "total_auto_throttle_in_bytes": 20971520 },
      "refresh": { "total": 26, "total_time_in_millis": 597, "external_total": 24, "external_total_time_in_millis": 632, "listeners": 0 },
      "flush": { "total": 1, "periodic": 0, "total_time_in_millis": 10 },
      "warmer": { "current": 0, "total": 23, "total_time_in_millis": 0 },
      "query_cache": { "memory_size_in_bytes": 17338, "total_count": 283, "hit_count": 267, "miss_count": 16, "cache_size": 4, "cache_count": 4, "evictions": 0 },
      "fielddata": { "memory_size_in_bytes": 0, "evictions": 0 },
      "completion": { "size_in_bytes": 0 },
      "segments": { "count": 2, "memory_in_bytes": 22729, "terms_memory_in_bytes": 17585, "stored_fields_memory_in_bytes": 2024, "term_vectors_memory_in_bytes": 0, "norms_memory_in_bytes": 512, "points_memory_in_bytes": 2112, "doc_values_memory_in_bytes": 496, "index_writer_memory_in_bytes": 0, "version_map_memory_in_bytes": 0, "fixed_bit_set_memory_in_bytes": 0, "max_unsafe_auto_id_timestamp": -1, "file_sizes": {} },
      "translog": { "operations": 62932, "size_in_bytes": 17585006, "uncommitted_operations": 0, "uncommitted_size_in_bytes": 55, "earliest_last_modified_age": 0 },
      "request_cache": { "memory_size_in_bytes": 0, "evictions": 0, "hit_count": 0, "miss_count": 0 },
      "recovery": { "current_as_source": 0, "current_as_target": 0, "throttle_time_in_millis": 0 }
    }
  }
}

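Why does the document count differ from the index total? I have exported the data, and the number of records matches the document count. How can I find out why the documents were deleted and make sure this doesn't happen in the future?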

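Possible reasons this matters:

  • Deleted documents tie up disk space in the index.
  • In-memory per-document data structures, such as norms or field data, still consume RAM for deleted documents.
  • Search throughput is lower, because each search must check the deleted bitset for every potential hit.
  • Aggregate term statistics, which are used for query scoring, still reflect deleted terms and documents. When a merge completes, the term statistics suddenly jump closer to their true values, which changes hit scores. In practice this effect is minor unless the deleted documents have different statistics from the rest of the index.
  • A deleted document ties up one document ID out of the maximum of 2.1 B documents for a single shard. If your shard is close to that limit (not recommended), this could matter.
  • Fuzzy queries can return slightly different results, because they may match ghost terms.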
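
The counters in the question's stats output can be reconciled directly. `delete_total` is 0, so no explicit delete requests were issued, and `docs.count` plus `docs.deleted` equals `index_total`. That pattern is what you would expect when some index operations replaced an existing document with the same `_id`, although the question does not show enough to confirm it. A small sketch using only the numbers quoted in the question (the `reconcile` helper is a hypothetical name, not part of any Elasticsearch client):

```python
# Counters copied from the "primaries" section of the stats output in the question.
primaries = {
    "docs": {"count": 62630, "deleted": 302},
    "indexing": {"index_total": 62932, "delete_total": 0},
}


def reconcile(stats):
    """Compare index operations against live and deleted document counts."""
    docs, indexing = stats["docs"], stats["indexing"]
    return {
        # Delete requests actually sent to the index.
        "explicit_deletes": indexing["delete_total"],
        # Index operations not explained by live + deleted documents.
        "unaccounted": indexing["index_total"] - docs["count"] - docs["deleted"],
    }


result = reconcile(primaries)
print(result)
```

Both values come out as zero here, meaning every one of the 62932 index operations is accounted for by either a live document or a deleted one.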

https://www.elastic.co/guide/en/elasticsearch/reference/current/cat-indices.html
https-handlingof-delete-document//
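
Indexing a document under an `_id` that already exists replaces the earlier version, and the replaced version is reported under `docs.deleted` until a segment merge purges it. If the Python code derives `_id` from a dataframe column that contains repeated values (an assumption, since the question's code is not shown), that alone would produce deleted documents without any delete request. A minimal sketch of that bookkeeping, using a plain dict as a stand-in for the index; the helpers `simulate_indexing` and `duplicate_ids` and the sample rows are hypothetical:

```python
from collections import Counter


def simulate_indexing(rows, id_field):
    """Mimic index-by-_id semantics: a repeated _id overwrites the earlier document."""
    live = {}        # _id -> latest document (what docs.count reflects)
    overwritten = 0  # replaced versions (what docs.deleted reflects before merges)
    for row in rows:
        doc_id = row[id_field]
        if doc_id in live:
            overwritten += 1
        live[doc_id] = row
    return {"index_total": len(rows), "count": len(live), "deleted": overwritten}


def duplicate_ids(rows, id_field):
    """Return the id values that appear more than once, with their frequencies."""
    counts = Counter(row[id_field] for row in rows)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# Hypothetical sample data: five rows, two of which reuse an existing id.
rows = [
    {"id": "a", "value": 1},
    {"id": "b", "value": 2},
    {"id": "a", "value": 3},
    {"id": "c", "value": 4},
    {"id": "b", "value": 5},
]

stats = simulate_indexing(rows, "id")
print(stats)
print(duplicate_ids(rows, "id"))
```

On a pandas dataframe, a comparable check is `df[id_column].duplicated().sum()`, where `id_column` stands for whichever column supplies the `_id`. If the repeats are unintended, building the `_id` from a unique key, or letting Elasticsearch generate it, keeps every row as its own document.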
