
Elastic Search ingest attachment plugin blocks

I am using NEST (C#) and the ingest attachment plugin to ingest 10s of thousands of documents into an Elastic search instance. 我正在使用NEST(C#)和摄取附件插件将成千上万的文档摄取到Elastic搜索实例中。 Unfortunately, after a while everything just stands still - ie no more documents are ingested. 不幸的是,过了一会儿一切都停滞不前-即不再提取任何文档。 The log shows: 日志显示:

[2019-02-20T17:35:07,528][INFO ][o.e.m.j.JvmGcMonitorService] [BwAAiDl] [gc][7412] overhead, spent [326ms] collecting in the last [1s]

I am not sure whether this tells anyone anything. By the way, are there more efficient ways to ingest many documents (rather than using thousands of REST requests)?

I am using this kind of code:

client.Index(new Document
{
    Id = Guid.NewGuid(),
    Path = somePath,
    Content = Convert.ToBase64String(File.ReadAllBytes(somePath))
}, i => i.Pipeline("attachments"));

Define the pipeline:

client.PutPipeline("attachments", p => p
    .Description("Document attachment pipeline")
    .Processors(pr => pr
        .Attachment<Document>(a => a
            .Field(f => f.Content)
            .TargetField(f => f.Attachment)
        )
        .Remove<Document>(r => r
            .Field(f => f.Content)
        )
    )
);

The log indicates that a considerable amount of time is being spent performing garbage collection on the Elasticsearch server side; this is very likely to be the cause of the large stop events that you are seeing. If you have monitoring enabled on the cluster (ideally exporting such data to a separate cluster), I would look at analysing that data to see if it sheds some light on why large GCs are happening.
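
If full monitoring is not set up, the nodes stats API is a quick way to see heap usage and garbage collection counts per node. Here is a minimal sketch using a plain HTTP call, assuming the cluster is reachable at http://localhost:9200 without authentication (adjust the URL and credentials for your setup):

var http = new HttpClient();

// _nodes/stats/jvm returns heap usage and GC collector counts/times for each node
var json = http.GetStringAsync("http://localhost:9200/_nodes/stats/jvm")
    .GetAwaiter()
    .GetResult();

// look at jvm.mem.heap_used_percent and jvm.gc.collectors in the output;
// a heap that stays close to its limit, or rapidly growing collection times,
// points to memory pressure on that node
Console.WriteLine(json);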

are there more efficient ways to ingest many documents (rather than using thousands of REST requests)?

Yes, you are currently indexing each attachment in a separate index request. Depending on the size of each base64-encoded attachment, you may want to send several in one bulk request:

// Your collection of documents
var documents = new[]
{
    new Document
    {
        Id = Guid.NewGuid(),
        Path = "path",
        Content = "content"
    },
    new Document
    {
        Id = Guid.NewGuid(),
        Path = "path",
        Content = "content" // base64 encoded bytes
    }
};

var client = new ElasticClient();

var bulkResponse = client.Bulk(b => b
    .Pipeline("attachments")
    .IndexMany(documents)
);

If you're reading documents from the filesystem, you probably want to lazily enumerate them and send bulk requests. Here, you can make use of the BulkAll helper method too.

First, have some lazily enumerated collection of documents:

public static IEnumerable<Document> GetDocuments()
{
    var count = 0;
    while (count++ < 20)
    {
        yield return new Document
        {
            Id = Guid.NewGuid(),
            Path = "path",
            Content = "content" // base64 encoded bytes
        };
    }
}

Then configure the BulkAll call:

var client = new ElasticClient();

// set up the observable configuration
var bulkAllObservable = client.BulkAll(GetDocuments(), ba => ba
    .Pipeline("attachments")
    .Size(10)
);

var waitHandle = new ManualResetEvent(false);

Exception exception = null;

// set up what to do in response to next bulk call, exception and completion
var bulkAllObserver = new BulkAllObserver(
    onNext: response => 
    {
        // perform some action e.g. incrementing counter
        // to indicate how many have been indexed
    },
    onError: e =>
    {
        exception = e;
        waitHandle.Set();
    },
    onCompleted: () =>
    {
        waitHandle.Set();
    });

// start the observable process
bulkAllObservable.Subscribe(bulkAllObserver);

// wait for indexing to finish, either forever,
// or set a max timeout as here.
waitHandle.WaitOne(TimeSpan.FromHours(1));

if (exception != null)
    throw exception;

Size dictates how many documents to send in each bulk request. There are no hard and fast rules for how big this can be for your cluster, because it depends on a number of factors, including the ingest pipeline, the mapping of the documents, the byte size of the documents, the cluster hardware, etc. You can configure the observable to retry documents that fail to be indexed, and if you see es_rejected_execution_exception, you are at the limits of what your cluster can concurrently handle.
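
For example, the retry behaviour can be tuned through the back-off options on the BulkAll descriptor; a minimal sketch in which the retry count, wait time and degree of parallelism are illustrative values rather than recommendations:

var bulkAllObservable = client.BulkAll(GetDocuments(), ba => ba
    .Pipeline("attachments")
    .Size(10)
    // resend a bulk request rejected with a 429 (es_rejected_execution_exception),
    // waiting 30 seconds between attempts, up to 2 retries
    .BackOffTime("30s")
    .BackOffRetries(2)
    // limit how many bulk requests are in flight at the same time
    .MaxDegreeOfParallelism(4)
);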

Another recommendation concerns document ids. I see you're using new Guids for the ids of documents, which implies to me that you don't care what the value is for each document. If that is the case, I would recommend not sending an Id value at all, and instead allowing Elasticsearch to generate an id for each document. This is very likely to result in an improvement in indexing performance (I believe the implementation has changed slightly in Elasticsearch and Lucene since this post, but the point still stands).
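
As a sketch of what that could look like, assuming the Id property is removed from the Document POCO (or otherwise excluded from NEST's id inference) so that no _id is sent and Elasticsearch auto-generates one per document:

// with no Id property on Document, NEST cannot infer an _id for each operation,
// so Elasticsearch assigns an auto-generated id to every indexed document
var documents = new[]
{
    new Document { Path = "path", Content = "content" /* base64 encoded bytes */ },
    new Document { Path = "path", Content = "content" /* base64 encoded bytes */ }
};

var bulkResponse = client.Bulk(b => b
    .Pipeline("attachments")
    .IndexMany(documents)
);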
