Elasticsearch: indexing 100K documents with the BulkRequest API using the Java RestHighLevelClient

I am reading more than 100K file paths from the index documents_qa using the scroll API. The actual files are on my local D:\ drive. For each file path I read the actual file, convert it to base64, and reindex the document with the base64 content of the file into another index, document_attachment_qa .

My current implementation reads the filePath, converts the file to base64, and indexes the document together with its fileContent, one document at a time. This is very slow: indexing 4,000 documents takes more than 6 hours, and the connection is also terminated by an IO exception .
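For reference, the read-and-encode step described above can be sketched with the JDK alone. This is a minimal sketch, not the code from my project: the class name FileBase64 and the method name encodeFile are hypothetical, and it assumes each file is small enough to be loaded into memory in one piece.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

class FileBase64 {
    /**
     * Reads the whole file into memory and returns its content as a base64 string.
     * Assumption: the file fits comfortably in the heap.
     */
    static String encodeFile(Path path) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(bytes);
    }
}
```

The returned string is what goes into the fileContent field of each document.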

So now I want to index the documents using the BulkRequest API, but I am using RestHighLevelClient and am not sure how to use the BulkRequest API with RestHighLevelClient .

Here is my current implementation, which indexes the documents one at a time.

jsonMap = new HashMap<String, Object>();
jsonMap.put("id", doc.getId());
jsonMap.put("app_language", doc.getApp_language());
jsonMap.put("fileContent", result);

String id = Long.toString(doc.getId());

IndexRequest request = new IndexRequest(ATTACHMENT, "doc", id) // ATTACHMENT is the index name
        .source(jsonMap) // my single document
        .setPipeline(ATTACHMENT);

IndexResponse response = SearchEngineClient.getInstance3().index(request); // increased timeout

I found the documentation below for BulkRequest .

https://www.elastic.co/guide/en/elasticsearch/client/java-api/current/java-docs-bulk.html

But I am not sure how to implement BulkRequestBuilder bulkRequest = client.prepareBulk(); that is, how to call the client.prepareBulk() method when using RestHighLevelClient .

UPDATE 1

I am trying to index all 100K documents in one shot, so I create one JSONArray and put all my JSONObjects into the array one by one. Finally I try to build a BulkRequest, add all my documents (the JSONArray) to it as the source, and index them.

Here I am not sure how to convert my JSONArray to a List of Strings.

private final static String ATTACHMENT = "document_attachment_qa";
private final static String TYPE = "doc";
JSONArray reqJSONArray=new JSONArray();

while (searchHits != null && searchHits.length > 0) { 
...
...
    jsonMap = new HashMap<String, Object>();
    jsonMap.put("id", doc.getId());
    jsonMap.put("app_language", doc.getApp_language());
    jsonMap.put("fileContent", result);

    reqJSONArray.put(jsonMap);
}

String actionMetaData = String.format("{ \"index\" : { \"_index\" : \"%s\", \"_type\" : \"%s\" } }%n", ATTACHMENT, TYPE);
List<String> bulkData =   // not sure how to convert a list of my documents in JSON strings    
StringBuilder bulkRequestBody = new StringBuilder();
for (String bulkItem : bulkData) {
    bulkRequestBody.append(actionMetaData);
    bulkRequestBody.append(bulkItem);
    bulkRequestBody.append("\n");
}

HttpEntity entity = new NStringEntity(bulkRequestBody.toString(), ContentType.APPLICATION_JSON);
try {
    Response response = SearchEngineClient.getRestClientInstance().performRequest("POST", "/" + ATTACHMENT + "/" + TYPE + "/_bulk", Collections.emptyMap(), entity);
    return response.getStatusLine().getStatusCode() == HttpStatus.SC_OK;
} catch (Exception e) {
    // do something
}
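The request body that the loop above is trying to assemble is newline-delimited JSON: one action line, then one document line, each terminated by a newline. Below is a minimal JDK-only sketch of that assembly. The class BulkBodySketch and its methods are hypothetical names, and the tiny serializer only handles flat maps whose values are strings, numbers, booleans, or null; in a real project a JSON library would normally do this step (with org.json, calling reqJSONArray.getJSONObject(i).toString() for each index i should yield the same kind of single-line string).

```java
import java.util.List;
import java.util.Map;

class BulkBodySketch {
    /** Escapes a string for use inside a JSON string literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Serializes a flat map to a single-line JSON object (no nested values). */
    static String toJsonLine(Map<String, Object> doc) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":");
            Object v = e.getValue();
            if (v == null || v instanceof Number || v instanceof Boolean) {
                sb.append(v); // written bare: null, 42, true
            } else {
                sb.append('"').append(escape(String.valueOf(v))).append('"');
            }
        }
        return sb.append('}').toString();
    }

    /** Builds the body: an action line plus a source line per document, each ending in a newline. */
    static String buildBulkBody(String index, String type, List<Map<String, Object>> docs) {
        String action = "{\"index\":{\"_index\":\"" + escape(index)
                + "\",\"_type\":\"" + escape(type) + "\"}}\n";
        StringBuilder body = new StringBuilder();
        for (Map<String, Object> doc : docs) {
            body.append(action).append(toJsonLine(doc)).append('\n');
        }
        return body.toString();
    }
}
```

The resulting string could be passed to NStringEntity in place of bulkRequestBody.toString().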

You can simply create a new BulkRequest() and add the requests to it directly, without using BulkRequestBuilder , like this:

BulkRequest request = new BulkRequest();
request.add(new IndexRequest("foo", "bar", "1")
        .source(XContentType.JSON,"field", "foobar"));
request.add(new IndexRequest("foo", "bar", "2")
        .source(XContentType.JSON,"field", "foobar"));
...
BulkResponse bulkResponse = myHighLevelClient.bulk(request, RequestOptions.DEFAULT);
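Since each document here carries a whole base64-encoded file, putting all 100K of them into a single BulkRequest is likely to exceed what one HTTP request can carry (Elasticsearch caps the request body size through the http.max_content_length setting). A common approach is to split the documents into batches and send one BulkRequest per batch. The sketch below shows only the splitting step, using the JDK alone; the class BatchSketch, the method partition, and the limits maxDocs and maxChars are hypothetical names, and character count is used as a rough stand-in for payload size.

```java
import java.util.ArrayList;
import java.util.List;

class BatchSketch {
    /**
     * Splits docs into consecutive batches. A batch is closed as soon as adding the
     * next document would exceed maxDocs documents or maxChars total characters.
     * A single document longer than maxChars still gets a batch of its own.
     */
    static List<List<String>> partition(List<String> docs, int maxDocs, long maxChars) {
        if (maxDocs <= 0 || maxChars <= 0) {
            throw new IllegalArgumentException("limits must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentChars = 0;
        for (String doc : docs) {
            boolean full = current.size() >= maxDocs
                    || currentChars + doc.length() > maxChars;
            if (full && !current.isEmpty()) {
                batches.add(current);          // close the current batch
                current = new ArrayList<>();
                currentChars = 0;
            }
            current.add(doc);
            currentChars += doc.length();
        }
        if (!current.isEmpty()) {
            batches.add(current);              // last, possibly partial, batch
        }
        return batches;
    }
}
```

Each returned batch would then be turned into one BulkRequest, as in the snippet above, and sent before the next batch is built, so that only one batch of file content is held in memory at a time.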

In addition to @chengpohi's answer, I would like to add the points below:

A BulkRequest can be used to execute multiple index, update and/or delete operations using a single request.

It requires at least one operation to be added to the Bulk request:

BulkRequest request = new BulkRequest(); 
request.add(new IndexRequest("posts", "doc", "1")  
        .source(XContentType.JSON,"field", "foo"));
request.add(new IndexRequest("posts", "doc", "2")  
        .source(XContentType.JSON,"field", "bar"));
request.add(new IndexRequest("posts", "doc", "3")  
        .source(XContentType.JSON,"field", "baz"));

Note: The Bulk API supports only documents encoded in JSON or SMILE. Providing documents in any other format will result in an error.

Synchronous Operation:

BulkResponse bulkResponse = client.bulk(request, RequestOptions.DEFAULT);

Here client is the High-Level REST Client, and the call executes synchronously.

Asynchronous Operation (Recommended Approach):

client.bulkAsync(request, RequestOptions.DEFAULT, listener);

The asynchronous execution of a bulk request requires both the BulkRequest instance and an ActionListener instance to be passed to the asynchronous method.

Listener Example:

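ActionListener<BulkResponse> listener = new ActionListener<BulkResponse>() {
    @Override
    public void onResponse(BulkResponse bulkResponse) {
        // Called when the bulk request completes; individual operations may still have failed.
    }

    @Override
    public void onFailure(Exception e) {
        // Called when the whole bulk request fails.
    }
};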

The returned BulkResponse contains information about the executed operations and lets you iterate over each result as follows:

for (BulkItemResponse bulkItemResponse : bulkResponse) { 
    DocWriteResponse itemResponse = bulkItemResponse.getResponse(); 

    if (bulkItemResponse.getOpType() == DocWriteRequest.OpType.INDEX
            || bulkItemResponse.getOpType() == DocWriteRequest.OpType.CREATE) { 
        IndexResponse indexResponse = (IndexResponse) itemResponse;

    } else if (bulkItemResponse.getOpType() == DocWriteRequest.OpType.UPDATE) { 
        UpdateResponse updateResponse = (UpdateResponse) itemResponse;

    } else if (bulkItemResponse.getOpType() == DocWriteRequest.OpType.DELETE) { 
        DeleteResponse deleteResponse = (DeleteResponse) itemResponse;
    }
}

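Optional arguments can also be provided, for example a timeout to wait for the bulk request to be performed, given either as a TimeValue or as a String: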

request.timeout(TimeValue.timeValueMinutes(2)); 
request.timeout("2m");

I hope this helps.
