
Elasticsearch scroll scan query doesn't return all documents, missing first set

I'm trying to scroll through my ES index and grab all the documents, but I keep missing the first set of documents returned by the initial scroll request. For example, if my scroll size is 10 and my query matches 100 documents in total, I end up with only 90 documents after scrolling. Any suggestions on what I'm missing?

Here's what I've currently tried:

$json = '{"query":{"bool":{"must":[{"match_all":{}}]}}}';

$params = [
    "scroll" => "1m",
    "size" => 50,
    "index" => "myindex",
    "type" => "mytype",
    "body" => $json 
];

$results = $client->search($params);
$scroll_size = $results['hits']['total']; // returns total docs that match query
$s_id = $results['_scroll_id'];

print " total results:   " . $scroll_size;

//scroll
$count = 0;
while ($scroll_size > 0) {
    print "  SCROLLING...";
    $scroll_results = $client->scroll([
        'scroll_id' => $s_id,
        'scroll' => '1m'
    ]);

    // get number of results returned in the last scroll
    $scroll_size = sizeof($scroll_results['hits']['hits']);
    print "  scroll size: " . $scroll_size;

    // do something with results
    for ($i=0; $i<$scroll_size; $i++) {
        $count++;
    }
}
print " total id count: " . $count;

The first query you execute doesn't only return the number of matching documents; it also returns documents. That initial search both establishes the scroll and delivers the first set of hits. Once you have processed that first set, use the scroll_id to fetch the next page, and so on.
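A minimal sketch of that flow, for illustration only. The helper name `scrollAll` and the `FakeScrollClient` stand-in are hypothetical (they are not part of the Elasticsearch PHP client); the stand-in only mimics the `search()` / `scroll()` response shape used in the question so the loop can run without a cluster. The point is that the hits from the initial search are processed before the first `scroll()` call:

```php
<?php
// Hypothetical helper: visits every page of a scroll, including the first one.
// $client is assumed to expose search() and scroll() like the client in the question.
function scrollAll($client, array $params, callable $onHit): int
{
    $response = $client->search($params); // the first page of hits arrives here
    $count = 0;

    // Keep going until a page comes back empty.
    while (count($response['hits']['hits']) > 0) {
        foreach ($response['hits']['hits'] as $hit) {
            $onHit($hit);
            $count++;
        }
        // Ask for the next page using the scroll id from the latest response.
        $response = $client->scroll([
            'scroll_id' => $response['_scroll_id'],
            'scroll'    => $params['scroll'],
        ]);
    }
    return $count;
}

// Stand-in client (an assumption for this sketch, not a real library class).
class FakeScrollClient
{
    private array $docs;
    private int $size = 0;
    private int $offset = 0;

    public function __construct(int $total)
    {
        $this->docs = $total > 0 ? range(1, $total) : [];
    }

    public function search(array $params): array
    {
        $this->size = $params['size'];
        $this->offset = 0;
        return $this->page();
    }

    public function scroll(array $params): array
    {
        return $this->page();
    }

    // Returns the next slice of documents in a search-like response shape.
    private function page(): array
    {
        $slice = array_slice($this->docs, $this->offset, $this->size);
        $this->offset += count($slice);
        $hits = array_map(fn($id) => ['_id' => (string) $id], $slice);
        return [
            '_scroll_id' => 'fake-scroll-id',
            'hits' => ['total' => count($this->docs), 'hits' => $hits],
        ];
    }
}

$ids = [];
$total = scrollAll(
    new FakeScrollClient(100),
    ['scroll' => '1m', 'size' => 10, 'index' => 'myindex'],
    function (array $hit) use (&$ids) { $ids[] = $hit['_id']; }
);
print "total id count: " . $total . "\n";
```

With a real client, the same loop shape should apply: handle `$response['hits']['hits']` from `search()` first, then keep calling `scroll()` until a page comes back empty.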

Thanks @Ramdev. Yeah, I realized that after a little digging. For anyone else, here's what ended up working for me:

$json = '{"query":{"bool":{"must":[{"match_all":{}}]}}}';
$count = 0;
$params = [
    "scroll" => "1m",
    "size" => 50,
    "index" => "myindex",
    "type" => "mytype",
    "body" => $json 
];

$results = $client->search($params);
$scroll_size = $results['hits']['total']; // returns total docs that match query
$s_id = $results['_scroll_id'];

print " total results:   " . $scroll_size;

// first set of results, returned by the initial search itself
$first_size = sizeof($results['hits']['hits']);
for ($i=0; $i<$first_size; $i++) {
    $count++;
}
//scroll
while ($scroll_size > 0) {
    print "  SCROLLING...";
    $scroll_results = $client->scroll([
        'scroll_id' => $s_id,
        'scroll' => '1m'
    ]);

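    // get number of results returned in the last scroll
    $scroll_size = sizeof($scroll_results['hits']['hits']);
    // always continue with the most recently returned scroll id
    $s_id = $scroll_results['_scroll_id'];
    print "  scroll size: " . $scroll_size;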

    // do something with results
    for ($i=0; $i<$scroll_size; $i++) {
        $count++;
    }
}
print " total id count: " . $count;
