AWS SDK CloudSearch pagination

I'm using the AWS SDK for PHP to communicate with CloudSearch. According to this post, pagination can be done with either the cursor or the start parameter. But when you have more than 10,000 hits, you can't use start.

When using start, I can specify ['start' => 1000, 'size' => 100] to jump directly to offset 1000 (the 11th page of 100 results).
How can I get to the 1000th page (or any other arbitrary page) using cursor? Is there a way to calculate this parameter?

I would LOVE there to be a better way but here goes...

One thing I've discovered with cursors is that they return the same value for duplicate search requests when seeking on the same data set, so don't think of them as sessions. Whilst your data isn't updating you can effectively cache aspects of your pagination for multiple users to consume.

I've come up with this solution and have tested it with 75,000+ records.

1) Determine whether your start is under the 10K limit. If so, use the non-cursor search. Otherwise, when seeking past 10K, first perform a search with an initial cursor, a size of 10K, and return set to _no_fields. This gives us our starting offset, and returning no fields cuts down how much data we have to consume; we don't need these IDs anyway.

2) Figure out your target offset, and plan how many iterations it will take to position the cursor just before your targeted page of results. I then iterate and cache the results using my request as the cache hash.

For my iteration I start with a 10K block, then reduce the size to 5K and then 1K blocks as I get "closer" to the target offset. This means subsequent paginations use a previous cursor that's a bit closer to the last chunk.

For example, this might look like:

  • Fetch 10000 Records (initial cursor)
  • Fetch 5000 Records
  • Fetch 5000 Records
  • Fetch 5000 Records
  • Fetch 5000 Records
  • Fetch 1000 Records
  • Fetch 1000 Records

This gets me to the block around the 32,000 offset mark. If I then need to get to 33,000, I can use my cached results to get the cursor that returned the previous 1000 and start again from that offset...

  • Fetch 10000 Records (cached)
  • Fetch 5000 Records (cached)
  • Fetch 5000 Records (cached)
  • Fetch 5000 Records (cached)
  • Fetch 5000 Records (cached)
  • Fetch 1000 Records (cached)
  • Fetch 1000 Records (cached)
  • Fetch 1000 Records (works using cached cursor)
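The block sizes in the two lists above can be derived mechanically. Here is a minimal sketch, assuming a hypothetical helper planSeekChunks() that is not part of the original code: it takes one 10K block first, then 5K blocks, then 1K blocks, then whatever remainder is left. It is a simplified greedy variant of the idea, not the exact loop used in the code further down.

```php
<?php
// Hypothetical helper (not from the original answer): plan the block sizes
// needed to move a cursor from offset 0 to $target.
function planSeekChunks(int $target): array
{
    $chunks = [];
    $offset = 0;
    while ($offset < $target) {
        $remaining = $target - $offset;
        if ($offset === 0) {
            // The first block covers the range that "start" could have reached.
            $size = min(10000, $remaining);
        } elseif ($remaining >= 5000) {
            $size = 5000;
        } elseif ($remaining >= 1000) {
            $size = 1000;
        } else {
            // Final partial block that lands exactly on the target offset.
            $size = $remaining;
        }
        $chunks[] = $size;
        $offset += $size;
    }
    return $chunks;
}

$planFor32k = planSeekChunks(32000);
$planFor33k = planSeekChunks(33000);
```

The plan for 33,000 begins with the same seven requests as the plan for 32,000, so those requests can be served from the cache and only the final 1K fetch has to go to CloudSearch.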

3) Now that we're in the "neighborhood" of your target result offset, you can start specifying page sizes that land just before your destination, and then perform the final search to get your actual page of results.

4) If you add or delete documents from your index you will need a mechanism for invalidating your previous cached results. I've done this by storing a time stamp of when the index was last updated and using that as part of the cache key generation routine.

The cache is the important part: you should build a cache mechanism that uses the request array as your cache key, so the key can be easily created and referenced.
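One way the cache key could be built is sketched below. Everything here is an assumption for illustration: buildCacheKey(), the SearchCache class and the $indexUpdatedAt timestamp are hypothetical names, and the in-memory array stands in for whatever real cache backend you use. The timestamp is folded into the key so that a change to the index makes all older entries unreachable.

```php
<?php
// Hypothetical helper: derive a stable key from the request array plus the
// time the index was last updated (a Unix timestamp you track yourself).
function buildCacheKey(array $request, int $indexUpdatedAt): string
{
    // Sort top-level keys so the same request always hashes the same way.
    ksort($request);
    return 'cs_' . $indexUpdatedAt . '_' . sha1(json_encode($request));
}

// Minimal in-memory stand-in for getCache()/setCache(); swap in a real backend.
final class SearchCache
{
    private static $store = [];

    public static function get(array $request, int $indexUpdatedAt)
    {
        $key = buildCacheKey($request, $indexUpdatedAt);
        return self::$store[$key] ?? null;
    }

    public static function set(array $request, int $indexUpdatedAt, $result): void
    {
        self::$store[buildCacheKey($request, $indexUpdatedAt)] = $result;
    }
}

$request = ['query' => 'star', 'cursor' => 'initial', 'size' => 10000];
SearchCache::set($request, 1700000000, ['hits' => ['cursor' => 'abc']]);
$cached = SearchCache::get($request, 1700000000);   // same index version
$stale  = SearchCache::get($request, 1700000500);   // index changed since
```

Note that ksort() only orders the top-level keys; if your requests contain nested arrays whose order can vary, you would need to normalize those as well.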

For a non-seeded cache this approach is SLOW but if you can warm up the cache and only expire it when there's a change to the indexed documents (and then warm it up again), your users will be unable to tell.

This code sketch assumes 20 items per page. I'd love to work on it further and see how I could make it smarter and more efficient, but the concept is there...

// Build $request here and set $request['start'] to be the offset you want to reach

// Craft getCache() and setCache() functions or methods for cache handling.

// have $cloudSearchClient as your client

if(isset($request['start']) === true and $request['start'] >= 10000)
{
  $originalRequest = $request;
  $cursorSeekTarget = $request['start'];
  $cursorSeekAmount = 10000; // first one should be 10K since there's no pagination under this
  $cursorSeekOffset = 0;
  $request['return'] = '_no_fields';
  $request['cursor'] = 'initial';
  unset($request['start'],$request['facet']);
  // While there is outstanding work to be done...
  while( $cursorSeekAmount > 0 )
  {
    $request['size'] = $cursorSeekAmount;
    // first hit the local cache
    if(empty($result = getCache($request)) === true)
    {
      $result = $cloudSearchClient->Search($request);
      // store the results in the cache
      setCache($request,$result);
    }
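    if(empty($result) === false and empty( $hits = $result->get('hits') ) === false and empty( $hits['hit'] ) === false )
    {
      // prepare the next request with the cursor
      $request['cursor'] = $hits['cursor'];
    }
    else
    {
      // no hits came back, so the target offset is past the end of the results; stop seeking
      break;
    }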
    $cursorSeekOffset = $cursorSeekOffset + $request['size'];
    if($cursorSeekOffset >= $cursorSeekTarget)
    {
      $cursorSeekAmount = 0; // Finished, no more work
    }
    // the first request needs to get 10K, but after that only get 5K
    elseif($cursorSeekAmount >= 10000 and ($cursorSeekTarget - $cursorSeekOffset) > 5000)
    {
      $cursorSeekAmount = 5000;
    }
    elseif(($cursorSeekOffset + $cursorSeekAmount) > $cursorSeekTarget)
    {
      $cursorSeekAmount = $cursorSeekTarget - $cursorSeekOffset;
      // if we still need to seek more than 5K records, limit it back again to 5K
      if($cursorSeekAmount > 5000)
      {
        $cursorSeekAmount = 5000;
      }
      // if we still need to seek more than 1K records, limit it back again to 1K
      elseif($cursorSeekAmount > 1000)
      {
        $cursorSeekAmount = 1000;
      }
    }
  }
  // Restore aspects of the original request (the actual 20 items)
  $request['size'] = 20;
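  // only restore facets if the original request had any (avoids an undefined index notice)
  if(isset($originalRequest['facet']) === true)
  {
    $request['facet'] = $originalRequest['facet'];
  }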
  unset($request['return']); // get the default returns
  if(empty($result = getCache($request)) === true)
  {
    $result = $cloudSearchClient->Search($request);
    setCache($request,$result);
  }
}
else
{
  // No cursor required
  $result = $cloudSearchClient->Search( $request );
}

Please note this was done using a custom AWS client and not the official SDK class, but the request and search structures should be comparable.
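To keep the seek logic independent of any particular client, the search call can be injected as a callable. The following is a minimal sketch under that assumption: seekCursor() and the fake search function are hypothetical and only mimic the hits/cursor/hit shape of a search response, so the loop can be exercised without touching AWS.

```php
<?php
// Hypothetical helper: advance a cursor through the given block sizes.
// $search is any callable that takes a request array and returns a result
// shaped like ['hits' => ['cursor' => string, 'hit' => array]].
function seekCursor(callable $search, array $request, array $chunkSizes): ?string
{
    unset($request['start'], $request['facet']); // cursor and start cannot be combined
    $request['return'] = '_no_fields';           // only the cursor is needed while seeking
    $request['cursor'] = 'initial';

    foreach ($chunkSizes as $size) {
        $request['size'] = $size;
        $result = $search($request);
        $hits = $result['hits'] ?? [];
        if (empty($hits['hit']) || empty($hits['cursor'])) {
            return null; // ran past the end of the result set
        }
        $request['cursor'] = $hits['cursor'];
    }
    return $request['cursor'];
}

// Fake search over 25 documents; its "cursor" is simply the next offset.
$fakeSearch = function (array $request): array {
    $total = 25;
    $offset = $request['cursor'] === 'initial' ? 0 : (int) $request['cursor'];
    $count = max(0, min($request['size'], $total - $offset));
    return ['hits' => [
        'cursor' => (string) ($offset + $count),
        'hit'    => $count > 0 ? array_fill(0, $count, ['id' => 'doc']) : [],
    ]];
};

$cursor  = seekCursor($fakeSearch, ['query' => 'star', 'start' => 20], [10, 5, 5]);
$pastEnd = seekCursor($fakeSearch, ['query' => 'star'], [10, 10, 10, 10]);
```

With the official SDK, the callable could be a thin wrapper such as function (array $r) use ($client) { return $client->search($r); }, assuming $client is an Aws\CloudSearchDomain\CloudSearchDomainClient. I have not run this sketch against a live search domain, and real cursors are opaque strings rather than offsets.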


 