
AWS SDK CloudSearch pagination

I'm using the PHP AWS SDK to communicate with CloudSearch. According to this post, pagination can be done with either the cursor or start parameters. But when you have more than 10,000 hits, you can't use start.

When using start, I can specify ['start' => 1000, 'size' => 100] to get directly to the 10th page.
How do I get to the 1000th page (or any other arbitrary page) using cursor? Is there any way to calculate this parameter?
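
For reference, here's roughly what the two styles look like (a sketch, assuming the SDK v3 CloudSearchDomainClient; the endpoint below is a placeholder):

use Aws\CloudSearchDomain\CloudSearchDomainClient;

$client = new CloudSearchDomainClient([
  'endpoint' => 'https://search-mydomain-xxxxxxxx.us-east-1.cloudsearch.amazonaws.com', // placeholder
  'region'   => 'us-east-1',
  'version'  => '2013-01-01',
]);

// start-based paging: fine until start + size exceeds 10,000
$page = $client->search([
  'query'       => 'matchall',
  'queryParser' => 'structured',
  'start'       => 1000,
  'size'        => 100,
]);

// cursor-based paging: begin with 'initial' and pass the returned
// cursor back on each subsequent request
$result = $client->search([
  'query'       => 'matchall',
  'queryParser' => 'structured',
  'cursor'      => 'initial',
  'size'        => 100,
]);
$nextCursor = $result['hits']['cursor'];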

I would LOVE for there to be a better way, but here goes...

One thing I've discovered with cursors is that they return the same value for duplicate search requests when seeking on the same data set, so don't think of them as sessions. Whilst your data isn't updating, you can effectively cache aspects of your pagination for multiple users to consume.

I've come up with this solution and have tested it with 75,000+ records.

1) Determine whether your start is going to be under the 10K limit; if so, use the non-cursor search. Otherwise, when seeking past 10K, first perform a search with an initial cursor and a size of 10K, returning _no_fields. This gives us our starting offset, and skipping the fields reduces how much data we have to consume; we don't need those IDs anyway.
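
As a sketch, using the same request-array shape as the code at the bottom ($userRequest is just a placeholder for the search you were about to run):

$seekRequest = $userRequest;                 // placeholder: the search you were about to run
$seekRequest['cursor'] = 'initial';          // open a cursor at the very first hit
$seekRequest['size']   = 10000;              // consume the first 10K hits in one call
$seekRequest['return'] = '_no_fields';       // no document fields needed, only the next cursor
unset($seekRequest['start'], $seekRequest['facet']); // start and cursor are mutually exclusive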

2) Figure out your target offset, and plan how many iterations it will take to position the cursor just before your targeted page of results. I then iterate and cache the results, using my request as the cache hash.

For my iteration, I started with 10K blocks, then reduced the size to 5K and then 1K blocks as I got "closer" to the target offset. This means subsequent paginations use a previous cursor that's a bit closer to the last chunk (a helper sketching this schedule follows the two lists below).

e.g. what this might look like is:

  • Fetch 10000 Records (initial cursor)
  • Fetch 5000 Records
  • Fetch 5000 Records
  • Fetch 5000 Records
  • Fetch 5000 Records
  • Fetch 1000 Records
  • Fetch 1000 Records

This will help me get to the block that's around the 32,000 offset mark. If I then need to get to 33,000, I can use my cached results to get the cursor that returned the previous 1,000 and start again from that offset...

  • Fetch 10000 Records (cached)
  • Fetch 5000 Records (cached)
  • Fetch 5000 Records (cached)
  • Fetch 5000 Records (cached)
  • Fetch 5000 Records (cached)
  • Fetch 1000 Records (cached)
  • Fetch 1000 Records (cached)
  • Fetch 1000 Records (works using the cached cursor)
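
That shrinking schedule boils down to a small helper like this (just a sketch; the 5K/1K cut-over points are my own choices, not anything CloudSearch mandates):

function nextSeekSize($remaining, $isFirstBlock)
{
  if($isFirstBlock === true)
  {
    return 10000; // we only seek when start >= 10K, so the first block is always the full 10K
  }
  if($remaining > 5000)
  {
    return 5000; // big strides while we're far from the target
  }
  if($remaining > 1000)
  {
    return 1000; // smaller strides near the target keep cached cursors reusable
  }
  return $remaining; // final partial block lands the cursor exactly on the target
}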

3) Now that we're in the "neighborhood" of your target result offset, you can start specifying page sizes to get just before your destination, and then you perform the final search to get your actual page of results.

4) If you add or delete documents from your index, you will need a mechanism for invalidating your previously cached results. I've done this by storing a timestamp of when the index was last updated and using that as part of the cache-key generation routine.

What is important is the cache aspect: you should build a cache mechanism that uses the request array as your cache hash key so it can be easily created/referenced.

For a non-seeded cache this approach is SLOW, but if you can warm up the cache and only expire it when there's a change to the indexed documents (and then warm it up again), your users will be unable to tell.
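
For example, a minimal getCache() / setCache() pair might look like this (just a sketch assuming APCu is available; getIndexUpdatedAt() is a placeholder for however you track the last-update timestamp from step 4):

function cacheKey(array $request)
{
  // Hash the full request array together with the index's last-update
  // timestamp, so any change to the index invalidates every cached page.
  return 'cs_' . md5(json_encode($request) . '|' . getIndexUpdatedAt());
}

function getCache(array $request)
{
  $value = apcu_fetch(cacheKey($request), $success);
  return $success === true ? $value : null;
}

function setCache(array $request, $result)
{
  apcu_store(cacheKey($request), $result);
}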

This code idea works on 20 items per page. I'd love to work on this and see how I could code it smarter/more efficiently, but the concept is there...

// Build $request here and set $request['start'] to the offset you want to reach.

// Craft getCache() and setCache() functions or methods for cache handling
// (a sketch is given above).

// Have $cloudSearchClient as your client.

if(isset($request['start']) === true and $request['start'] >= 10000)
{
  $originalRequest = $request;
  $cursorSeekTarget = $request['start'];
  $cursorSeekAmount = 10000; // the first block should be 10K, since there's no pagination issue under this
  $cursorSeekOffset = 0;
  $request['return'] = '_no_fields'; // we only need the cursor back, not the documents
  $request['cursor'] = 'initial';
  unset($request['start'], $request['facet']); // start and cursor are mutually exclusive
  // While there is outstanding seek work to be done...
  while($cursorSeekAmount > 0)
  {
    $request['size'] = $cursorSeekAmount;
    // First hit the local cache...
    if(empty($result = getCache($request)) === true)
    {
      // ...and fall back to a live search, storing the result in the cache.
      $result = $cloudSearchClient->Search($request);
      setCache($request, $result);
    }
    if(empty($result) === false and empty($hits = $result->get('hits')) === false and empty($hits['hit']) === false)
    {
      // Prepare the next request with the returned cursor.
      $request['cursor'] = $hits['cursor'];
    }
    $cursorSeekOffset = $cursorSeekOffset + $request['size'];
    if($cursorSeekOffset >= $cursorSeekTarget)
    {
      $cursorSeekAmount = 0; // finished, no more seek work
    }
    // The first request needs to fetch 10K, but after that only fetch 5K.
    elseif($cursorSeekAmount >= 10000 and ($cursorSeekTarget - $cursorSeekOffset) > 5000)
    {
      $cursorSeekAmount = 5000;
    }
    elseif(($cursorSeekOffset + $cursorSeekAmount) > $cursorSeekTarget)
    {
      $cursorSeekAmount = $cursorSeekTarget - $cursorSeekOffset;
      // If we still need to seek more than 5K records, limit it back to 5K...
      if($cursorSeekAmount > 5000)
      {
        $cursorSeekAmount = 5000;
      }
      // ...and if we still need to seek more than 1K records, limit it back to 1K.
      elseif($cursorSeekAmount > 1000)
      {
        $cursorSeekAmount = 1000;
      }
    }
  }
  // Restore aspects of the original request (the actual 20-item page).
  $request['size'] = 20;
  if(isset($originalRequest['facet']) === true)
  {
    $request['facet'] = $originalRequest['facet'];
  }
  unset($request['return']); // get the default return fields
  if(empty($result = getCache($request)) === true)
  {
    $result = $cloudSearchClient->Search($request);
    setCache($request, $result);
  }
}
else
{
  // No cursor required; a plain start-based search works under 10K.
  $result = $cloudSearchClient->Search($request);
}

Please note this was done using a custom AWS client and not the official SDK class, but the request and search structures should be comparable.
