AWS SDK CloudSearch pagination

Question

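I'm using the PHP AWS SDK to communicate with CloudSearch. According to this post, pagination can be done with either the cursor or the start parameter. But once you have more than 10,000 hits, you can't use start.

Using start, I can specify ['start' => 1000, 'size' => 100] to jump directly to the 10th page.
How can I get to the 1000th page (or any other arbitrary page) using cursor? Is there perhaps a way to calculate this parameter?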

Answer 1

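I'd love for there to be a better way, but here goes...

One thing I've discovered about cursors is that they return the same value for repeated search requests against the same data set, so don't think of them as sessions. As long as your data isn't being updated, you can effectively cache aspects of the pagination for multiple users to consume.

I came up with this solution and tested it against 75,000+ records.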

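1) Determine whether your start will be under the 10K limit. If it is, use a non-cursor search. Otherwise, when seeking past 10K, first perform a search with the initial cursor and a size of 10K, returning _no_fields. This gives us our starting offset; returning no fields cuts down the amount of data we have to consume, and we don't need these IDs anyway.

2) Work out your target offset and plan how many iterations it will take to position the cursor just before the target page of results. I then iterate over my requests and cache each result, using the request as the cache hash.

For my iterations I start with a 10K chunk, then reduce the size to 5K and then 1K chunks as I get closer to the target offset. This means the previous cursor reused by subsequent paginations sits a bit closer to the last chunk.

For example, this might look like: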

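Fetch 10000 records (initial cursor)
Fetch 5000 records
Fetch 5000 records
Fetch 5000 records
Fetch 5000 records
Fetch 1000 records
Fetch 1000 records

This gets me to the chunk around the 32,000 offset mark. If I then need to get to 33,000, I can use my cached results to obtain the cursor returned by the previous 1000 and start again from that offset...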

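Fetch 10000 records (cached)
Fetch 5000 records (cached)
Fetch 5000 records (cached)
Fetch 5000 records (cached)
Fetch 5000 records (cached)
Fetch 1000 records (cached)
Fetch 1000 records (cached)
Fetch 1000 records (works using the cached cursor)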
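
The chunk plan above can be sketched as a small function. This is a simplified illustration, not the exact logic of the code further down: planSeekChunks() is a hypothetical helper name, and the 10K / 5K / 1K thresholds are simply the ones used in the example lists.

```php
<?php
// Hypothetical helper (not part of the AWS SDK): plan the sizes of the
// cursor "seek" requests needed to reach $targetOffset.
function planSeekChunks(int $targetOffset): array
{
    $firstChunk = 10000; // size of the first request, made with the 'initial' cursor

    // Offsets below 10,000 can use `start` directly, so no seeking is needed.
    if ($targetOffset < $firstChunk) {
        return [];
    }

    $chunks = [$firstChunk];
    $remaining = $targetOffset - $firstChunk;

    // Shrink the chunk size as we approach the target, so that cached
    // cursors end up close to the pages users are likely to ask for next.
    while ($remaining > 0) {
        if ($remaining >= 5000) {
            $size = 5000;
        } elseif ($remaining >= 1000) {
            $size = 1000;
        } else {
            $size = $remaining; // final short hop onto the exact offset
        }
        $chunks[] = $size;
        $remaining -= $size;
    }

    return $chunks;
}

echo implode(', ', planSeekChunks(32000)), PHP_EOL;
// prints "10000, 5000, 5000, 5000, 5000, 1000, 1000"
```

Planning for 33,000 yields the same seven chunks plus one more 1000, which is why every request except the last one can be served from the cache.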

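3) Now that we're in the "neighborhood" of the target result offset, you can specify a chunk size that lands just before your destination, then perform the final search to get the actual page of results.

4) If you add or delete documents from your index, you will need a mechanism for invalidating your previously cached results. I did this by storing a timestamp of when the index was last updated and using it as part of the cache key generation routine.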

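What matters most is the caching. You should build a cache mechanism that uses the request array as its cache hash key, so entries can be easily created and referenced.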
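
A minimal sketch of such a cache, assuming the key is a hash of the request array combined with the index's last-update timestamp. SearchCache and its methods are hypothetical names, and the in-memory array stands in for whatever shared store (Memcached, Redis, files) you would really use so that several users can benefit.

```php
<?php
// Hypothetical cache wrapper; the names are illustrative, not SDK APIs.
final class SearchCache
{
    /** @var array<string, mixed> cached results keyed by request hash */
    private $items = [];

    /** @var int timestamp of the last index update (assumed to be tracked elsewhere) */
    private $indexUpdatedAt;

    public function __construct(int $indexUpdatedAt)
    {
        $this->indexUpdatedAt = $indexUpdatedAt;
    }

    public function key(array $request): string
    {
        // Sort top-level keys so the same parameters in a different
        // order still map to the same cache entry.
        ksort($request);

        // Mixing in the timestamp means older entries are simply never
        // looked up again once the index changes.
        return sha1($this->indexUpdatedAt . '|' . json_encode($request));
    }

    /** @return mixed|null the cached result, or null on a miss */
    public function get(array $request)
    {
        return $this->items[$this->key($request)] ?? null;
    }

    public function set(array $request, $result): void
    {
        $this->items[$this->key($request)] = $result;
    }
}

// Usage: the timestamp and the cursor value here are made-up examples.
$cache = new SearchCache(1433509072);
$cache->set(['query' => 'php', 'cursor' => 'initial', 'size' => 10000], ['hits' => ['cursor' => 'abc']]);
$hit = $cache->get(['size' => 10000, 'query' => 'php', 'cursor' => 'initial']);
var_dump($hit !== null);
```

Because the timestamp is part of the key, nothing has to be deleted explicitly when documents are added or removed; the price is that stale entries linger until the store evicts them.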

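On an unseeded cache this approach is slow, but if you can warm the cache up and only expire it when the indexed documents change (then warm it again), your users won't be able to tell.

This code idea works on 20 items per page. I'd love to work on it and see how I could code it smarter and more efficiently, but the concept is there...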

// Build $request here and set $request['start'] to be the offset you want to reach

// Craft getCache() and setCache() functions or methods for cache handling.

// have $cloudSearchClient as your client

if(isset($request['start']) === true and $request['start'] >= 10000)
{
  $originalRequest = $request;
  $cursorSeekTarget = $request['start'];
  $cursorSeekAmount = 10000; // first one should be 10K since there's no pagination under this
  $cursorSeekOffset = 0;
  $request['return'] = '_no_fields';
  $request['cursor'] = 'initial';
  unset($request['start'],$request['facet']);
  // While there is outstanding work to be done...
  while( $cursorSeekAmount > 0 )
  {
    $request['size'] = $cursorSeekAmount;
    // first hit the local cache
    if(empty($result = getCache($request)) === true)
    {
      $result = $cloudSearchClient->Search($request);
      // store the results in the cache
      setCache($request,$result);
    }
    if(empty($result) === false and empty( $hits = $result->get('hits') ) === false and empty( $hits['hit'] ) === false )
    {
      // prepare the next request with the cursor
      $request['cursor'] = $hits['cursor'];
    }
    $cursorSeekOffset = $cursorSeekOffset + $request['size'];
    if($cursorSeekOffset >= $cursorSeekTarget)
    {
      $cursorSeekAmount = 0; // Finished, no more work
    }
    // the first request needs to get 10K, but after that only get 5K
    elseif($cursorSeekAmount >= 10000 and ($cursorSeekTarget - $cursorSeekOffset) > 5000)
    {
      $cursorSeekAmount = 5000;
    }
    elseif(($cursorSeekOffset + $cursorSeekAmount) > $cursorSeekTarget)
    {
      $cursorSeekAmount = $cursorSeekTarget - $cursorSeekOffset;
      // if we still need to seek more than 5K records, limit it back again to 5K
      if($cursorSeekAmount > 5000)
      {
        $cursorSeekAmount = 5000;
      }
      // if we still need to seek more than 1K records, limit it back again to 1K
      elseif($cursorSeekAmount > 1000)
      {
        $cursorSeekAmount = 1000;
      }
    }
  }
  // Restore aspects of the original request (the actual 20 items)
  $request['size'] = 20;
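  // only restore facets if the original request actually had them
  if(isset($originalRequest['facet']) === true)
  {
    $request['facet'] = $originalRequest['facet'];
  }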
  unset($request['return']); // get the default returns
  if(empty($result = getCache($request)) === true)
  {
    $result = $cloudSearchClient->Search($request);
    setCache($request,$result);
  }
}
else
{
  // No cursor required
  $result = $cloudSearchClient->Search( $request );
}

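Note that this was done using a custom AWS client rather than the official SDK class, but the request and search structures should be comparable.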

Solution 1
0 2015-06-05 12:57:52
