AWS SDK CloudSearch pagination

Question

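I'm using the PHP AWS SDK to communicate with CloudSearch. According to this post, pagination can be done with either the cursor or the start parameter. But once you have more than 10,000 hits, you can't use start.

Using start, I can specify ['start' => 1000, 'size' => 100] to jump directly to page 10.
How do I get to page 1000 (or any other arbitrary page) using cursor? Maybe there is a way to calculate this parameter?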

Answer 1

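I would love there to be a better way, but here goes...

One thing I've found with cursors is that they return the same value for repeated search requests when seeking over the same data set, so don't think of them as sessions. As long as your data isn't being updated, you can effectively cache aspects of your pagination for multiple users to consume.

I came up with this solution and tested it with 75,000+ records.

1) Determine whether your start will fall under the 10K limit. If so, use the non-cursor search. Otherwise, when seeking past 10K, first perform a search with the initial cursor and a size of 10K, and return _no_fields. This gives us our starting offset. Returning no fields cuts down the amount of data we have to consume, and we don't need these IDs anyway.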
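The decision in step 1 could be sketched roughly as follows. The function name buildFirstRequest and the constant CURSOR_THRESHOLD are made up for illustration and are not part of any SDK; the request keys (start, size, cursor, return, facet) are the ones used in the listing further down.

```php
<?php
// Hypothetical helper, not part of the AWS SDK: decide between plain
// start/size pagination and the first cursor-based seek request.
const CURSOR_THRESHOLD = 10000; // offsets at or beyond this need a cursor

function buildFirstRequest(array $request): array
{
    $start = isset($request['start']) ? (int) $request['start'] : 0;
    if ($start < CURSOR_THRESHOLD) {
        // Under the limit: the request can be sent unchanged.
        return $request;
    }
    // Past the limit: drop 'start' (and facets, which the seek does not
    // need), then ask for the first 10K hits with no field data.
    unset($request['start'], $request['facet']);
    $request['cursor'] = 'initial';
    $request['size']   = CURSOR_THRESHOLD;
    $request['return'] = '_no_fields';
    return $request;
}
```

The returned array is only the first request of the seek; the later, smaller chunks are described in step 2.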

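2) Figure out your target offset, and plan how many iterations it will take to position the cursor just before your target page of results. I then iterate, caching each result with the request itself as the cache hash.

For my iterations I start with a 10K chunk, then drop to 5K chunks and then 1K chunks as I get closer to the target offset. This means that later paginations can reuse a previous cursor that sits close to the last chunk.

For example, this could look like:

Fetch 10000 records (initial cursor)
Fetch 5000 records
Fetch 5000 records
Fetch 5000 records
Fetch 5000 records
Fetch 1000 records
Fetch 1000 records

This gets me to the block around the 32,000 offset mark. If I then need to reach 33,000, I can use my cached results to get the cursor returned by the previous 1000 and start again from that offset...

Fetch 10000 records (cached)
Fetch 5000 records (cached)
Fetch 5000 records (cached)
Fetch 5000 records (cached)
Fetch 5000 records (cached)
Fetch 1000 records (cached)
Fetch 1000 records (cached)
Fetch 1000 records (works using the cached cursor)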
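
The chunk plan above can be computed up front. The following is a simplified sketch, not the exact loop from the listing below: planSeekChunks is a hypothetical name, and the thresholds (10K first, then 5K, then 1K) are taken from the description above.

```php
<?php
// Hypothetical planner: list the chunk sizes needed to move a cursor
// from offset 0 to $target. Assumes $target is 10,000 or more.
function planSeekChunks(int $target): array
{
    $chunks = [];
    $offset = 0;
    while ($offset < $target) {
        $remaining = $target - $offset;
        if ($offset === 0) {
            $size = min(10000, $remaining); // first hop: one 10K chunk
        } elseif ($remaining >= 5000) {
            $size = 5000;                   // far away: 5K chunks
        } elseif ($remaining >= 1000) {
            $size = 1000;                   // getting close: 1K chunks
        } else {
            $size = $remaining;             // final partial chunk
        }
        $chunks[] = $size;
        $offset  += $size;
    }
    return $chunks;
}

print_r(planSeekChunks(32000));
```

Because the plan for 33,000 starts with the same chunks as the plan for 32,000, every request but the last one is a cache hit once the first seek has run.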

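3) Now that we're in the "neighborhood" of the target result offset, you can start choosing chunk sizes that land just before your destination. Then perform the final search to get the actual page of results.

4) If you add or delete documents in your index, you will need a mechanism for invalidating your previously cached results. I've done this by storing a timestamp of when the index was last updated and using it as part of the cache key generation routine.

The important part is the caching: you should build a cache mechanism that uses the request array as the cache hash key, so that entries can be created and looked up easily.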
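
A minimal sketch of such a cache, assuming an in-memory array in place of a real store such as Memcached or Redis. The names cacheKey, getCache and setCache, and the extra $indexUpdatedAt parameter, are illustrative; the listing further down calls getCache() and setCache() with the request only, so the timestamp would have to be supplied some other way there.

```php
<?php
// Hypothetical in-memory cache, for illustration only.
$GLOBALS['searchCache'] = [];

// Build a key from the request array plus the time the index last changed,
// so entries written before an index update are never read again.
function cacheKey(array $request, int $indexUpdatedAt): string
{
    ksort($request); // top-level key order must not change the key
    return sha1($indexUpdatedAt . '|' . json_encode($request));
}

function getCache(array $request, int $indexUpdatedAt)
{
    $key = cacheKey($request, $indexUpdatedAt);
    return isset($GLOBALS['searchCache'][$key]) ? $GLOBALS['searchCache'][$key] : null;
}

function setCache(array $request, $result, int $indexUpdatedAt): void
{
    $GLOBALS['searchCache'][cacheKey($request, $indexUpdatedAt)] = $result;
}
```

Since stale entries are simply never looked up again, a real store would also want an expiry time so they get cleaned out eventually.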

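With an unseeded cache this approach is SLOW, but if you can warm up the cache and only expire it when the indexed documents change (and then warm it up again), your users won't be able to tell.

This code assumes 20 items per page. I'd love to work on this further and see how I could write it smarter and more efficiently, but the concept is there...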

// Build $request here and set $request['start'] to be the offset you want to reach

// Craft getCache() and setCache() functions or methods for cache handling.

// have $cloudSearchClient as your client

if(isset($request['start']) === true and $request['start'] >= 10000)
{
  $originalRequest = $request;
  $cursorSeekTarget = $request['start'];
  $cursorSeekAmount = 10000; // first one should be 10K since there's no pagination under this
  $cursorSeekOffset = 0;
  $request['return'] = '_no_fields';
  $request['cursor'] = 'initial';
  unset($request['start'],$request['facet']);
  // While there is outstanding work to be done...
  while( $cursorSeekAmount > 0 )
  {
    $request['size'] = $cursorSeekAmount;
    // first hit the local cache
    if(empty($result = getCache($request)) === true)
    {
      $result = $cloudSearchClient->Search($request);
      // store the results in the cache
      setCache($request,$result);
    }
    if(empty($result) === false and empty( $hits = $result->get('hits') ) === false and empty( $hits['hit'] ) === false )
    {
      // prepare the next request with the cursor
      $request['cursor'] = $hits['cursor'];
    }
    $cursorSeekOffset = $cursorSeekOffset + $request['size'];
    if($cursorSeekOffset >= $cursorSeekTarget)
    {
      $cursorSeekAmount = 0; // Finished, no more work
    }
    // the first request needs to get 10K, but after that only get 5K
    elseif($cursorSeekAmount >= 10000 and ($cursorSeekTarget - $cursorSeekOffset) > 5000)
    {
      $cursorSeekAmount = 5000;
    }
    elseif(($cursorSeekOffset + $cursorSeekAmount) > $cursorSeekTarget)
    {
      $cursorSeekAmount = $cursorSeekTarget - $cursorSeekOffset;
      // if we still need to seek more than 5K records, limit it back again to 5K
      if($cursorSeekAmount > 5000)
      {
        $cursorSeekAmount = 5000;
      }
      // if we still need to seek more than 1K records, limit it back again to 1K
      elseif($cursorSeekAmount > 1000)
      {
        $cursorSeekAmount = 1000;
      }
    }
  }
  // Restore aspects of the original request (the actual 20 items)
  $request['size'] = 20;
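  // Restore facets only if the original request had any
  if(isset($originalRequest['facet']) === true)
  {
    $request['facet'] = $originalRequest['facet'];
  }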
  unset($request['return']); // get the default returns
  if(empty($result = getCache($request)) === true)
  {
    $result = $cloudSearchClient->Search($request);
    setCache($request,$result);
  }
}
else
{
  // No cursor required
  $result = $cloudSearchClient->Search( $request );
}

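Note that this was done with a custom AWS client rather than the official SDK class, but the request and search structures should be comparable.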

Answer 1, posted 2015-06-05 12:57:52