Elasticsearch: get the tf-idf of every term in a given document

Question

I have a document in my Elasticsearch index with the following ID: AVosj8FEIaetdb3CXpP-. I am trying to get the tf-idf of every word in one of its fields. I ran the following:

GET /cnn/cnn_article/AVosj8FEIaetdb3CXpP-/_termvectors
{
  "fields" : ["author_wording"],
  "term_statistics" : true,
  "field_statistics" : true
}

The response I got was:

{
  "_index": "dailystormer",
  "_type": "dailystormer_article",
  "_id": "AVosj8FEIaetdb3CXpP-",
  "_version": 3,
  "found": true,
  "took": 1,
  "term_vectors": {
    "author_wording": {
      "field_statistics": {
        "sum_doc_freq": 3408583,
        "doc_count": 16111,
        "sum_ttf": 7851321
      },
      "terms": {
        "318": {
          "doc_freq": 4,
          "ttf": 4,
          "term_freq": 1,
          "tokens": [
            {
              "position": 121,
              "start_offset": 688,
              "end_offset": 691
            }
          ]
        },
        "742": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 122,
              "start_offset": 692,
              "end_offset": 695
            }
          ]
        },
        "9971": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 123,
              "start_offset": 696,
              "end_offset": 700
            }
          ]
        },
        "a": {
          "doc_freq": 14921,
          "ttf": 163268,
          "term_freq": 11,
          "tokens": [
            {
              "position": 1,
              "start_offset": 13,
              "end_offset": 14
            },
            ...
            "you’re": {
          "doc_freq": 1112,
          "ttf": 1647,
          "term_freq": 1,
          "tokens": [
            {
              "position": 80,
              "start_offset": 471,
              "end_offset": 477
            }
          ]
        }
      }
    }
  }
}

It returns some interesting fields, such as the term frequency (tf), but not the tf-idf. Should I compute it myself? Is that a good idea? How can I do it?

Answer 1

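Yes, it gives you back tf, the term frequency. The response carries two term frequencies for this field: term_freq, the count within this document, and ttf, the total term frequency, i.e. the sum of the term's tf over all documents. It also gives you df, the document frequency (doc_freq in the response). You need to decide whether you want the tf-idf computed over your field only or over all fields. To calculate tf-idf, do the following: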

tf-idf = tf * idf

where

idf = log (N / df)

N is available in your response as doc_count. Elasticsearch does not provide an implementation that calculates tf-idf for you, so you need to do it yourself.
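
The formula above can be applied directly to the parsed JSON of a _termvectors response. The following is a minimal sketch, not an official Elasticsearch feature: the function name tf_idf_from_termvectors is hypothetical, the natural logarithm is an assumption (the answer does not state a base), and the sample dictionary is a trimmed copy of the response in the question. It only works when the request set "term_statistics" to true, because doc_freq is absent otherwise.

```python
import math


def tf_idf_from_termvectors(response, field):
    """Return {term: tf * log(N / df)} for one field of a _termvectors response.

    Hypothetical helper: N is field_statistics.doc_count, tf is term_freq,
    df is doc_freq. The natural log is an assumption.
    """
    field_data = response["term_vectors"][field]
    n_docs = field_data["field_statistics"]["doc_count"]  # N in the formula
    scores = {}
    for term, stats in field_data["terms"].items():
        tf = stats["term_freq"]  # occurrences in this document's field
        df = stats["doc_freq"]   # documents (on this shard) containing the term
        scores[term] = tf * math.log(n_docs / df)
    return scores


# Trimmed copy of the response from the question (token positions omitted).
sample = {
    "term_vectors": {
        "author_wording": {
            "field_statistics": {
                "sum_doc_freq": 3408583,
                "doc_count": 16111,
                "sum_ttf": 7851321,
            },
            "terms": {
                "318": {"doc_freq": 4, "ttf": 4, "term_freq": 1},
                "742": {"doc_freq": 1, "ttf": 1, "term_freq": 1},
                "a": {"doc_freq": 14921, "ttf": 163268, "term_freq": 11},
            },
        }
    }
}

scores = tf_idf_from_termvectors(sample, "author_wording")

# Print terms from most to least distinctive for this document.
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}\t{score:.3f}")
```

With these numbers the rare terms "742" and "318" rank far above "a": "a" occurs 11 times in the document, but it also appears in almost every document, so its idf is close to zero.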

Answer 2

You can use this API:

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html

{
   "_index": "imdb",
   "_type": "_doc",
   "_version": 0,
   "found": true,
   "term_vectors": {
      "plot": {
         "field_statistics": {
            "sum_doc_freq": 3384269,
            "doc_count": 176214,
            "sum_ttf": 3753460
         },
         "terms": {
            "armored": {
               "doc_freq": 27,
               "ttf": 27,
               "term_freq": 1,
               "score": 9.74725
            },
            "industrialist": {
               "doc_freq": 88,
               "ttf": 88,
               "term_freq": 1,
               "score": 8.590818
            },
            "stark": {
               "doc_freq": 44,
               "ttf": 47,
               "term_freq": 1,
               "score": 9.272792
            }
         }
      }
   }
}

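term_freq - term frequency. The number of times a term appears in the field of one specific document.

doc_freq - document frequency. The number of documents a term appears in.

ttf - total term frequency. The number of times the term appears in all documents, i.e. the sum of tf over all documents. Computed per field.

df and ttf are computed per shard, so these numbers can vary depending on the shard the current document resides in.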

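How is the score calculated?

The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than being something easily understood by end users. The scores are derived from the doc frequencies in foreground and background sets. In brief, a term is considered significant if there is a noticeable difference between the frequency with which it appears in the subset and in the background. The way the terms are ranked can be configured; see the "Parameters" section.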
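
The sample response above also offers a way to check how its score values relate to tf-idf. The sketch below assumes an idf of the form 1 + ln(N / (df + 1)), which is the formula used by Lucene's classic TF-IDF similarity; that choice is an assumption and is not stated in the answer. The helper name classic_idf is hypothetical. Since term_freq is 1 for all three terms, this sample cannot show how tf enters the score for larger counts.

```python
import math


def classic_idf(doc_count, doc_freq):
    """Assumed idf formula: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))


DOC_COUNT = 176214  # field_statistics.doc_count in the sample response

# term -> (doc_freq, score reported in the sample response)
reported = {
    "armored": (27, 9.74725),
    "industrialist": (88, 8.590818),
    "stark": (44, 9.272792),
}

for term, (doc_freq, score) in reported.items():
    computed = classic_idf(DOC_COUNT, doc_freq)
    print(f"{term}: computed={computed:.6f} reported={score}")
```

Under that assumption the computed values agree with the reported scores to within rounding, which suggests that the score in this particular response is a tf-idf style weight for each term.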

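Keep these definitions in mind:

cluster - An Elasticsearch cluster consists of one or more nodes and is identified by its cluster name.

node - A single Elasticsearch instance. In most environments, each node runs on a separate box or virtual machine.

index - In Elasticsearch, an index is a collection of documents.

shard - Because Elasticsearch is a distributed search engine, an index is usually split into elements known as shards that are distributed across multiple nodes. Elasticsearch manages the arrangement of these shards automatically. It also rebalances the shards as necessary, so users need not worry about the details.

replica - By default, Elasticsearch versions before 7.0 create five primary shards and one replica for each index. This means each index consists of five primary shards, and each shard has one replica.

Allocating multiple shards and replicas is the essence of the design for distributed search capability, providing high availability and quick access to the documents in an index. The main difference between a primary shard and a replica shard is that only the primary shard can accept indexing requests. Both replica and primary shards can serve query requests.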

2 answers

Answer 1
5 accepted 2017-02-14 13:24:05

Answer 2
2 2018-07-06 12:46:53
