Elasticsearch: getting the tf-idf of every term in a given document

I have a document in my Elasticsearch index with the following id: AVosj8FEIaetdb3CXpP-. I'm trying to get the tf-idf of every word in one of its fields. I did the following:

GET /cnn/cnn_article/AVosj8FEIaetdb3CXpP-/_termvectors
{
  "fields" : ["author_wording"],
  "term_statistics" : true,
  "field_statistics" : true
}

The response I've got is:

{
  "_index": "dailystormer",
  "_type": "dailystormer_article",
  "_id": "AVosj8FEIaetdb3CXpP-",
  "_version": 3,
  "found": true,
  "took": 1,
  "term_vectors": {
    "author_wording": {
      "field_statistics": {
        "sum_doc_freq": 3408583,
        "doc_count": 16111,
        "sum_ttf": 7851321
      },
      "terms": {
        "318": {
          "doc_freq": 4,
          "ttf": 4,
          "term_freq": 1,
          "tokens": [
            {
              "position": 121,
              "start_offset": 688,
              "end_offset": 691
            }
          ]
        },
        "742": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 122,
              "start_offset": 692,
              "end_offset": 695
            }
          ]
        },
        "9971": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 123,
              "start_offset": 696,
              "end_offset": 700
            }
          ]
        },
        "a": {
          "doc_freq": 14921,
          "ttf": 163268,
          "term_freq": 11,
          "tokens": [
            {
              "position": 1,
              "start_offset": 13,
              "end_offset": 14
            },
            ...
        "you’re": {
          "doc_freq": 1112,
          "ttf": 1647,
          "term_freq": 1,
          "tokens": [
            {
              "position": 80,
              "start_offset": 471,
              "end_offset": 477
            }
          ]
        }
      }
    }
  }
}

It returns some interesting fields, such as the term frequency (tf), but not the tf-idf. Should I compute it myself? Is that a good idea? How can I do so?

Yes, it returns tf, the term frequency (you get both term_freq, the term frequency within this field of this document, and ttf, the total term frequency, i.e. the sum of the tf values across all documents for this field), and df, the document frequency (doc_freq in the response). You need to decide whether you want to calculate tf-idf over this field only or over all fields. To compute tf-idf, do the following:

tf-idf = tf * idf

where

idf = log (N / df)

and N = doc_count from your response (the number of documents that contain this field). The response to this request does not include a tf-idf value, so you need to compute it yourself.
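The formula above can be sketched in Python as follows. This is only a minimal illustration, assuming the response has already been parsed into a dictionary; the function name `tf_idf_from_term_vectors` is hypothetical, and the sample data is a trimmed copy of the response in the question (token positions omitted).

```python
import math


def tf_idf_from_term_vectors(response, field):
    """Return {term: tf-idf} for one field of a parsed _termvectors response.

    Uses tf-idf = tf * log(N / df), with N taken from field_statistics.doc_count.
    Requires the request to have set term_statistics and field_statistics to true.
    """
    field_data = response["term_vectors"][field]
    n_docs = field_data["field_statistics"]["doc_count"]
    scores = {}
    for term, stats in field_data["terms"].items():
        tf = stats["term_freq"]   # occurrences in this document's field
        df = stats["doc_freq"]    # number of documents containing the term
        scores[term] = tf * math.log(n_docs / df)
    return scores


# Trimmed copy of the response from the question (hypothetical variable name).
sample_response = {
    "term_vectors": {
        "author_wording": {
            "field_statistics": {
                "sum_doc_freq": 3408583,
                "doc_count": 16111,
                "sum_ttf": 7851321,
            },
            "terms": {
                "318": {"doc_freq": 4, "ttf": 4, "term_freq": 1},
                "742": {"doc_freq": 1, "ttf": 1, "term_freq": 1},
                "a": {"doc_freq": 14921, "ttf": 163268, "term_freq": 11},
            },
        }
    }
}

scores = tf_idf_from_term_vectors(sample_response, "author_wording")
# Print terms from most to least distinctive for this document.
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}: {score:.3f}")
```

Rare terms such as "742" come out with a high weight, while a very common term such as "a" ends up close to zero even though it occurs 11 times in the document.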

You can use this API:

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html

{
   "_index": "imdb",
   "_type": "_doc",
   "_version": 0,
   "found": true,
   "term_vectors": {
      "plot": {
         "field_statistics": {
            "sum_doc_freq": 3384269,
            "doc_count": 176214,
            "sum_ttf": 3753460
         },
         "terms": {
            "armored": {
               "doc_freq": 27,
               "ttf": 27,
               "term_freq": 1,
               "score": 9.74725
            },
            "industrialist": {
               "doc_freq": 88,
               "ttf": 88,
               "term_freq": 1,
               "score": 8.590818
            },
            "stark": {
               "doc_freq": 44,
               "ttf": 47,
               "term_freq": 1,
               "score": 9.272792
            }
         }
      }
   }
}
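
As far as I know, a `score` value per term like the one in this response is returned when the request uses the `filter` parameter of the term vectors API, which keeps only the top terms ranked by tf-idf. Below is a minimal sketch of such a request body, built as a Python dictionary. The index name `imdb` and the field `plot` are taken from the response above; the document text and the threshold values are placeholders chosen for illustration.

```python
import json

# Request body for the term vectors API with term filtering enabled.
# "doc" supplies an artificial document instead of a stored document id;
# the text below is a placeholder, not the text behind the response above.
request_body = {
    "doc": {"plot": "placeholder plot text, replace with real content"},
    "term_statistics": True,
    "field_statistics": True,
    "positions": False,
    "offsets": False,
    "filter": {
        "max_num_terms": 3,   # keep only the 3 highest-scoring terms
        "min_term_freq": 1,   # ignore terms rarer than this in the document
        "min_doc_freq": 1,    # ignore terms found in fewer documents than this
    },
}

# Print the request in the console style used elsewhere in this thread.
print("GET /imdb/_termvectors")
print(json.dumps(request_body, indent=2))
```

The printed JSON can be pasted into the Kibana console or sent with any HTTP client.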

term_freq - term frequency. The number of times a term appears in a field in one specific document.

doc_freq - document frequency. The number of documents a term appears in.

ttf - total term frequency. The number of times this term appears in all documents, that is, the sum of tf over all documents. Computed per field.

df and ttf are computed per shard and therefore these numbers can vary depending on the shard the current document resides in.
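
As a sanity check on these definitions, the `score` values in the sample response above turn out to be numerically consistent with a classic Lucene-style tf-idf, tf * (1 + ln((N + 1) / (df + 1))), with N = doc_count. This is an observation from the sample numbers, not something the response itself documents. Every term in the sample has term_freq = 1, so the sample cannot tell whether tf enters the score directly or as a square root. A small sketch that recomputes the scores (the function name `classic_tf_idf` is hypothetical):

```python
import math

# Values copied from the sample response above.
doc_count = 176214
terms = {
    "armored": {"doc_freq": 27, "term_freq": 1, "score": 9.74725},
    "industrialist": {"doc_freq": 88, "term_freq": 1, "score": 8.590818},
    "stark": {"doc_freq": 44, "term_freq": 1, "score": 9.272792},
}


def classic_tf_idf(term_freq, doc_freq, n_docs):
    # Assumed formula: tf * (1 + ln((N + 1) / (df + 1))).
    return term_freq * (1.0 + math.log((n_docs + 1) / (doc_freq + 1)))


# Collect the absolute difference between recomputed and reported scores.
differences = {}
for term, stats in terms.items():
    computed = classic_tf_idf(stats["term_freq"], stats["doc_freq"], doc_count)
    differences[term] = abs(computed - stats["score"])
    print(f"{term}: computed={computed:.5f} reported={stats['score']}")
```

The small remaining differences are what one would expect from single-precision rounding on the server side.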

How are the scores calculated?

The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than something easily understood by end users. The scores are derived from the doc frequencies in foreground and background sets. In brief, a term is considered significant if there is a noticeable difference in the frequency in which a term appears in the subset and in the background. The way the terms are ranked can be configured, see "Parameters" section.

Remember these definitions:

cluster – An Elasticsearch cluster consists of one or more nodes and is identifiable by its cluster name.

node – A single Elasticsearch instance. In most environments, each node runs on a separate box or virtual machine.

index – In Elasticsearch, an index is a collection of documents.

shard – Because Elasticsearch is a distributed search engine, an index is usually split into elements known as shards that are distributed across multiple nodes. Elasticsearch automatically manages the arrangement of these shards. It also rebalances the shards as necessary, so users need not worry about the details.

replica – A copy of a primary shard. In Elasticsearch versions before 7.0, each index was created by default with five primary shards and one replica, meaning that each primary shard had one copy. Since 7.0, the default is one primary shard and one replica.

Allocating multiple shards and replicas is the essence of the design for distributed search capability, providing for high availability and quick access in searches against the documents within an index. The main difference between a primary and a replica shard is that only the primary shard can accept indexing requests. Both replica and primary shards can serve querying requests.
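
The number of primary shards and replicas can also be set explicitly when an index is created, through the `number_of_shards` and `number_of_replicas` index settings. A minimal sketch of such a settings body, built in Python; the index name `my-index` and the chosen counts are hypothetical:

```python
import json

primary_shards = 3   # hypothetical value; fixed once the index is created
replicas = 1         # copies per primary shard; can be changed later

index_settings = {
    "settings": {
        "index": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        }
    }
}

# Each primary shard has `replicas` copies, so the cluster holds this many
# shard copies for the index in total.
total_shard_copies = primary_shards * (1 + replicas)

print("PUT /my-index")
print(json.dumps(index_settings, indent=2))
print("total shard copies:", total_shard_copies)
```

Because df and ttf are computed per shard, an index with a single primary shard gives the same term statistics for every document, which can be convenient when computing tf-idf by hand.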

