
ElasticSearch: Index only the fields specified in the mapping

I have an ElasticSearch setup that receives data to index via a CouchDB river. My problem is that most of the fields in the CouchDB documents are not relevant for search: they are fields used internally by the application (IDs and so on), and I do not want false positives because of them. Besides, indexing unneeded data seems to me a waste of resources.

To solve this problem, I have defined a mapping that specifies the fields I want indexed. I am using pyes to access ElasticSearch. The process I follow is:

  1. Create the CouchDB river, associated with an index. This apparently also creates the index, along with a "couchdb" mapping in that index which, as far as I can see, includes all fields, with dynamically assigned types.
  2. Put a mapping, restricting it to the fields I really want to index.

This is the index definition, as obtained by:

curl -XGET 'http://localhost:9200/notes_index/_mapping?pretty=true'

{
  "notes_index" : {
    "default_mapping" : {
      "properties" : {
        "note_text" : {
          "type" : "string"
        }
      }
    },
    "couchdb" : {
      "properties" : {
        "_rev" : {
          "type" : "string"
        },
        "created_at_date" : {
          "format" : "dateOptionalTime",
          "type" : "date"
        },
        "note_text" : {
          "type" : "string"
        },
        "organization_id" : {
          "type" : "long"
        },
        "user_id" : {
          "type" : "long"
        },
        "created_at_time" : {
          "type" : "long"
        }
      }
    }
  }
}

I have two problems:

  • The default "couchdb" mapping is indexing all fields. I do not want this. Is it possible to avoid the creation of that mapping? I am confused, because that mapping seems to be the one that is somehow "connected" to the CouchDB river.
  • The mapping that I create seems to have no effect: no documents are indexed under it.

Do you have any advice on this?

EDIT

This is what I am actually doing, exactly as typed:

server="localhost"

# Create the index
curl -XPUT    "$server:9200/index1"

# Create the mapping
curl -XPUT    "$server:9200/index1/mapping1/_mapping" -d '
{
    "type1" : {
        "properties" : {
            "note_text" : {"type" : "string", "store" : "no"}
        }
    }
}
'

# Configure the river
curl -XPUT "$server:9200/_river/river1/_meta" -d '{
    "type" : "couchdb",
    "couchdb" : {
        "host" : "localhost",
        "port" : 5984,
        "user" : "admin",
        "password" : "admin",
        "db" : "notes"
    },
    "index" : {
        "index" : "index1",
        "type" : "type1"
    }
}'

The documents in index1 still contain fields other than "note_text", which is the only one that I have explicitly mentioned in the mapping definition. Why is that?

The default behavior of the CouchDB river is to use a 'dynamic' mapping, i.e. to index all the fields found in the incoming CouchDB documents. You're right that this can unnecessarily increase the size of the index (your search problems can be solved by excluding some fields from the query).
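The query-side workaround mentioned above can be sketched as a URI search limited to a single field. This is only an illustration: `index1` and `note_text` come from the question, the search term `meeting` is made up, and the actual request is left commented out because it needs a running node.

```shell
server="localhost:9200"
index="index1"
field="note_text"   # the only field that should be searched
term="meeting"      # hypothetical search term, not from the question

# The "field:term" form of the q parameter restricts matching to that field,
# so internal fields such as IDs cannot produce false positives.
search_url="$server/$index/_search?q=$field:$term&pretty=true"
echo "$search_url"

# With a node running, the search could be sent like this:
# curl -XGET "$search_url"
```

This only narrows what a query looks at; the other fields are still indexed and still take up space.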

To use your own mapping instead of the 'dynamic' one, you need to configure the River plugin to use the mapping you've created (see this article):

curl -XPUT 'elasticsearch-host:9200/_river/notes_index/_meta' -d '{
    "type" : "couchdb",

    ... your CouchDB connection configuration ...

    "index" : {
        "index" : "notes_index",
        "type" : "mapping1"
    }
}'

The type name that you specify in the URL when doing the mapping PUT overrides the one that you include in the definition, so the type you are creating is in fact mapping1. Try executing this command to see for yourself:

> curl 'localhost:9200/index1/_mapping?pretty=true'

{
  "index1" : {
    "mapping1" : {
      "properties" : {
        "note_text" : {
          "type" : "string"
        }
      }
    }
  }
}
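One way to avoid this kind of mismatch is to keep the type name in a single variable and build both the mapping URL and the river configuration from it. The following is a sketch under assumptions rather than a tested setup: the names and connection settings are copied from the question, and the `put` helper and `DRY_RUN` variable are hypothetical conveniences that print each request unless `DRY_RUN=0` is set.

```shell
server="localhost:9200"
index="index1"
type="type1"    # the one type name, reused in the mapping URL and the river config

# Hypothetical helper: prints the request by default, sends it when DRY_RUN=0.
put() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf 'PUT %s\n%s\n' "$1" "$2"
    else
        curl -XPUT "$1" -d "$2"
    fi
}

mapping=$(cat <<EOF
{
    "$type" : {
        "properties" : {
            "note_text" : {"type" : "string", "store" : "no"}
        }
    }
}
EOF
)

river=$(cat <<EOF
{
    "type" : "couchdb",
    "couchdb" : {
        "host" : "localhost",
        "port" : 5984,
        "user" : "admin",
        "password" : "admin",
        "db" : "notes"
    },
    "index" : {
        "index" : "$index",
        "type" : "$type"
    }
}
EOF
)

put "$server/$index" ""                          # create the index
put "$server/$index/$type/_mapping" "$mapping"   # type in the URL matches the body
put "$server/_river/river1/_meta" "$river"       # river writes into the same type
```

Running it as is only prints the three requests, which makes it easy to check that `type1` appears in the mapping URL, the mapping body, and the river's "index" section before anything is sent.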

I think that once you get the type name right, it will start working fine.
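If unmapped fields still get indexed after the type name is fixed, that would be consistent with dynamic mapping adding them to the type as documents arrive. Elasticsearch mappings accept a `dynamic` setting, and setting it to `false` tells the type to ignore fields that are not declared under `properties` (they remain in `_source` but are not indexed). The sketch below is an assumption layered on top of the answer, not something the original setup used; the request itself is commented out because it needs a running node.

```shell
server="localhost:9200"
index="index1"
type="type1"

# Same mapping as in the question, plus "dynamic": false so that fields
# not listed under "properties" are not added to the mapping or indexed.
static_mapping=$(cat <<EOF
{
    "$type" : {
        "dynamic" : false,
        "properties" : {
            "note_text" : {"type" : "string", "store" : "no"}
        }
    }
}
EOF
)

echo "$static_mapping"

# With a node running, the mapping could be sent like this:
# curl -XPUT "$server/$index/$type/_mapping" -d "$static_mapping"
```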
