How to modify the index template used by the Nutch index writer for Elasticsearch?

Out of the box, the Nutch index writer for Elasticsearch creates an index in Elasticsearch with the name given in nutch-site.xml (or nutch-default.xml) by the following property element:

   <property> 
     <name>elastic.index</name>
     <value>nutch</value> 
     <description>Default index to send documents to.</description>
   </property>

The mappings section in Elasticsearch for such an automatically generated index always has the following structure:

   {
       "nutch": {
           "mappings": {
               "doc": {
                   "properties": {
                       "anchor": {
                           "type": "string"
                       },
                       "boost": {
                           "type": "string"
                       },
                       "cache": {
                           "type": "string"
                       },
                       "content": {
                           "type": "string"
                       },
                       "contentLength": {
                           "type": "string"
                       },
                       "date": {
                           "type": "date",
                           "format": "dateOptionalTime"
                       },
                       "digest": {
                           "type": "string"
                       },
                       "host": {
                           "type": "string"
                       },
                       "id": {
                           "type": "string"
                       },
                       "lang": {
                           "type": "string"
                       },
                       "lastModified": {
                           "type": "date",
                           "format": "dateOptionalTime"
                       },
                       "segment": {
                           "type": "string"
                       },
                       "title": {
                           "type": "string"
                       },
                       "tstamp": {
                           "type": "date",
                           "format": "dateOptionalTime"
                       },
                       "type": {
                           "type": "string"
                       },
                       "url": {
                           "type": "string"
                       }
                   }
               }
           }
       }
   }
  1. Where is the template for this?
  2. Can it be changed?
  3. If yes, which fields are mandatory and which are optional?
  4. Where can I find more information on this?

Any help appreciated! Thanks, Wolfram

Welcome to Stack Overflow!

Here's my take at your questions:

  1. It doesn't look like Nutch creates any template: the source code for ElasticIndexWriter contains no reference to a template anywhere.

  2. Since Nutch doesn't create any index template, you can't change it... but you can definitely create one yourself directly in your ES cluster, if you want/need to control the mapping of certain fields.

You can start from the default mapping created by Nutch (i.e. the one you pasted in your question) and iterate on that. Creating a template out of it is trivial: add the "template": "nutch*" property (first line of the request body below) and you're good to go (more information on how to change mappings is available here):

curl -XPUT localhost:9200/_template/nutch_template -d '{
  "template": "nutch*",
  "mappings": {
    "doc": {
      "properties": {
        "anchor": {
          "type": "string"
        },
        "boost": {
          "type": "string"
        },
        "cache": {
          "type": "string"
        },
        "content": {
          "type": "string"
        },
        "contentLength": {
          "type": "string"
        },
        "date": {
          "type": "date",
          "format": "dateOptionalTime"
        },
        "digest": {
          "type": "string"
        },
        "host": {
          "type": "string"
        },
        "id": {
          "type": "string"
        },
        "lang": {
          "type": "string"
        },
        "lastModified": {
          "type": "date",
          "format": "dateOptionalTime"
        },
        "segment": {
          "type": "string"
        },
        "title": {
          "type": "string"
        },
        "tstamp": {
          "type": "date",
          "format": "dateOptionalTime"
        },
        "type": {
          "type": "string"
        },
        "url": {
          "type": "string"
        }
      }
    }
  }
}'
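
If you would rather derive the template body from the existing mapping than retype it, the wrapping step can be sketched in Python. This is only an illustration under assumptions: the function name mapping_to_template is made up for this example, and the input is assumed to have the same shape as the mapping response pasted in the question (abbreviated here to two fields).

```python
import json


def mapping_to_template(mapping_response, index_name="nutch", pattern="nutch*"):
    """Wrap the mappings of an existing index in an index template body.

    mapping_response is assumed to look like the JSON in the question:
    {"<index name>": {"mappings": {...}}}.
    """
    mappings = mapping_response[index_name]["mappings"]
    # "template" is the index-name pattern the template applies to,
    # exactly as in the curl command above.
    return {"template": pattern, "mappings": mappings}


# Abbreviated copy of the mapping from the question (two fields only).
existing = {
    "nutch": {
        "mappings": {
            "doc": {
                "properties": {
                    "title": {"type": "string"},
                    "tstamp": {"type": "date", "format": "dateOptionalTime"},
                }
            }
        }
    }
}

template_body = mapping_to_template(existing)
print(json.dumps(template_body, indent=2))
```

The printed JSON is what goes after -d in the curl command. Any per-field changes would be made in the "properties" section before sending it.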

3-4. The Nutch wiki describes all the fields indexed/stored by Nutch, so you can modify the mapping above to store/index certain fields differently to match your exact needs.

Note: make sure to wipe your current nutch index first, then create your template (point 2 above). When Nutch then indexes its first document, the index will be created automatically and the template's mappings will be applied to it.
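
The order of operations in the note can be sketched as follows. Treat it as a rough outline, not a tested procedure: the helper names build_requests and send_all are hypothetical, the host, index name and template name are taken from the examples above, and the template body is a placeholder. Only the Python standard library is used, and nothing is sent unless you call send_all yourself.

```python
import json
import urllib.request


def build_requests(template_body, base_url="http://localhost:9200",
                   index="nutch", template_name="nutch_template"):
    """Return the three HTTP requests in the order described in the note."""
    data = json.dumps(template_body).encode("utf-8")
    # Older Elasticsearch versions do not require this header; it is harmless there.
    headers = {"Content-Type": "application/json"}
    template_url = f"{base_url}/_template/{template_name}"
    return [
        # 1. Wipe the existing index so it can be recreated from the template.
        urllib.request.Request(f"{base_url}/{index}", method="DELETE"),
        # 2. Create (or overwrite) the template.
        urllib.request.Request(template_url, data=data, headers=headers, method="PUT"),
        # 3. Read the template back to check that it was stored.
        urllib.request.Request(template_url, method="GET"),
    ]


def send_all(requests):
    """Send the requests in order; urlopen raises HTTPError on a non-2xx reply
    (for example when the index to delete does not exist)."""
    for req in requests:
        with urllib.request.urlopen(req) as resp:
            print(req.get_method(), req.full_url, resp.status)


# Placeholder body; in practice use the full template from point 2.
steps = build_requests({"template": "nutch*", "mappings": {}})
for req in steps:
    print(req.get_method(), req.full_url)
# send_all(steps)  # uncomment to run against a live cluster
```

After these steps, the next Nutch indexing run creates the index again, since its name matches the "nutch*" pattern.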

You might also be interested in the issue FLUME-2787, as someone else seems to have gone through template creation there. You might find some nuggets in it.
