
Mapping data into Elasticsearch from Nutch 1.x

I've been working with Nutch 1.10 to run some small web crawls and index the crawl data into Elasticsearch 1.4.1. It seems that the only way to optimize the index mapping is to crawl first, review the mapping that ES generated on its own, and then change it (if necessary) with the mapping API.

Does anyone know of a more effective solution to optimize the mappings within an ES index for web crawling?

UPDATE: Is it even possible to update an ES mapping from a Nutch web crawl?

There are two things to consider here:

  1. What data is being indexed?
  2. How do you index it correctly into ES?

Regarding the indexed data, the indexing plugins you enable determine this. For example, index-basic adds content, host, url, etc. to every document. You can either check the plugins' documentation or simply inspect the output (as you did).
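As a rough illustration, the mapping from plugin to contributed fields can be summarized like this. The field lists below are from memory of Nutch 1.x defaults and are assumptions; verify them against the `plugin.includes` setting in your `nutch-site.xml` and each plugin's documentation:

```python
# Hypothetical summary of fields contributed by common Nutch 1.x indexing
# plugins -- verify against your plugin.includes and the plugin docs.
PLUGIN_FIELDS = {
    "index-basic":  ["content", "host", "url", "title", "tstamp"],
    "index-anchor": ["anchor"],
    "index-more":   ["type", "contentLength", "lastModified", "date"],
}

def indexed_fields(enabled_plugins):
    """Return the union of fields the enabled plugins would add to each doc."""
    fields = set()
    for plugin in enabled_plugins:
        fields.update(PLUGIN_FIELDS.get(plugin, []))
    return sorted(fields)

print(indexed_fields(["index-basic", "index-anchor"]))
```

Knowing this field list up front is what lets you write the mapping before the first crawl instead of reverse-engineering it afterwards.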

Once you know which fields are indexed and how you want to query them in the ES cluster, you can create a new index with the correct/optimized mappings and make sure Nutch indexes into that index.
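A minimal sketch of such a create-index body, using ES 1.x mapping syntax. The index name `nutch`, the type name `doc`, and the field choices are all assumptions for illustration; in ES 1.x, string fields you only filter or aggregate on (like `url` and `host`) are typically set to `not_analyzed`:

```python
import json

def nutch_index_body():
    """Build an ES 1.x create-index body with explicit mappings for fields
    that Nutch's index-basic plugin emits (field choices are illustrative)."""
    return {
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {
            "doc": {  # assumed type name; must match your Nutch ES indexer config
                "properties": {
                    "url":     {"type": "string", "index": "not_analyzed"},
                    "host":    {"type": "string", "index": "not_analyzed"},
                    "title":   {"type": "string"},
                    "content": {"type": "string"},
                    "tstamp":  {"type": "date"},
                }
            }
        },
    }

# This body would be PUT to the cluster before the first crawl, e.g.
#   curl -XPUT 'localhost:9200/nutch' -d '<the JSON below>'
print(json.dumps(nutch_index_body(), indent=2))
```

Point Nutch at the pre-created index (via the Elasticsearch indexer plugin's configuration in `nutch-site.xml`) and the explicit mappings take precedence over dynamic mapping for those fields.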

Of course, you can also re-index what you have already crawled (see this ES article on reindexing).
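Note that ES 1.4 predates the `_reindex` API, so re-indexing means scrolling the old index and bulk-writing each document into the new one, e.g. with `elasticsearch-py`'s `helpers.scan` and `helpers.bulk`. A stdlib-only sketch of the transformation step in between (the scroll and bulk client calls themselves are omitted, and the index/type names are placeholders):

```python
def reindex_actions(hits, target_index, target_type="doc"):
    """Turn scroll hits from the old index into bulk index actions
    targeting the new index (index/type names are placeholders)."""
    for hit in hits:
        yield {
            "_index": target_index,
            "_type": target_type,       # ES 1.x still uses mapping types
            "_id": hit["_id"],          # keep the original document id
            "_source": hit["_source"],  # copy the document body unchanged
        }

# Usage sketch: helpers.bulk(es, reindex_actions(helpers.scan(es, index="nutch"),
#                                                "nutch_v2"))
sample = [{"_id": "1", "_source": {"url": "http://example.org/"}}]
print(list(reindex_actions(sample, "nutch_v2"))[0]["_index"])
```

Because the target index was created with the optimized mappings beforehand, the copied documents pick up the new field settings as they are indexed.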

