
Mapping data into Elasticsearch from Nutch 1.x

I've been using Nutch 1.10 to run some small web crawls and indexing the crawl data with Elasticsearch 1.4.1. It seems that the only way to optimize the index mapping is to crawl first, review the mapping that ES generated on its own, and then change it accordingly (if necessary) with the mapping API.

Does anyone know of a more effective way to optimize the mappings within an ES index for web crawling?

UPDATE: Is it even possible to update an ES mapping from a Nutch web crawl?

There are two things to consider here:

  1. What data is indexed?
  2. How to index it correctly into ES.

Regarding the indexed data, this depends on the index plugins you use. For example, the index-basic plugin will add content, host, url, etc. to every doc. You can either check the plugins' documentation or simply look at the output (like you did).

Once you know what data is indexed and how you want to handle it in the ES cluster, you can create a new index in ES with the correct/optimized mappings, and make sure Nutch indexes into that index.
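As a rough sketch of that step, the index can be created with an explicit mapping before the first crawl is indexed. Everything named here is an assumption for illustration: the endpoint http://localhost:9200, the index name nutch_crawl, the mapping type doc, and the choice of not_analyzed for host and url. The mapping syntax is the ES 1.x one (string fields). On the Nutch side, the target index is presumably set through the elastic.index property in nutch-site.xml, but check the nutch-default.xml of your version.

```python
import json
import urllib.request

# Assumed values; adjust to your own setup.
ES_URL = "http://localhost:9200"   # assumed ES HTTP endpoint
INDEX_NAME = "nutch_crawl"         # hypothetical index name
DOC_TYPE = "doc"                   # assumed mapping type written by Nutch


def build_index_body():
    """Return index settings plus an explicit mapping (ES 1.x syntax)."""
    return {
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {
            DOC_TYPE: {
                "properties": {
                    # Full-text field: analyzed so it can be searched by term.
                    "content": {"type": "string"},
                    # Exact-match fields: stored as single, unanalyzed tokens.
                    "host": {"type": "string", "index": "not_analyzed"},
                    "url": {"type": "string", "index": "not_analyzed"},
                }
            }
        },
    }


def create_index(es_url=ES_URL, index_name=INDEX_NAME):
    """PUT the index with its mapping; raises urllib.error.HTTPError on failure."""
    data = json.dumps(build_index_body()).encode("utf-8")
    req = urllib.request.Request(
        "{}/{}".format(es_url, index_name), data=data, method="PUT"
    )
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling create_index() once against a running cluster, before running the Nutch indexing job, should leave ES with nothing to guess for these three fields. Any field not listed is still mapped dynamically.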

Of course, you can also re-index what you have already crawled (see this ES article).
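To illustrate the re-indexing route: as far as I know, ES 1.4 has no dedicated reindex endpoint, so the usual approach is to read documents out of the old index (for example with a scan/scroll search) and write them into the new one through the bulk API. The sketch below covers only the conversion step, turning a list of search hits into a bulk request body. The function name hits_to_bulk and the target index name are hypothetical, and fetching the hits and sending the body to /_bulk are left out.

```python
import json


def hits_to_bulk(hits, target_index):
    """Convert ES search hits into a newline-delimited bulk body.

    Each hit is expected to carry "_type", "_id" and "_source",
    as returned by a search or scroll request.
    """
    lines = []
    for hit in hits:
        # Action line: index this document into the new index,
        # keeping its original type and id.
        action = {
            "index": {
                "_index": target_index,
                "_type": hit["_type"],
                "_id": hit["_id"],
            }
        }
        lines.append(json.dumps(action))
        # Source line: the document itself, unchanged.
        lines.append(json.dumps(hit["_source"]))
    # The bulk API requires a trailing newline after the last line.
    return "\n".join(lines) + "\n" if lines else ""
```

Because the documents are written into an index that already has the explicit mapping, they are analyzed according to the new field definitions without having to crawl again.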
