
How to customize Apache Nutch 2.3 generate step

I want Nutch to select specific URLs according to my own rules. This selection happens at generate time. I know how to write a parser or indexer plugin, but how do I do the same at generate time? My Nutch version is the 2.3 series.

The Nutch generator is not an extension point in Nutch, so you cannot write a plugin to customize it. Nevertheless, nothing stops you from writing your own generator with your own logic.

You would need to adjust the bin/nutch and bin/crawl scripts to call your own generator instead of the default one. Keep in mind that other parts of Nutch rely on parts of the generator implementation (SegmentMerger, for instance). If you customize those parts, you'll need to update the classes that depend on them as well.

The generator uses the ScoringFilter.generatorSortValue() method when deciding which entries to return. A custom scoring filter is therefore one alternative that doesn't require changing the generator.
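To illustrate the scoring-filter route, here is a minimal, self-contained sketch of the kind of logic a generatorSortValue() implementation could apply. It deliberately does not use the Nutch API: the class SortValueSketch, the rule in matchesMyRule(), and the selectTopN() helper are all hypothetical, and the sketch assumes the generator prefers higher sort values and keeps only the top N entries.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the sort-value logic of a custom scoring filter.
// Nothing here is part of the Nutch API; all names are illustrative only.
class SortValueSketch {

    // Hypothetical selection rule: only URLs under /docs/ are wanted.
    static boolean matchesMyRule(String url) {
        return url.contains("/docs/");
    }

    // Mirrors the idea behind generatorSortValue(): take the initial sort
    // value and return an adjusted one. Unwanted URLs get the lowest value.
    static float sortValue(String url, float initSort) {
        return matchesMyRule(url) ? initSort : -Float.MAX_VALUE;
    }

    // Models the selection (an assumption about the generator's behaviour):
    // order by sort value, highest first, and keep at most topN entries.
    static List<String> selectTopN(List<String> urls, float initSort, int topN) {
        List<String> sorted = new ArrayList<>(urls);
        // Negate the value so that the natural ascending sort puts high values first.
        sorted.sort(Comparator.comparingDouble((String u) -> -sortValue(u, initSort)));
        return new ArrayList<>(sorted.subList(0, Math.min(topN, sorted.size())));
    }
}
```

Note that a low sort value only pushes a URL to the back of the queue. In this model an unwanted URL is still selected if fewer than topN wanted URLs are available, so this approach prioritizes and does not strictly exclude.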

Side note: this is not entirely uncommon. I've seen some clients requiring customized generators.

As suggested by Jorge, you could write a scoring filter that assigns scores to pages based on your own logic, and then filter on those scores during the generation step. Alternatively, if your selection rules can be determined from the URL alone, you could use a bespoke URL normaliser with a scope of generate (or whatever the value is) that rewrites unwanted URLs into something the URL filters would then discard. You'd need to activate filtering as part of the generate step. This is an ugly hack.
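The normaliser hack described above can be sketched as follows. This is a plain-Java model, not Nutch plugin code: the class name, the scope string "generate", the "rejected:" marker, and the /docs/ rule are all assumptions made for illustration. The only Nutch convention it borrows is that a URL filter returns null to discard a URL.

```java
import java.util.regex.Pattern;

// Hypothetical model of the "normaliser rewrites, filter discards" hack.
// It does not implement the Nutch plugin interfaces; names are illustrative.
class GenerateScopeHackSketch {

    // Placeholder scope name; the real value used by the generate step is not confirmed here.
    static final String GENERATE_SCOPE = "generate";

    // Hypothetical marker that no genuine URL is expected to start with.
    static final String REJECT_PREFIX = "rejected:";

    // Hypothetical selection rule: keep only URLs whose path starts with /docs/.
    private static final Pattern WANTED = Pattern.compile("^https?://[^/]+/docs/.*");

    // Normaliser step: outside the generate scope, leave the URL untouched;
    // inside it, tag URLs that fail the rule so the filter can reject them.
    static String normalize(String url, String scope) {
        if (!GENERATE_SCOPE.equals(scope)) {
            return url;
        }
        return WANTED.matcher(url).matches() ? url : REJECT_PREFIX + url;
    }

    // Filter step: return the URL to keep it, or null to discard it.
    static String filter(String url) {
        return url.startsWith(REJECT_PREFIX) ? null : url;
    }

    // Both steps chained, as they would need to be at generate time.
    static boolean selectedAtGenerate(String url) {
        return filter(normalize(url, GENERATE_SCOPE)) != null;
    }
}
```

Restricting the rewrite to the generate scope matters: if the normaliser tagged URLs in every scope, they would also be altered at inject or update time, which is not what the hack intends.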

Nutch 2.x is really awkward, and I am not sure you could create a copy of your table based on a filter of the original one.

Which Gora backend do you use?

StormCrawler is a lot more flexible for this, and we've recently added a mechanism for filtering URLs at the spout level, which is exactly what you'd need. You could do something similar in Nutch 2.x, but that would probably mean changing things in Gora as well.
