
solr + Heritrix


Posted 2009-11-03 03:37:15. Tags: search, indexing, search-engine, solr, web-crawler

How can I integrate Solr with Heritrix?

I want to archive a site using Heritrix and then index and search the archive locally using Solr.

Thanks

4 answers

The problem with using Solr to index is that it is a straight text index (which may be fine if you are only crawling an internal website and don't care about 'pagerank').

Using Nutch, however, will give you a much better index, as it does use pagerank.

NutchWAX

If, however, you are dead set on using Heritrix and would like pagerank-based search results, you could use NutchWAX (Nutch Web Archive eXtensions) to index Heritrix's output (that's what the makers of Heritrix are doing).

NutchWAX is intended for web archives but can also be used to create a search engine of the live web (in fact that is easier, as you aren't dragging years' worth of data along during each rebuild of the index).

Solr

If you do want to use Heritrix+Solr to create a search website, you should probably replace the "ARCWriter" processor in Heritrix with a custom processor that submits the contents of the page to Solr.

The Solr end is just an XML file posted via HTTP and is dead simple.
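To give an idea of what "an XML file posted via HTTP" looks like, here is a minimal Python sketch. The update URL is the one used elsewhere in this thread; the field names `id` and `text` are assumptions and have to match whatever your Solr schema defines, and `build_add_xml` and `post_to_solr` are just illustrative helper names.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Build a Solr <add><doc>...</doc></add> update message from a dict of fields."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = value  # ElementTree escapes <, > and & on serialisation
    return ET.tostring(add, encoding="utf-8")


def post_to_solr(update_url, body):
    """POST an XML update message (bytes) to Solr and return the HTTP status code."""
    request = urllib.request.Request(
        update_url,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With a Solr instance running locally, indexing one page would be something like `post_to_solr("http://localhost:8983/solr/update", build_add_xml({"id": url, "text": page_text}))`, followed by posting `b"<commit/>"` to the same URL so the document becomes searchable.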

The Heritrix end is a little bit more complicated, but the Developer's Manual will get you started on writing a Processor for Heritrix 1.x (if you are using the as-yet unstable 3.x, or the discontinued 2.x, you'll need to do a little more legwork, as the documentation isn't there yet).

There is a section in the Solr 1.4 Enterprise Search book about using Heritrix and Solr together. Basically, use Heritrix to crawl, and then in a separate process parse the archive files and add them to Solr. While you lose out on things like the page rank scores that Nutch provides, it does simplify things because your crawler and your search engine are separate tools.
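The "separate process" step might look roughly like the Python sketch below. It is not the code from the book. It assumes the ARC v1 layout (a space-separated header line whose first field is the URL, fourth the content type, and last the payload length in bytes, followed by the payload and a newline), and it assumes the stored HTTP headers end at the first blank CRLF line. The function names are made up for illustration.

```python
import gzip


def read_arc_records(stream):
    """Yield (url, mime_type, payload) for each record in an uncompressed ARC v1 stream."""
    while True:
        header = stream.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # blank separator line between records
        parts = header.split()
        if len(parts) < 5:
            raise ValueError("unexpected ARC header: %r" % header)
        url = parts[0].decode("utf-8", "replace")
        mime_type = parts[3].decode("ascii", "replace")
        length = int(parts[-1])  # payload size in bytes
        yield url, mime_type, stream.read(length)


def html_pages(stream):
    """Yield (url, body) for HTML responses, dropping the stored HTTP headers."""
    for url, mime_type, payload in read_arc_records(stream):
        if not url.startswith("http") or not mime_type.startswith("text/html"):
            continue  # skips the filedesc:// record, DNS records, images, etc.
        _headers, _sep, body = payload.partition(b"\r\n\r\n")
        yield url, body.decode("utf-8", "replace")


def open_arc(path):
    """Open an .arc or .arc.gz file as a binary stream (gzip handles multi-member files)."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")
```

Each `(url, body)` pair coming out of `html_pages(open_arc(path))` can then be turned into a Solr document. In practice you would also want to strip the HTML markup and honour the page's declared character set, which this sketch leaves out.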

This is basically the approach that Mauricio uses, storing data in MySQL as an intermediate step. We published all the source for the book on an Amazon EC2 AMI; look for "solrbook". Also, the support site at Packt (http://www.packtpub.com/solr-1-4-enterprise-search-server) will let you download the sample.

For the same purpose I used YouSeer.

First download YouSeer.jar, then run:

java -jar YouSeer.jar http://localhost:8983/solr/update /cygdrive/d/arcs /cached 3 0

It internally uses the ArcReader to read documents and then uploads them to Solr. The YouSeer code is fairly simple, and I had to modify it a bit for my purposes.

According to this message, yes:

It is pretty easy to add custom writers to Heritrix. We write our crawls to MySQL and then ingest into Solr from there. It would not be hard to write a Heritrix writer that writes directly to Solr, however.

-- Sean Timm