
Lucene indexing HTML documents

I would like to index 1 million HTML documents in Lucene. I need to index several HTML files in a single Lucene document. Later, I would like the search response to tell me which original HTML document matched.

So, for example, I have:

1.home.html
2.history.html
3.about.html

4.home2.html
...

I want to index 1, 2 and 3 in the same Lucene document. Then, if I search for any text, I want to know which original document it came from (home, history or about).

I have been searching the Internet and found Lucene payloads, so I have been thinking about storing the URL of the original document as a payload on every term. Is this a good solution? Would the performance be all right?

Thanks very much for your help.

I think what you need is Apache Solr http://lucene.apache.org/solr/ . It uses Lucene as its indexing engine and has a querying / web interface for searching.

Look at this tutorial on the site: http://lucene.apache.org/solr/4_3_1/tutorial.html

I have been working on this problem for two days and I think I have found the solution.

I index every HTML page as its own document, with an ID shared by the pages that belong together, for example:

1.home.html     id1  htmlcontent
2.history.html  id1  htmlcontent
3.about.html    id1  htmlcontent

4.home2.html    id2  htmlcontent
...

Later, I can use org.apache.lucene.search.grouping to group the results by this ID.

http://lucene.apache.org/core/3_2_0/api/contrib-grouping/org/apache/lucene/search/grouping/package-summary.html
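To make the idea concrete, here is a minimal plain-Java sketch of the scheme above. It does not call the Lucene API at all; it only models what grouping by the shared ID gives you. The names `Page`, `sampleIndex` and `searchGrouped`, the sample contents, and the substring match standing in for a real query are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java model of "one document per page plus a shared ID".
// This is NOT the Lucene grouping API, only an illustration of its effect.
final class GroupingSketch {

    // One indexed page: file name, shared group ID, text (hypothetical fields).
    static final class Page {
        final String fileName;
        final String groupId;
        final String content;

        Page(String fileName, String groupId, String content) {
            this.fileName = fileName;
            this.groupId = groupId;
            this.content = content;
        }
    }

    // Made-up sample data mirroring the table above.
    static List<Page> sampleIndex() {
        return Arrays.asList(
            new Page("home.html", "id1", "Welcome to our company"),
            new Page("history.html", "id1", "The company was founded in 1990"),
            new Page("about.html", "id1", "Contact and imprint"),
            new Page("home2.html", "id2", "Another company site"));
    }

    // Returns the matching pages bucketed by group ID, in index order.
    // A case-insensitive substring test stands in for a real query.
    static Map<String, List<String>> searchGrouped(List<Page> index, String term) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        String needle = term.toLowerCase(Locale.ROOT);
        for (Page page : index) {
            if (page.content.toLowerCase(Locale.ROOT).contains(needle)) {
                groups.computeIfAbsent(page.groupId, id -> new ArrayList<>())
                      .add(page.fileName);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Prints {id1=[home.html, history.html], id2=[home2.html]}
        System.out.println(searchGrouped(sampleIndex(), "company"));
    }
}
```

Each hit still carries its own file name, so the original page is known, while the group ID ties the pages of one site together.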

Hope this helps anybody :)

They are two different Lucene features:

1. Grouping: this lets you group search results by a specified field. For example, if you group by the author field, all documents with the same value in the author field fall into a single group. The output is a kind of tree.

http://lucene.apache.org/core/3_2_0/api/contrib-grouping/org/apache/lucene/search/grouping/package-summary.html

2. Facet: this feature doesn't group documents; it just tells you how many documents fall under each value of a facet. For example, if you have a facet based on the author field, you will receive a list of all your authors, and for each author you will know how many documents belong to that author. Afterwards, if you want to see those documents, you have to query once more with a specific filter (author=whatever). Faceted search is in fact based on browsing documents by applying multiple filters to progressively reach the documents you're really interested in.
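The two steps of faceting described here (counts first, then a second filtered query) can be sketched in plain Java. Again, this is only a model and not the Lucene facet API; `Doc`, `sampleDocs`, `countByAuthor`, `titlesByAuthor` and the sample data are hypothetical names and values chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of faceting on an author field.
// This is NOT the Lucene facet API, only an illustration of the idea.
final class FacetSketch {

    // A document with a title and an author (hypothetical fields).
    static final class Doc {
        final String title;
        final String author;

        Doc(String title, String author) {
            this.title = title;
            this.author = author;
        }
    }

    // Made-up sample data.
    static List<Doc> sampleDocs() {
        return Arrays.asList(
            new Doc("Lucene basics", "alice"),
            new Doc("Solr setup", "bob"),
            new Doc("Scoring explained", "alice"));
    }

    // Facet step: count documents per author; no documents are returned.
    static Map<String, Integer> countByAuthor(List<Doc> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Doc doc : docs) {
            counts.merge(doc.author, 1, Integer::sum);
        }
        return counts;
    }

    // Drill-down step: a second pass with the filter author=<value>.
    static List<String> titlesByAuthor(List<Doc> docs, String author) {
        List<String> titles = new ArrayList<>();
        for (Doc doc : docs) {
            if (doc.author.equals(author)) {
                titles.add(doc.title);
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        List<Doc> docs = sampleDocs();
        System.out.println(countByAuthor(docs));           // {alice=2, bob=1}
        System.out.println(titlesByAuthor(docs, "alice")); // [Lucene basics, Scoring explained]
    }
}
```

Compared with grouping, the first call returns only counts; the documents themselves appear only after the filter is applied.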

Here are some tutorials:

http://lucene.apache.org/core/4_3_1/facet/org/apache/lucene/facet/doc-files/userguide.html

http://lucene.apache.org/core/4_3_1/facet/org/apache/lucene/facet/search/package-summary.html

Just go through them and work it out as per your needs.
