How to index documents containing nested properties with Lucene?
I'll try to reduce my case to the essentials: I'm building a web app (with Spring) with a search interface that lets you search a corpus of annotated/tagged texts. In my DB (MongoDB), one document represents one page of a book collection (totaling ~8000 pages).
Here is an example of the document structure in JSON (I removed a lot of metadata for brevity; also, and this is important, the "tokens" array contains up to 700 objects in most cases):
{
    "_id" : ObjectId("5622c29eef86d3c2f23fd62c"),
    "scanId" : "592ea208b6d108ee5ae63f79",
    "volume" : "Volume I",
    "chapters" : [
        "Some Chapter Name"
    ],
    "languages" : [
        "English",
        "German"
    ],
    "tokens" : [
        {
            "form" : "The",
            "index" : 0,
            "tags" : [
                "ART"
            ]
        },
        {
            "form" : "house",
            "index" : 1,
            "tags" : [
                "NN",
                "NN_P"
            ]
        },
        {
            "form" : "is",
            "index" : 2,
            "tags" : [
                "V",
                "CONJ_C"
            ]
        }
    ]
}
So you see, I don't have plain text here. I now want to build an index with Lucene to quickly search this DB. The problem is that I want to be able to search for certain words, their tags AND the context around them, for example: "give me all documents containing the word 'house' tagged as 'NN' followed by a word tagged with 'V'". I couldn't find a way to index these sub-structures with native Lucene functionality.
What I tried, to at least be able to search for words and their tags, is the following: in my Lucene index, a document doesn't represent a whole page, but only a single word/token with its tags. So one index document looks like this (expressed in JSON syntax for readability):
{
    "token" : "house",
    "tag" : "NN",
    "tag" : "NN_P",
    "index" : 1,
    "pageId" : "5622c29eef86d3c2f23fd62c"
}
... Yes, Lucene allows me to use one field multiple times. So now I can search for a word and its tags and get a reference to the page object in my DB via its ID. But this is pretty ugly for two reasons: I now have two completely different document representations (DB and Lucene index), and to process a complex query like the one I mentioned above, I'd have to query for the word and its tag and then check the context of the hits in the retrieved documents manually.
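For illustration, the two-step workaround described above might look roughly like the following Python sketch. It works on plain dictionaries shaped like the page document shown earlier. `flatten_page` and `followed_by` are hypothetical helper names, the `_id` is simplified to a string, and no Lucene or MongoDB calls are involved.

```python
def flatten_page(page):
    """Turn one page document into one flat record per token."""
    return [
        {
            "token": tok["form"].lower(),
            "tags": list(tok["tags"]),
            "index": tok["index"],
            "pageId": page["_id"],
        }
        for tok in page["tokens"]
    ]


def followed_by(page, form, tag, next_tag):
    """Manual context check on a retrieved page: is `form` tagged `tag`
    directly followed by a token tagged `next_tag`?"""
    tokens = sorted(page["tokens"], key=lambda t: t["index"])
    for current, nxt in zip(tokens, tokens[1:]):
        if (
            nxt["index"] == current["index"] + 1
            and current["form"].lower() == form.lower()
            and tag in current["tags"]
            and next_tag in nxt["tags"]
        ):
            return True
    return False


# Example page, reduced to the fields the sketch needs.
page = {
    "_id": "5622c29eef86d3c2f23fd62c",
    "tokens": [
        {"form": "The", "index": 0, "tags": ["ART"]},
        {"form": "house", "index": 1, "tags": ["NN", "NN_P"]},
        {"form": "is", "index": 2, "tags": ["V", "CONJ_C"]},
    ],
}

records = flatten_page(page)
print(records[1]["token"], records[1]["tags"])
print(followed_by(page, "House", "NN", "V"))
```

The second function is the part that has to run outside the index in this setup, which is exactly the manual post-filtering the question would like to avoid.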
So my question is: Is there a way to index documents in Lucene containing fields/properties whose values are nested objects that in turn have certain properties?
Elasticsearch certainly lets you do this. I think it's possible to do all of it in pure Lucene, but it may take some effort.
Basically, you need to use the 'nested' query: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-nested-query.html. For that, the "tokens" field first has to be mapped as 'nested':
PUT /my_index
{
    "mappings": {
        "type1" : {
            "properties" : {
                "tokens" : {
                    "type" : "nested"
                }
            }
        }
    }
}
This tells ES to index the contents of this field as a list of separate documents. (In Elasticsearch 7 and later, mapping types such as "type1" are deprecated, and "properties" goes directly under "mappings".) You can then query the tokens individually using the 'nested' query:
GET my_index/_search
{
    "query": {
        "nested": {
            "path": "tokens",
            "query": {
                "bool": {
                    "must": [
                        { "match": { "tokens.form": "house" }},
                        { "match": { "tokens.tags": "NN" }}
                    ]
                }
            }
        }
    }
}
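To see why the 'nested' type matters here, the following Python sketch imitates the difference in matching behaviour on the example page. It is a simplified in-memory model, not Elasticsearch code: `matches_flattened` and `matches_nested` are made-up names, and the pooling of values is only an approximation of how an array of plain objects is indexed when the field is not mapped as 'nested'.

```python
tokens = [
    {"form": "The", "index": 0, "tags": ["ART"]},
    {"form": "house", "index": 1, "tags": ["NN", "NN_P"]},
    {"form": "is", "index": 2, "tags": ["V", "CONJ_C"]},
]


def matches_flattened(tokens, form, tag):
    """Without 'nested': the values of all objects are pooled per field,
    so the link between a form and its own tags is lost."""
    all_forms = {t["form"].lower() for t in tokens}
    all_tags = {label for t in tokens for label in t["tags"]}
    return form.lower() in all_forms and tag in all_tags


def matches_nested(tokens, form, tag):
    """With 'nested': both conditions must hold within the same object."""
    return any(
        t["form"].lower() == form.lower() and tag in t["tags"] for t in tokens
    )


print(matches_flattened(tokens, "house", "V"))  # True: a false positive
print(matches_nested(tokens, "house", "V"))     # False
print(matches_nested(tokens, "house", "NN"))    # True
```

Note that each nested object is matched in isolation, so as far as I can tell the "followed by a word tagged with 'V'" part of the question is not covered by this query alone and would still need extra handling, for example a check on the "index" field of the matching tokens.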