How to index documents containing nested properties with Lucene?

Question

I'll try to reduce my case to the essentials: I'm building a web app (with Spring) with a search interface that lets you search a corpus of annotated/tagged texts. In my DB (MongoDB), one document represents one page of a book collection (totaling ~8000 pages).

Here is an example of the document structure in JSON (I removed a lot of metadata for brevity. Also, and this is important, the "tokens" array contains up to 700 objects in most cases):

{
    "_id" : ObjectId("5622c29eef86d3c2f23fd62c"),
    "scanId" : "592ea208b6d108ee5ae63f79",
    "volume" : "Volume I",
    "chapters" : [
        "Some Chapter Name"
    ],
    "languages" : [
        "English",
        "German"
    ],
    "tokens" : [
        {
            "form" : "The",
            "index" : 0,
            "tags" : [
                "ART"
            ]
        },
        {
            "form" : "house",
            "index" : 1,
            "tags" : [
                "NN",
                "NN_P"
            ]
        },
        {
            "form" : "is",
            "index" : 2,
            "tags" : [
                "V",
                "CONJ_C"
            ]
        }
    ]
}

So you see, I don't have plain text here. I now want to build an index with Lucene to quickly search this DB. The problem is that I want to be able to search for certain words, their tags, AND the context around them, for example: "give me all documents containing the word 'House' tagged as 'NN' followed by a word tagged with 'V'". I couldn't find a way to index these sub-structures with native Lucene functionality.
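To make the intended query precise, the condition can be written out as a brute-force check on a single page document shaped like the example above. This is only a sketch in plain Python with no Lucene involved; the function name `matches_sequence` and the case-insensitive comparison of word forms are assumptions made for illustration.

```python
def matches_sequence(page, form, tag, next_tag):
    """Return True if some token has the given form and tag and is
    immediately followed by a token that carries next_tag."""
    # Order tokens by their position on the page (assumed consecutive).
    tokens = sorted(page["tokens"], key=lambda t: t["index"])
    # Walk over adjacent pairs (token, following token).
    for current, following in zip(tokens, tokens[1:]):
        if (current["form"].lower() == form.lower()
                and tag in current["tags"]
                and next_tag in following["tags"]):
            return True
    return False


# Reduced version of the example page (metadata omitted).
page = {
    "tokens": [
        {"form": "The", "index": 0, "tags": ["ART"]},
        {"form": "house", "index": 1, "tags": ["NN", "NN_P"]},
        {"form": "is", "index": 2, "tags": ["V", "CONJ_C"]},
    ]
}

print(matches_sequence(page, "House", "NN", "V"))
```

Running this over all ~8000 pages for every query is exactly what an index is supposed to avoid, but it states the semantics any index-based solution has to reproduce.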

What I tried, to at least be able to search for words and their tags, is the following: in my Lucene index, a document doesn't represent a whole page, but only a single word/token with its tags. So one index document looks like this (expressed in JSON syntax for readability):

{
    "token" : "house",
    "tag" : "NN",
    "tag" : "NN_P",
    "index" : 1,
    "pageId" : "5622c29eef86d3c2f23fd62c"
}

... Yes, Lucene allows me to use one field multiple times. So now I can search for a word and its tags and get a reference to the page object in my DB via its ID. But this is pretty ugly for two reasons: I now have two completely different document representations (DB and Lucene index), and to process a complex query like the one I mentioned above, I'd have to query for the word and its tag and then manually check the context of the hits in the retrieved documents.
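The token-per-document approach and the follow-up context check can be sketched as below. This is plain Python with an in-memory list standing in for the Lucene index, not Lucene code; the function names (`flatten_page`, `build_lookup`, `hits_followed_by`) are hypothetical, and lowercasing the token is an assumption. Since every record carries "pageId" and "index", the neighbouring token can be fetched with a second lookup on (pageId, index + 1) instead of loading the whole page.

```python
def flatten_page(page):
    """Turn one page document into one record per token."""
    return [
        {
            "token": tok["form"].lower(),   # assumption: index lowercased forms
            "tag": list(tok["tags"]),       # multi-valued field
            "index": tok["index"],
            "pageId": page["_id"],
        }
        for tok in page["tokens"]
    ]


def build_lookup(records):
    """Key records by (pageId, index) so a neighbour can be fetched directly."""
    return {(r["pageId"], r["index"]): r for r in records}


def hits_followed_by(records, token, tag, next_tag):
    """Step 1: find token/tag hits. Step 2: check the token that follows."""
    by_position = build_lookup(records)
    result = []
    for r in records:
        if r["token"] == token and tag in r["tag"]:
            nxt = by_position.get((r["pageId"], r["index"] + 1))
            if nxt is not None and next_tag in nxt["tag"]:
                result.append((r["pageId"], r["index"]))
    return result


page = {
    "_id": "5622c29eef86d3c2f23fd62c",
    "tokens": [
        {"form": "The", "index": 0, "tags": ["ART"]},
        {"form": "house", "index": 1, "tags": ["NN", "NN_P"]},
        {"form": "is", "index": 2, "tags": ["V", "CONJ_C"]},
    ],
}

records = flatten_page(page)
print(hits_followed_by(records, "house", "NN", "V"))
```

The drawback described above remains: the second step is application logic, not something the index answers in one query.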

So my question is: Is there a way to index documents in Lucene containing fields/properties whose values are nested objects that in turn have certain properties?

Answer 1

Is there a way to index documents in Lucene containing fields/properties whose values are nested objects that in turn have certain properties?

Elasticsearch certainly lets you do this. I think all of it is possible in pure Lucene, but it may take some effort.

Basically, you need to map the field as 'nested' and then use the 'nested' query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-nested-query.html). The mapping looks like this:

PUT /my_index
{
    "mappings": {
        "type1" : {
            "properties" : {
                "tokens" : {
                    "type" : "nested"
                }
            }
        }
    }
}

This tells ES to index the contents of this field as a list of separate documents, allowing you to query them individually using the 'nested' query:

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "tokens",
      "query": {
        "bool": {
          "must": [
            { "match": { "tokens.form": "house" }},
            { "match": { "tokens.tags": "NN" }}
          ]
        }
      }
    }
  }
}
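Why the 'nested' mapping matters can be illustrated with a toy model in plain Python. It only imitates the matching semantics (an array of objects indexed without 'nested' loses the association between the fields of one object); it is not how Elasticsearch stores or searches data, and the function names are hypothetical.

```python
def flattened_match(page, form, tag):
    """Without 'nested': values of all tokens are merged per field,
    so the form and the tag may come from different tokens."""
    all_forms = {t["form"].lower() for t in page["tokens"]}
    all_tags = {x for t in page["tokens"] for x in t["tags"]}
    return form in all_forms and tag in all_tags


def nested_match(page, form, tag):
    """With 'nested': both conditions must hold on the same token."""
    return any(
        t["form"].lower() == form and tag in t["tags"]
        for t in page["tokens"]
    )


page = {
    "tokens": [
        {"form": "The", "index": 0, "tags": ["ART"]},
        {"form": "house", "index": 1, "tags": ["NN", "NN_P"]},
        {"form": "is", "index": 2, "tags": ["V", "CONJ_C"]},
    ]
}

# "house" is never tagged "ART", yet the flattened model reports a match.
print(flattened_match(page, "house", "ART"))
print(nested_match(page, "house", "ART"))
print(nested_match(page, "house", "NN"))
```

Note that the query above covers the word-and-tag part of the question; the "followed by a word tagged 'V'" condition spans two tokens and is not expressed by it.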

1 answer

solution1
0 ACCEPTED 2017-06-12 11:18:22
