
Is it possible to search for words inside a Lucene index by part of speech?

I have a large set of documents stored in a Lucene index, and I am using a custom analyzer that basically tokenizes and stems the documents' content.

Now, if I search the documents for the word "love", I get results where "love" is used as either a noun or a verb, but I want only the documents that use "love" as a verb.

How can such a feature be implemented, so that I can specify the part of speech along with the word and get only results where "love" is used as a verb and not as a noun?

One way I can think of is to part-of-speech tag each word of the document up front, store each word with its POS tag appended after a '_' or a similar separator, and then search accordingly. Is there a smarter way to do this in Lucene?

I can think of the following approaches.

Approach 1

Just as you mentioned: determine the part-of-speech tag and append it to the actual term while indexing, then do the same while querying.

This approach has some drawbacks worth discussing.
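
As a rough illustration of the tagging step, here is a minimal sketch in plain Java. It is not Lucene code: the class name `PosTermTagger`, the '_' separator, and the tag names are assumptions made for the example, and the POS tags are expected to come from whatever tagger you already use. In a real setup the same transformation would run at both index time and query time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: builds index terms of the form word_POS.
final class PosTermTagger {
    static final char SEPARATOR = '_';

    /** Joins a (stemmed) token and its POS tag into a single term, e.g. love + VERB gives love_VERB. */
    static String tagTerm(String token, String pos) {
        return token.toLowerCase(Locale.ROOT) + SEPARATOR + pos.toUpperCase(Locale.ROOT);
    }

    /** Tags a whole token stream; tokens and posTags must be aligned one to one. */
    static List<String> tagTerms(List<String> tokens, List<String> posTags) {
        if (tokens.size() != posTags.size()) {
            throw new IllegalArgumentException("tokens and posTags must have the same length");
        }
        List<String> tagged = new ArrayList<>(tokens.size());
        for (int i = 0; i < tokens.size(); i++) {
            tagged.add(tagTerm(tokens.get(i), posTags.get(i)));
        }
        return tagged;
    }
}
```

A query for "love" as a verb would then look up the term produced by `PosTermTagger.tagTerm("love", "VERB")` instead of the bare word.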

Cons:

1) Future requirements might demand results irrespective of part of speech. An index that contains the modified terms won't work for such queries.

2) You might want to execute a BooleanQuery such as "term: noun or adjective". You would have to write the query expander yourself.
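
The query expander mentioned in the second point could look roughly like the sketch below. It is plain Java under stated assumptions: the class `PosQueryExpander`, the '_' separator, and the `ALL_TAGS` list are hypothetical and must match whatever was used at index time. Passing an empty tag list expands to every known tag, which is one way to approximate a search irrespective of part of speech on such an index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: expands one word into its word_POS variants.
final class PosQueryExpander {
    // Assumed tag set; replace with the tags your POS tagger actually emits.
    static final List<String> ALL_TAGS = Arrays.asList("NOUN", "VERB", "ADJ", "ADV");

    /** Returns one tagged term per requested tag; an empty request means all known tags. */
    static List<String> expand(String word, List<String> wantedTags) {
        List<String> tags = wantedTags.isEmpty() ? ALL_TAGS : wantedTags;
        List<String> terms = new ArrayList<>(tags.size());
        for (String tag : tags) {
            terms.add(word.toLowerCase(Locale.ROOT) + "_" + tag.toUpperCase(Locale.ROOT));
        }
        return terms;
    }

    /** Renders the terms as an OR query string, for illustration only. */
    static String toQueryString(String field, List<String> terms) {
        StringBuilder sb = new StringBuilder();
        for (String term : terms) {
            if (sb.length() > 0) {
                sb.append(" OR ");
            }
            sb.append(field).append(':').append(term);
        }
        return sb.toString();
    }
}
```

In Lucene itself, each expanded term would more likely become a TermQuery added to a BooleanQuery as an optional (SHOULD) clause, rather than going through a query string that the analyzer might split on the separator.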

Approach 2

Try using the Payloads feature of Lucene.

Here is a brief tutorial on Lucene Payloads.

Steps to address your use case:

1) Store the part-of-speech tag in the form of a Payload.

2) Have custom Similarity classes for each part-of-speech tag.

3) Based on the query, assign the corresponding CustomSimilarity to the IndexSearcher. For example, assign NounBoostingSimilarity for a noun query.

4) Boost or reduce the score of a document based on the payload. An example is given in the tutorial above.

5) Write a custom collector to filter out the documents whose scores do not conform to the score-boosting logic above.

The advantage of this approach is that the index remains compatible with any other normal search.
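
To make the idea behind these steps more concrete, here is a small in-memory model in plain Java. It deliberately does not use the Lucene API: `Posting`, `payloadFactor`, and `search` are hypothetical names that only mimic the flow of storing the POS tag as payload bytes next to each term occurrence (step 1), turning the payload into a score factor (step 4), and dropping documents whose score shows no match (step 5). In Lucene the payload would be attached per token, typically through PayloadAttribute in a custom TokenFilter, and the scoring would live in the Similarity.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory illustration only; none of these types belong to Lucene.
final class PayloadFilterSketch {
    /** One occurrence of the searched term in a document, with the POS tag kept as payload bytes. */
    static final class Posting {
        final int docId;
        final byte[] payload;

        Posting(int docId, String posTag) {
            this.docId = docId;
            this.payload = posTag.getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Step 4: score factor derived from the payload, 1 for the wanted POS and 0 otherwise. */
    static float payloadFactor(byte[] payload, String wantedPos) {
        return Arrays.equals(payload, wantedPos.getBytes(StandardCharsets.UTF_8)) ? 1.0f : 0.0f;
    }

    /** Steps 4 and 5: sum the factors per document, then keep only documents with a positive score. */
    static List<Integer> search(List<Posting> postings, String wantedPos) {
        Map<Integer, Float> scores = new LinkedHashMap<>();
        for (Posting p : postings) {
            scores.merge(p.docId, payloadFactor(p.payload, wantedPos), Float::sum);
        }
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, Float> e : scores.entrySet()) {
            if (e.getValue() > 0.0f) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

Because the term itself is stored unmodified and the tag travels only in the payload, a search that ignores the payload still sees every occurrence of the word.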

Cons:

1) Maintenance overhead: you have to maintain multiple IndexSearchers, one for each Similarity.

2) The solution is somewhat complicated to code.

To be frank, I'm not satisfied with my own solution, but I wanted to let you know that another way exists. It all depends on your scenario, for example whether the project is a one-time academic project or a commercial one.

