
Is it possible to search for words inside a Lucene index by part of speech

Original post 2013-04-13 13:53:20 3 1 java/ solr/ lucene/ nlp/ tokenize

I have a large set of documents stored in a Lucene index, and I am using a custom Analyzer that tokenizes and stems the documents' content.

Now, if I search the documents for the word "love", I get results where "love" is used as either a noun or a verb, but I want only the documents that use "love" as a verb.

How can I implement a feature that lets me specify the part of speech along with the word, so that the results contain only "love" used as a verb and not as a noun?

One way I can think of is to part-of-speech tag each word of the document up front, store each word with its POS tag appended using a separator such as '_', and then search accordingly. However, I would like to know whether there is a smarter way to do this in Lucene.

1 answer

I can think of the following approaches.

Approach 1

As you mentioned: determine the part-of-speech tag and append it to the actual term while indexing, and do the same while querying.
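To make the idea concrete, here is a minimal plain-Java sketch of indexing and querying with POS-suffixed terms. It is not Lucene code: `PosTermIndex` is a hypothetical toy inverted index standing in for the real index, the coarse tags (`VERB`, `NOUN`, ...) are an assumed tag set, and the tags are assumed to come from an external POS tagger that runs before stemming.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy inverted index; a stand-in for a Lucene index, not Lucene API.
class PosTermIndex {
    // Index term (e.g. "love_VERB") -> ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Joins a word and its POS tag into a single index term.
    static String encode(String word, String posTag) {
        return word.toLowerCase(Locale.ROOT) + "_" + posTag;
    }

    // Indexes one document given as parallel lists of words and POS tags.
    void add(int docId, List<String> words, List<String> tags) {
        if (words.size() != tags.size()) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        for (int i = 0; i < words.size(); i++) {
            String term = encode(words.get(i), tags.get(i));
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // The query is encoded the same way as the indexed terms.
    Set<Integer> search(String word, String posTag) {
        return postings.getOrDefault(encode(word, posTag), new TreeSet<>());
    }

    public static void main(String[] args) {
        PosTermIndex idx = new PosTermIndex();
        idx.add(1, Arrays.asList("I", "love", "pizza"), Arrays.asList("PRON", "VERB", "NOUN"));
        idx.add(2, Arrays.asList("love", "is", "blind"), Arrays.asList("NOUN", "VERB", "ADJ"));
        System.out.println(idx.search("love", "VERB")); // [1]
        System.out.println(idx.search("love", "NOUN")); // [2]
    }
}
```

In a real setup the same encoding step would sit inside the analysis chain, so that index-time and query-time terms are produced identically.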

This approach has some drawbacks worth discussing.

Cons:

1) Future requirements might call for results irrespective of part of speech. An index that contains the modified terms won't work for such queries.

2) You might want to execute a BooleanQuery such as "term: noun or adjective". You would have to write the query expander yourself.
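The query expander from the second point could look roughly like the sketch below. It assumes terms are indexed as `word_TAG`, and it only builds a query string using the `OR` operator of Lucene's classic query syntax; `PosQueryExpander` and the tag names are hypothetical, and a real implementation might build a BooleanQuery programmatically instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical helper: expands one word plus several POS tags into an OR query string.
class PosQueryExpander {
    static String expand(String word, List<String> posTags) {
        if (posTags.isEmpty()) {
            throw new IllegalArgumentException("at least one POS tag is required");
        }
        String w = word.toLowerCase(Locale.ROOT);
        // One POS-suffixed term per requested tag, joined with OR and wrapped in parentheses.
        return posTags.stream()
                .map(tag -> w + "_" + tag)
                .collect(Collectors.joining(" OR ", "(", ")"));
    }

    public static void main(String[] args) {
        // "love" as a noun or an adjective
        System.out.println(expand("love", Arrays.asList("NOUN", "ADJ"))); // (love_NOUN OR love_ADJ)
    }
}
```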

Approach 2

Try using the Payloads feature of Lucene.

Here is a brief tutorial on Lucene Payloads.

Steps to address your use case:

1) Store the part-of-speech tag in the form of a Payload.

2) Have custom Similarity classes for each part-of-speech tag.

3) Based on the query, assign the corresponding CustomSimilarity to the IndexSearcher. For example, assign NounBoostingSimilarity for a noun query.

4) Boost or reduce the score of a document based on the payload. An example is given in the tutorial mentioned above.

5) Write a custom Collector to filter out the documents whose scores do not conform to the score-boosting logic above.
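The steps above can be modeled in a few lines of plain Java to show how the pieces fit together. This is only a sketch of the idea, not Lucene code: `PayloadPosSearch` and `Posting` are hypothetical types, the POS string stored on each posting plays the role of the payload, `payloadFactor` stands in for the payload-aware Similarity, and the final filtering step stands in for the custom Collector.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java model of the payload approach; none of these types are Lucene classes.
class PayloadPosSearch {
    // One occurrence of a term in a document, with its POS tag kept as side data.
    static final class Posting {
        final int docId;
        final String posPayload;
        Posting(int docId, String posPayload) {
            this.docId = docId;
            this.posPayload = posPayload;
        }
    }

    // Unmodified term -> all of its occurrences.
    private final Map<String, List<Posting>> postings = new HashMap<>();

    // Step 1: the term itself is indexed unchanged; the POS tag travels alongside it.
    void add(int docId, String term, String pos) {
        postings.computeIfAbsent(term, k -> new ArrayList<>()).add(new Posting(docId, pos));
    }

    // Steps 2-4: per-occurrence score factor, 1 for the wanted POS and 0 otherwise.
    static float payloadFactor(String payload, String wantedPos) {
        return payload.equals(wantedPos) ? 1f : 0f;
    }

    // Step 5: sum the factors per document and drop documents that scored zero.
    Map<Integer, Float> search(String term, String wantedPos) {
        Map<Integer, Float> scores = new TreeMap<>();
        for (Posting p : postings.getOrDefault(term, Collections.<Posting>emptyList())) {
            scores.merge(p.docId, payloadFactor(p.posPayload, wantedPos), Float::sum);
        }
        scores.values().removeIf(s -> s <= 0f);
        return scores;
    }

    // A normal search ignores the payload, so the same index still serves it.
    Set<Integer> searchAnyPos(String term) {
        Set<Integer> docs = new TreeSet<>();
        for (Posting p : postings.getOrDefault(term, Collections.<Posting>emptyList())) {
            docs.add(p.docId);
        }
        return docs;
    }

    public static void main(String[] args) {
        PayloadPosSearch idx = new PayloadPosSearch();
        idx.add(1, "love", "VERB");
        idx.add(1, "love", "VERB");
        idx.add(2, "love", "NOUN");
        idx.add(3, "love", "NOUN");
        idx.add(3, "love", "VERB");
        System.out.println(idx.search("love", "VERB")); // {1=2.0, 3=1.0}
        System.out.println(idx.searchAnyPos("love"));   // [1, 2, 3]
    }
}
```

Because the term is stored unmodified, the POS-agnostic `searchAnyPos` works on the same data, which is the compatibility benefit this approach aims for.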

The advantage of this approach is that the index remains compatible with any other normal search.

Cons:

1) Maintenance overhead: you have to maintain a separate IndexSearcher for each Similarity.

2) The solution is somewhat complicated to code.

To be frank, I'm not satisfied with my own solution, but just wanted to let you know that there exists another way. It all depends on your scenario, whether the project is an academic one-time project or a commercial one, etc.

How to get top words by lucene index and search?

Print words in the index - Lucene

Lucene : Search with partial words

search in lucene index

Lucene index inside of a database

Search for a specific term in a Lucene index

Missing hits on lucene index search

Tokenization, and indexing with Lucene, how to handle external tokenize and part-of-speech?

Lucene create index for words with umlauts in stratio

How to get multple words in a search with Lucene 4


The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.


solution1 1 2013-04-13 17:26:11
