Is it possible to search for words inside a Lucene index by part of speech

I have a large set of documents stored inside a Lucene index, and I am using a custom Analyzer which basically does tokenization and stemming of the documents' content.

Now, if I search inside the documents for the word "love", I get results where "love" is used either as a noun or as a verb, while I want only those documents in which "love" is used as a verb.

How can such a feature be implemented, where I could specify the part of speech along with the word, so that the results contain "love" used only as a verb and not as a noun?

I can think of one way: part-of-speech tag each word of the document up front, store each term with its POS tag appended via a '_' or some other delimiter, and then search accordingly. But I wanted to know whether there is a smarter way to do this in Lucene.

I can think of the following approaches.

Approach 1

Just like you mentioned: recognize and append the part-of-speech tag to the actual term while indexing, and do the same while querying.
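
A minimal sketch of the indexing side, assuming a hypothetical PosTagger interface wrapped around whatever tagger you use (OpenNLP, Stanford CoreNLP, etc.); tagging terms one at a time without sentence context is a simplification, so in practice you would obtain the tags where the context is still available:

    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Rewrites each term as "<term>_<POS>", e.g. "love" -> "love_VB".
    public final class PosAppendingFilter extends TokenFilter {

        // Hypothetical wrapper; back it with any real POS tagger.
        public interface PosTagger {
            String tag(String term);   // e.g. returns "VB" or "NN"
        }

        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PosTagger tagger;

        public PosAppendingFilter(TokenStream input, PosTagger tagger) {
            super(input);
            this.tagger = tagger;
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            String term = termAtt.toString();
            termAtt.setEmpty().append(term).append("_").append(tagger.tag(term));
            return true;
        }
    }

Run the same filter in the query-time analyzer so that a search for "love" as a verb is translated into a lookup for the term love_VB.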

I would like to discuss the associated cons.

Cons:

1) Future requirements might demand that you get results irrespective of part of speech, and an index that contains the modified terms won't support that.

2) You might want to execute a BooleanQuery like "term: noun or adjective". You would have to write the query expander yourself (see the sketch below).
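
If you do take this route anyway, the expander itself is short. A sketch assuming the tagged-term convention above, Lucene 5.3 or later (for BooleanQuery.Builder), and a field named "content" (the field name is just an assumption):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // Expands "word as noun or adjective" into a disjunction over the tagged terms.
    public final class PosQueryExpander {
        public static Query expand(String field, String word, String... posTags) {
            BooleanQuery.Builder builder = new BooleanQuery.Builder();
            for (String tag : posTags) {
                builder.add(new TermQuery(new Term(field, word + "_" + tag)),
                            BooleanClause.Occur.SHOULD);
            }
            return builder.build();
        }
    }

    // Usage: Query q = PosQueryExpander.expand("content", "love", "NN", "JJ");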

Approach 2

Try using the Payloads feature of Lucene.

Here is a brief tutorial on Lucene Payloads.

Steps to address your use case:

1) Store the part-of-speech tag in the form of a payload (see the indexing sketch after this list).

2) Have custom Similarity classes for each part-of-speech tag.

3) Based on the query, assign the corresponding CustomSimilarity to the IndexSearcher. For example, assign NounBoostingSimilarity for a noun query.

4) Boost or reduce the score of a document based on the payload; an example is given in the tutorial above.

5) Write a custom Collector to filter out the documents whose scores do not conform to the score-boosting logic above.
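
For step 1, a minimal indexing-side sketch that attaches the tag as a payload instead of rewriting the term (reusing the hypothetical PosTagger from the Approach 1 sketch; the payload here is simply the UTF-8 bytes of the tag):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.util.BytesRef;

    // Stores each term's part-of-speech tag as a payload, so the terms themselves
    // stay unchanged and the index remains usable for ordinary searches.
    public final class PosPayloadFilter extends TokenFilter {

        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
        private final PosAppendingFilter.PosTagger tagger;   // hypothetical tagger from the earlier sketch

        public PosPayloadFilter(TokenStream input, PosAppendingFilter.PosTagger tagger) {
            super(input);
            this.tagger = tagger;
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            String tag = tagger.tag(termAtt.toString());
            payloadAtt.setPayload(new BytesRef(tag.getBytes(StandardCharsets.UTF_8)));
            return true;
        }
    }

Payloads are only written for fields indexed with positions, which is the default for TextField.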

The pro of this approach is that the index remains compatible with any other normal search.
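
On the query side (steps 3 to 5), the exact plumbing depends heavily on the Lucene version: older releases hook payload scoring through a custom Similarity as described above, while recent releases (roughly 7.x/8.x) expose it through PayloadScoreQuery in the lucene-queries module. A sketch of that alternative route, assuming Lucene 8.x package locations and the UTF-8 payload encoding from the previous sketch:

    import java.nio.charset.StandardCharsets;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queries.payloads.MaxPayloadFunction;
    import org.apache.lucene.queries.payloads.PayloadDecoder;
    import org.apache.lucene.queries.payloads.PayloadScoreQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.util.BytesRef;

    // Scores an occurrence 1.0 when its payload equals the wanted POS tag and 0.0
    // otherwise; MaxPayloadFunction keeps the best occurrence per document, so any
    // document where the word appears at least once with the wanted tag scores > 0.
    public final class PosPayloadQueries {

        public static Query posQuery(String field, String word, String wantedTag) {
            BytesRef wanted = new BytesRef(wantedTag.getBytes(StandardCharsets.UTF_8));
            PayloadDecoder decoder = payload ->
                (payload != null && payload.bytesEquals(wanted)) ? 1.0f : 0.0f;
            return new PayloadScoreQuery(
                new SpanTermQuery(new Term(field, word)),
                new MaxPayloadFunction(),
                decoder,
                false);               // drop the span score, keep only the payload factor
        }
    }

    // Usage: Query q = PosPayloadQueries.posQuery("content", "love", "VB");

Note that this query still matches documents whose occurrences all carry a different tag (they just end up with a zero payload factor), so the score-based filtering of step 5, in a custom Collector or by dropping zero-score hits from the TopDocs, is still needed.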

Cons:

1) Maintenance overhead: you have to maintain multiple IndexSearchers, one for each Similarity.

2) A somewhat complicated solution to code.

To be frank, I'm not satisfied with my own solution, but I just wanted to let you know that another way exists. It all depends on your scenario: whether the project is a one-time academic project or a commercial one, and so on.
