How do I do Entity Extraction in Lucene

Original post: 2010-11-29 21:09:55. Tags: lucene, named-entity-extraction

I'm trying to do entity extraction (really more like matching) in Lucene. Here is a sample workflow:

Given some text (from a URL) and a list of people's names, try to extract the names of people from the text.

Note:

Names of people are not completely normalized, e.g. some are Mr. X or Mrs. Y, and some are just John Doe, X and Y. Other prefixes and suffixes to think about are Jr., Sr., Dr., I, II, etc. (don't get me started on non-US names).

I am using Lucene's MemoryIndex to create an in-memory index of the text from each URL (after stripping HTML tags), and I am using StandardAnalyzer to query for every name in the list, one at a time. With 100k names, this takes about 8 seconds on average for the typical text I have. Is there any other way to do this?
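
One alternative to issuing 100k queries per document is to invert the loop: load the name list into a hash map once, then make a single pass over the text and look up every short run of consecutive tokens. The sketch below is plain Java (9 or later), not a Lucene API. The class name `NameDictionaryMatcher`, its methods, and the simple lowercase/split tokenizer are all assumptions made for illustration, and the matching is exact on normalized tokens, so no score threshold is involved.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: dictionary lookup of known names in one pass over the text.
final class NameDictionaryMatcher {
    // Normalized token string -> name as it appears in the input list.
    private final Map<String, String> normalizedToOriginal = new HashMap<>();
    // Longest name in the list, measured in tokens; bounds the window size.
    private int maxTokens = 1;

    NameDictionaryMatcher(Collection<String> names) {
        for (String name : names) {
            List<String> tokens = tokenize(name);
            if (tokens.isEmpty()) {
                continue;
            }
            normalizedToOriginal.put(String.join(" ", tokens), name);
            maxTokens = Math.max(maxTokens, tokens.size());
        }
    }

    // Lowercase, then split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns the listed names found in the text, in order of first appearance.
    Set<String> findNames(String text) {
        List<String> tokens = tokenize(text);
        Set<String> found = new LinkedHashSet<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder window = new StringBuilder();
            for (int len = 1; len <= maxTokens && start + len <= tokens.size(); len++) {
                if (len > 1) {
                    window.append(' ');
                }
                window.append(tokens.get(start + len - 1));
                String original = normalizedToOriginal.get(window.toString());
                if (original != null) {
                    found.add(original);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        NameDictionaryMatcher matcher = new NameDictionaryMatcher(
                List.of("John Doe", "Jane Roe", "Dr. John Edward II"));
        // prints [Dr. John Edward II, John Doe]
        System.out.println(matcher.findNames(
                "Yesterday Dr. John Edward II met John Doe in Boston."));
    }
}
```

The work per document is roughly (number of tokens) x (longest name in tokens) hash lookups, independent of how many names are in the list. It does not by itself handle the title and suffix variations described below.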

A major problem is that, to eliminate noise, I'm using a score of 0.01 as the minimum score. If the text contains "John Doe", a query like "Mr. John Doe" scores significantly lower than "John Doe" and in many cases misses the 0.01 threshold.

The other problem is that if I normalize all names and start removing all occurrences of Dr., Mr., Mrs., etc., then I start missing good matches like "Dr. John Edward II" and end up with a lot of junk matches like "Mr. John Edward".
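
One way to picture a middle ground between matching raw names and stripping every title is to split each name into an optional honorific, a core, and an optional suffix, and to compare on core plus suffix only. The sketch below is an assumption-laden illustration in plain Java (9 or later): the class `PersonNameParts`, its fields, and the tiny honorific and suffix lists are hypothetical and would need extending for real data. Treating "I" as a suffix in particular will misfire on some text.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: separates honorific, core name, and generational suffix.
final class PersonNameParts {
    // Illustrative lists only; real data will need many more entries.
    private static final Set<String> HONORIFICS = Set.of("mr", "mrs", "ms", "dr");
    private static final Set<String> SUFFIXES = Set.of("jr", "sr", "i", "ii", "iii");

    final String honorific; // empty string if none was found
    final String core;      // the name without honorific and suffix
    final String suffix;    // empty string if none was found

    private PersonNameParts(String honorific, String core, String suffix) {
        this.honorific = honorific;
        this.core = core;
        this.suffix = suffix;
    }

    static PersonNameParts parse(String rawName) {
        Deque<String> tokens = new ArrayDeque<>();
        // Lowercase, then split on anything that is not a letter or digit.
        for (String t : rawName.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        // Only peel off a leading honorific or trailing suffix if something remains.
        String honorific = "";
        if (tokens.size() > 1 && HONORIFICS.contains(tokens.peekFirst())) {
            honorific = tokens.pollFirst();
        }
        String suffix = "";
        if (tokens.size() > 1 && SUFFIXES.contains(tokens.peekLast())) {
            suffix = tokens.pollLast();
        }
        return new PersonNameParts(honorific, String.join(" ", tokens), suffix);
    }

    // Key used for comparison: the honorific is ignored, the suffix is kept.
    String matchKey() {
        return suffix.isEmpty() ? core : core + " " + suffix;
    }

    public static void main(String[] args) {
        for (String name : new String[] {
                "Mr. John Doe", "John Doe", "Dr. John Edward II", "Mr. John Edward"}) {
            System.out.println(name + " -> " + parse(name).matchKey());
        }
    }
}
```

Under these assumptions "Mr. John Doe" and "John Doe" share the key "john doe", while "Dr. John Edward II" (key "john edward ii") stays distinct from "Mr. John Edward" (key "john edward"). The honorific is kept as a separate field so that a caller could still use it as a tie-breaker.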

I understand that Lucene might not be the right tool for the job, but so far it hasn't proved to be too bad. Any help appreciated.

5 Answers

Named entity extraction (NEE) is an NLP task that is not part of Lucene. For open source, you can look at LingPipe, GATE and OpenNLP. There are also various commercial alternatives.

GATE is entirely rule-based, and will be hard to use for high precision. You'll need a statistical engine for that; LingPipe has one, but you have to supply the training data. I'm not up to date on what OpenNLP offers in this area.

OpenNLP is useful: http://opennlp.apache.org/

The site has documentation and examples.

For the completely uninitiated, the book Taming Text (http://www.manning.com/ingersoll/) provides a good overview. You can also download the book's source code from the link above.

You can try this: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html

The documentation is clear. You can also use the DBpedia Spotlight web service:

http://spotlight.dbpedia.org/rest/spot/?text=
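
The URL above ends in `text=`, so the text to analyze has to be appended as a URL-encoded query value. A minimal sketch of just that step follows, in plain Java (10 or later for this `URLEncoder.encode` overload). The class and method names are hypothetical, the endpoint string is copied from this answer, and the sketch neither sends the request nor checks that the service still responds at that address.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the request URL only; no network call is made.
final class SpotlightUrlBuilder {
    // Endpoint exactly as quoted in the answer; its availability is not verified here.
    static final String ENDPOINT = "http://spotlight.dbpedia.org/rest/spot/?text=";

    static String buildSpotUrl(String text) {
        // URLEncoder applies form encoding: spaces become '+', reserved characters become %XX.
        return ENDPOINT + URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildSpotUrl("Dr. John Edward II met John Doe & others"));
    }
}
```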

Disambiguation of human names is notoriously difficult. If you have other information, such as locations or co-occurrence of names, it will be valuable. But a lot of work is still going into author disambiguation, and it cannot normally be solved from a list of names alone.

Here is a typical project: http://code.google.com/p/bibapp/wiki/AuthorAuthorities . And a typical publication: http://www.springerlink.com/content/lk07h1m311t130w4/ .

Here is a project on record deduplication that we find useful for author disambiguation: http://datamining.anu.edu.au/projects/linkage.html

These projects could be useful for you:

http://nlp.stanford.edu/ner/index.shtml

http://cogcomp.cs.illinois.edu/page/software_view/4