
Semi-automatic annotation tool - How to find RDF triples

Asked 2012-04-28 21:44:41. Tags: annotations, rdf, named-entity-recognition, named-entity-extraction

I'm developing a semi-automatic annotation tool for medical texts, and I am completely lost on how to find the RDF triples to use for annotation.

I am currently trying an NLP-based approach. I have already looked into Stanford NER and OpenNLP, and neither of them has a model for extracting disease names.

My questions are:

* How can I create a new NER model for extracting disease names, and can I get any help from OpenNLP or Stanford NER?
* Is there another approach altogether, other than NLP, for extracting RDF triples from a text?

Any help would be appreciated! Thanks.

1 Answer

I have done something similar to what you need with both OpenNLP and LingPipe. I found LingPipe's exact dictionary-based chunking good enough for my use case and used that. Documentation is available here: http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html

You can find a small demo here:

https://github.com/castagna/nerdf
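To give a rough idea of what exact dictionary-based chunking does, here is a minimal sketch in plain Python. It does not use LingPipe's API and is not taken from the demo above; the `Chunk` type, the `chunk` function and the toy `DISEASES` dictionary are all made up for illustration.

```python
import re
from typing import NamedTuple


class Chunk(NamedTuple):
    start: int   # character offset where the match begins
    end: int     # character offset just past the match
    text: str    # surface form as it appears in the text
    label: str   # entity type taken from the dictionary


def chunk(text: str, dictionary: dict[str, str]) -> list[Chunk]:
    """Return every exact (case-insensitive) dictionary phrase found in text."""
    if not dictionary:
        return []
    lookup = {phrase.lower(): label for phrase, label in dictionary.items()}
    # Longest phrases first, so "type 2 diabetes" wins over "diabetes".
    phrases = sorted(lookup, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in phrases)
    # Word boundaries stop "asthma" from matching inside "asthmatic".
    matcher = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return [
        Chunk(m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in matcher.finditer(text)
    ]


# Hypothetical toy dictionary; a real one would be built from a medical
# vocabulary of your choice.
DISEASES = {
    "diabetes": "DISEASE",
    "type 2 diabetes": "DISEASE",
    "asthma": "DISEASE",
}

if __name__ == "__main__":
    sentence = "Patients with Type 2 diabetes or asthma were included."
    for found in chunk(sentence, DISEASES):
        print(found)
```

The character offsets are kept because an annotation tool usually needs to point back into the original text, not just list the names it found.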

If a gazetteer/dictionary approach isn't good enough for you, you can try creating your own model; OpenNLP has an API for training models as well. Documentation is here: http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.namefind.training
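Training a name finder model requires annotated sentences. As far as I recall from the OpenNLP manual, the training data is plain text with one whitespace-tokenized sentence per line and entities wrapped in `<START:type> ... <END>` markers; please check the manual for the version you use. The helper below is only a sketch of how you might produce such lines from token-level annotations. The function name `to_opennlp_line` and the `disease` type label are my own choices, not part of OpenNLP.

```python
def to_opennlp_line(tokens: list[str], spans: list[tuple[int, int, str]]) -> str:
    """Render one tokenized sentence as a name finder training line.

    Each span is (first_token_index, end_token_index_exclusive, entity_type).
    Spans are assumed not to overlap.
    """
    starts = {start: label for start, _end, label in spans}
    ends = {end for _start, end, _label in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in ends:
            out.append("<END>")          # close a span that ended before this token
        if i in starts:
            out.append(f"<START:{starts[i]}>")
        out.append(token)
    if len(tokens) in ends:
        out.append("<END>")              # span that runs to the end of the sentence
    return " ".join(out)


if __name__ == "__main__":
    tokens = ["The", "patient", "has", "type", "2", "diabetes", "."]
    print(to_opennlp_line(tokens, [(3, 6, "disease")]))
```

You would write one such line per sentence into a training file and then pass that file to the training API or command-line tool described in the documentation linked above.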

Extracting RDF triples from natural language is a different problem from identifying named entities. NER is a related and perhaps necessary step, but it is not enough. To extract an RDF statement from natural language, you need not only to identify entities such as the subject and the object of a statement, but also to identify the verb and/or relationship between those entities, and then to map all of them to URIs.
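To make those three steps concrete (find the entities, find the relationship, map everything to URIs), here is a deliberately naive sketch. It only recognizes sentences of the form "entity relation entity" with a regular expression, which is far from real relation extraction. Every name in it is hypothetical: the lookup tables, the `extract_triples` and `to_ntriples` functions, and all the `example.org` URIs are placeholders.

```python
import re

# Hypothetical lookup tables; every URI below is a placeholder.
ENTITY_URIS = {
    "smoking": "http://example.org/entity/Smoking",
    "lung cancer": "http://example.org/entity/LungCancer",
    "metformin": "http://example.org/entity/Metformin",
    "type 2 diabetes": "http://example.org/entity/Type2Diabetes",
}
RELATION_URIS = {
    "causes": "http://example.org/relation/causes",
    "treats": "http://example.org/relation/treats",
}


def extract_triples(sentence: str) -> list[tuple[str, str, str]]:
    """Find 'entity relation entity' patterns and map each part to a URI."""
    entity = "|".join(
        re.escape(e) for e in sorted(ENTITY_URIS, key=len, reverse=True)
    )
    relation = "|".join(re.escape(r) for r in RELATION_URIS)
    pattern = re.compile(
        rf"\b({entity})\s+({relation})\s+({entity})\b", re.IGNORECASE
    )
    return [
        (ENTITY_URIS[s.lower()], RELATION_URIS[p.lower()], ENTITY_URIS[o.lower()])
        for s, p, o in pattern.findall(sentence)
    ]


def to_ntriples(triples: list[tuple[str, str, str]]) -> str:
    """Serialize (subject, predicate, object) URI triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(extract_triples("Smoking causes lung cancer.")))
```

In practice the entity step would come from your NER or dictionary chunker, and the relationship step is the hard part: real text rarely puts subject, verb and object next to each other like this.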