
Extending NLP entity extraction

We would like to identify neighborhoods and streets in various cities from a simple search. We don't only use English but also several other languages written in Cyrillic script. We need to be able to identify spelling mistakes in location names. When looking at Python libraries, I found this one: http://polyglot.readthedocs.io/en/latest/NamedEntityRecognition.html

We tried to play around with it, but cannot find a way to extend the entity recognition database. How can that be done?
If not, is there any other suggestion for a multilingual NLP library that can help with spell checking and also extract various entities matching a custom database?

Have a look at HuggingFace's pretrained models.

  1. They have a multilingual NER model trained on 40 languages, including languages written in Cyrillic script such as Russian. It's a fine-tuned version of XLM-RoBERTa, so accuracy seems to be very good. See details here: https://huggingface.co/jplu/tf-xlm-r-ner-40-lang (a usage sketch follows this list).
  2. They also have a multilingual DistilBERT model fine-tuned for typo detection on the GitHub Typo Corpus. The corpus seems to include typos from 15 different languages, including Russian. See details here: https://huggingface.co/mrm8488/distilbert-base-multi-cased-finetuned-typo-detection
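
For the NER model, a minimal usage sketch (my own, not from the model card verbatim): the checkpoint is TensorFlow-based, hence framework="tf", and the example sentence is made up:

from transformers import pipeline

# 40-language NER pipeline; the checkpoint ships TensorFlow weights,
# so we ask the pipeline for the TF framework explicitly.
ner = pipeline("ner",
               model="jplu/tf-xlm-r-ner-40-lang",
               tokenizer="jplu/tf-xlm-r-ner-40-lang",
               framework="tf")

ner("я живу в Москве")  # expect "Москве" to be tagged as a location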

Here is some example code from the typo-detection model's documentation, slightly altered for your use case:

from transformers import pipeline

# Token-classification pipeline that labels each (sub-)token as "ok" or "typo"
typo_checker = pipeline("ner",
                        model="mrm8488/distilbert-base-multi-cased-finetuned-typo-detection",
                        tokenizer="mrm8488/distilbert-base-multi-cased-finetuned-typo-detection")

result = typo_checker("я живу в Мосве")
result[1:-1]  # drop the first and last predictions (the special tokens)

 #[{'word': 'я', 'score': 0.7886862754821777, 'entity': 'ok', 'index': 1},
 #{'word': 'жив', 'score': 0.6303715705871582, 'entity': 'ok', 'index': 2},
 #{'word': '##у', 'score': 0.7259598970413208, 'entity': 'ok', 'index': 3},
 #{'word': 'в', 'score': 0.7102937698364258, 'entity': 'ok', 'index': 4},
 #{'word': 'М', 'score': 0.5045614242553711, 'entity': 'ok', 'index': 5},
 #{'word': '##ос', 'score': 0.560469925403595, 'entity': 'typo', 'index': 6},
 #{'word': '##ве', 'score': 0.8228507041931152, 'entity': 'ok', 'index': 7}]

result = typo_checker("I live in Moskkow")
result[1:-1]

 #[{'word': 'I', 'score': 0.7598089575767517, 'entity': 'ok', 'index': 1},
 #{'word': 'live', 'score': 0.8173692226409912, 'entity': 'ok', 'index': 2},
 #{'word': 'in', 'score': 0.8289134502410889, 'entity': 'ok', 'index': 3},
 #{'word': 'Mo', 'score': 0.7344270944595337, 'entity': 'ok', 'index': 4},
 #{'word': '##sk', 'score': 0.6559176445007324, 'entity': 'ok', 'index': 5},
 #{'word': '##kow', 'score': 0.8762879967689514, 'entity': 'ok', 'index': 6}]

Unfortunately, it doesn't seem to work in every case (note that the misspelled "Moskkow" above is labelled "ok" throughout), but maybe it's sufficient for your use case.

Another option would be spaCy. They don't have as many models for different languages, but with spaCy's EntityRuler it's easy to manually define new entities, i.e. "extend the entity recognition database"; see the sketch below.
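
A minimal sketch of that approach, assuming spaCy 3.x; the two entries ("Арбат", "Невский проспект") stand in for your custom database of neighborhoods and streets:

import spacy

# Start from a blank Russian pipeline and attach an EntityRuler
nlp = spacy.blank("ru")
ruler = nlp.add_pipe("entity_ruler")

# Patterns loaded from your own location database (made-up examples here)
patterns = [
    {"label": "LOC", "pattern": "Арбат"},
    {"label": "LOC", "pattern": [{"LOWER": "невский"}, {"LOWER": "проспект"}]},
]
ruler.add_patterns(patterns)

doc = nlp("Арбат и Невский проспект очень известны")
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Арбат', 'LOC'), ('Невский проспект', 'LOC')]

Note that plain string patterns match exact token forms, so inflected variants (e.g. "Арбате") need additional patterns or lemma-based matching.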
