extending NLP entity extraction
We would like to identify neighborhoods and streets in various cities from a simple search query. We use not only English but also several other languages written in Cyrillic. We also need to be able to detect misspellings of location names. Looking at Python libraries, I found this one: http://polyglot.readthedocs.io/en/latest/NamedEntityRecognition.html
We tried to play around with it, but could not find a way to extend the entity recognition database. How can that be done?
If not, is there any other suggestion for a multilingual NLP library that can help with spell checking and also extract entities matching a custom database?
Have a look at HuggingFace's pretrained models.
Here is some example code from the documentation, slightly altered for your use case:
from transformers import pipeline
typo_checker = pipeline(
    "ner",
    model="mrm8488/distilbert-base-multi-cased-finetuned-typo-detection",
    tokenizer="mrm8488/distilbert-base-multi-cased-finetuned-typo-detection",
)
result = typo_checker("я живу в Мосве")
result[1:-1]
#[{'word': 'я', 'score': 0.7886862754821777, 'entity': 'ok', 'index': 1},
#{'word': 'жив', 'score': 0.6303715705871582, 'entity': 'ok', 'index': 2},
#{'word': '##у', 'score': 0.7259598970413208, 'entity': 'ok', 'index': 3},
#{'word': 'в', 'score': 0.7102937698364258, 'entity': 'ok', 'index': 4},
#{'word': 'М', 'score': 0.5045614242553711, 'entity': 'ok', 'index': 5},
#{'word': '##ос', 'score': 0.560469925403595, 'entity': 'typo', 'index': 6},
#{'word': '##ве', 'score': 0.8228507041931152, 'entity': 'ok', 'index': 7}]
result = typo_checker("I live in Moskkow")
result[1:-1]
#[{'word': 'I', 'score': 0.7598089575767517, 'entity': 'ok', 'index': 1},
#{'word': 'live', 'score': 0.8173692226409912, 'entity': 'ok', 'index': 2},
#{'word': 'in', 'score': 0.8289134502410889, 'entity': 'ok', 'index': 3},
#{'word': 'Mo', 'score': 0.7344270944595337, 'entity': 'ok', 'index': 4},
#{'word': '##sk', 'score': 0.6559176445007324, 'entity': 'ok', 'index': 5},
#{'word': '##kow', 'score': 0.8762879967689514, 'entity': 'ok', 'index': 6}]
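Note that the pipeline returns subword tokens (the `##`-prefixed pieces), so to flag whole words as typos you need to merge the pieces back together. A minimal sketch, using the Russian output shown above as input (the function name `merge_subwords` and the "any piece flagged means the word is a typo" rule are my own choices, not part of the library):

```python
def merge_subwords(tokens):
    """Merge WordPiece tokens ('##' prefix) back into whole words.

    A word is flagged as a typo if any of its pieces is labelled 'typo'.
    Returns a list of (word, is_typo) tuples.
    """
    words = []
    for tok in tokens:
        piece, entity = tok["word"], tok["entity"]
        if piece.startswith("##") and words:
            # Continuation piece: append to the previous word, OR the flags.
            prev_word, prev_flag = words[-1]
            words[-1] = (prev_word + piece[2:], prev_flag or entity == "typo")
        else:
            words.append((piece, entity == "typo"))
    return words

# The token-level output from the Russian example above:
tokens = [
    {"word": "я", "entity": "ok"},
    {"word": "жив", "entity": "ok"},
    {"word": "##у", "entity": "ok"},
    {"word": "в", "entity": "ok"},
    {"word": "М", "entity": "ok"},
    {"word": "##ос", "entity": "typo"},
    {"word": "##ве", "entity": "ok"},
]
print(merge_subwords(tokens))
# → [('я', False), ('живу', False), ('в', False), ('Мосве', True)]
```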
Unfortunately, it doesn't always seem to work, but maybe it's sufficient for your use case.
Another option would be SpaCy. They don't have as many models for different languages, but with SpaCy's EntityRuler it is easy to manually define new entities, i.e. to "extend the entity recognition database".
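A minimal sketch of the EntityRuler approach (spaCy v3 API; the labels and example patterns below are made up for illustration, you would load them from your own location database):

```python
import spacy

# A blank pipeline is enough for rule-based matching; for real use you
# would start from a pretrained model for the relevant language instead.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# Patterns can be plain strings (exact match) or token-level patterns.
patterns = [
    {"label": "NEIGHBORHOOD", "pattern": "Kreuzberg"},
    {"label": "STREET", "pattern": [{"LOWER": "baker"}, {"LOWER": "street"}]},
]
ruler.add_patterns(patterns)

doc = nlp("I live near Baker Street in Kreuzberg")
print([(ent.text, ent.label_) for ent in doc.ents])
# → [('Baker Street', 'STREET'), ('Kreuzberg', 'NEIGHBORHOOD')]
```

The EntityRuler on its own does exact (or token-attribute) matching, so it will not catch misspellings by itself; you would still combine it with a spell-checking step like the typo detector above.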