

NER - Extract long entities - voice chatbot

Posted 2022-12-26 00:51:03. Tags: spacy, named-entity-recognition, rasa, crf

I'm building a voice chatbot that performs a few specific tasks (intents), e.g. translation.
The issue is that I have long entities.
Input from the user: "translate to German The Eminem Show 20th Anniversary launched earlier this year". I need to extract the following entities:

("German", "LanguageTo")
("The Eminem Show 20th Anniversary launched earlier this year", "text")

I tried using spaCy to train a custom NER model, but it does badly on long entities (it does not catch the whole "text" entity). "CRF" and "DIETClassifier" within Rasa are better, but still not really good.

Do you think extracting the long "text" entity is not an NER task? I would be delighted to get any recommendations!

NB: the text I get from the user (as it is a voice chatbot) has no punctuation or casing (the full text is lowercase) and could be much longer than the example I gave.

2 solutions

You're right that this isn't really an NER problem. While in the most general sense NER covers any selection of text from input, many NER models are designed for short proper nouns. A side effect of that is that they're sensitive to where the spans start and end, and have trouble representing long spans.
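To illustrate why long spans are hard for token-level taggers, here is a minimal sketch that converts the question's example into per-token tags. It assumes a BIO tagging scheme, which sequence labelers such as CRFs commonly use; the helper name `to_bio` and the token offsets are made up for illustration and do not describe any specific library's internals.

```python
def to_bio(tokens, spans):
    """Convert (start, end_exclusive, label) token spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # every following token in the span
    return tags


# Lowercase, unpunctuated input, as a speech-to-text front end might produce it.
tokens = ("translate to german the eminem show 20th anniversary "
          "launched earlier this year").split()

# Hypothetical gold annotation: token 2 is the language, tokens 3..11 are the text.
spans = [(2, 3, "LanguageTo"), (3, 12, "text")]

tags = to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token:12} {tag}")
```

Under this scheme the "text" entity covers nine consecutive tokens, and the tagger has to label every one of them correctly, including the exact first and last token, to recover the whole span.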

In the case of spaCy, the spancat component was designed to have less edge sensitivity, and should be a better fit for problems like the one you have. It's still kind of a difficult problem, but it should do better than NER.

Backing up a bit, you might want to consider whether you actually need a model to find things like the language to translate to; you could just use a list of languages, for example. You could also have an inflexible command structure if you have a small number of well-defined commands.
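As a rough sketch of that idea, the snippet below combines a list of languages with one rigid command pattern, "translate to <language> <text>". The `LANGUAGES` table, the pattern, and the function name `parse_translate_command` are assumptions made for illustration; a real bot would need its own language list and one pattern per command.

```python
import re

# Hypothetical lookup table: spoken language name -> ISO 639-1 code.
LANGUAGES = {
    "german": "de",
    "french": "fr",
    "spanish": "es",
    "italian": "it",
}

# Rigid command structure: "translate to|into <language> <text to translate>".
_COMMAND = re.compile(
    r"^translate\s+(?:in)?to\s+(?P<lang>"
    + "|".join(map(re.escape, LANGUAGES))
    + r")\s+(?P<text>.+)$",
    re.IGNORECASE | re.DOTALL,
)


def parse_translate_command(utterance):
    """Return [(value, label), ...] for a translate command, or None if it does not match."""
    match = _COMMAND.match(utterance.strip())
    if match is None:
        return None
    return [
        (match.group("lang"), "LanguageTo"),
        (match.group("text"), "text"),
    ]


utterance = "translate to german the eminem show 20th anniversary launched earlier this year"
print(parse_translate_command(utterance))
# [('german', 'LanguageTo'), ('the eminem show 20th anniversary launched earlier this year', 'text')]
```

Everything after the language name is treated as the text to translate, so the length of the text and the missing punctuation or casing do not matter. The trade-off is that users have to phrase the command exactly this way.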

I would recommend you use Whisper from OpenAI. It automatically adds punctuation where appropriate, so you could likely use that to separate the entity from the text. You could also use POS tagging from spaCy to detect parts of speech and extract the language.