
How to handle spelling mistakes (typos) in entity extraction in Rasa NLU?

I have a few intents in my training set (nlu_data.md file), with a sufficient number of training examples under each intent. Following is an example:

##intent: SEARCH_HOTEL
- find good [hotel](place) for me in Mumbai

I have added multiple sentences like this. At testing time, all the sentences from the training file work fine. But if an input query contains a spelling mistake, e.g. hotol/hetel/hotele for the hotel keyword, then Rasa NLU is unable to extract it as an entity.

I want to resolve this issue. I am only allowed to change the training data, and I am also restricted from writing any custom component for this.

To handle spelling mistakes like this in entities, you should add these examples to your training data. So something like this:

##intent: SEARCH_HOTEL
 - find good [hotel](place) for me in Mumbai 
 - looking for a [hotol](place) in Chennai
 - [hetel](place) in Berlin please

Once you've added enough examples, the model should be able to generalise from the sentence structure.

If you're not using it already, it also makes sense to use the character-level CountVectorsFeaturizer. That should be in the default pipeline described on this page already.
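For reference, a minimal sketch of what that could look like in config.yml (assuming a recent Rasa version with DIETClassifier; the surrounding components are illustrative, not a required pipeline):

pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer            # word-level features
  - name: CountVectorsFeaturizer            # character-level features, more robust to typos
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
    epochs: 100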

One thing I would highly suggest is to use look-up tables with fuzzywuzzy matching. If you have a limited number of entities (like country names), look-up tables are quite fast, and fuzzy matching catches typos when that entity exists in your look-up table (searching for typo variations of those entities). There's a whole blog post about it here: on Rasa. There's a working implementation of fuzzywuzzy as a custom component:

import json
import os

# The imports below were omitted in the original snippet and are assumed from context.
# Older Rasa NLU versions expose the Component base class as rasa_nlu.components.Component.
from rasa.nlu.components import Component
# fuzzywuzzy's process module is imported under an alias so it does not clash
# with the process() method defined on the component.
from fuzzywuzzy import process as fuzzy_process
# STOP_WORDS is just a set of stop words from NLTK
from nltk.corpus import stopwords
STOP_WORDS = set(stopwords.words("english"))


class FuzzyExtractor(Component):
    name = "FuzzyExtractor"
    provides = ["entities"]
    requires = ["tokens"]
    defaults = {}
    language_list = ["en"]
    threshold = 90

    def __init__(self, component_config=None, *args):
        super(FuzzyExtractor, self).__init__(component_config)

    def train(self, training_data, cfg, **kwargs):
        pass

    def process(self, message, **kwargs):

        entities = list(message.get('entities') or [])

        # Get file path of lookup table in json format
        cur_path = os.path.dirname(__file__)
        if os.name == 'nt':
            partial_lookup_file_path = '..\\data\\lookup_master.json'
        else:
            partial_lookup_file_path = '../data/lookup_master.json'
        lookup_file_path = os.path.join(cur_path, partial_lookup_file_path)

        with open(lookup_file_path, 'r') as file:
            lookup_data = json.load(file)['data']

        tokens = message.get('tokens')

        for token in tokens:

            # Skip stop words so only content words are fuzzy-matched
            if token.text not in STOP_WORDS:

                fuzzy_results = fuzzy_process.extract(
                    token.text,
                    lookup_data,
                    processor=lambda a: a['value'] if isinstance(a, dict) else a,
                    limit=10)

                for result, confidence in fuzzy_results:
                    if confidence >= self.threshold:
                        entities.append({
                            "start": token.offset,
                            "end": token.end,
                            "value": token.text,
                            "fuzzy_value": result["value"],
                            "confidence": confidence,
                            "entity": result["entity"]
                        })

        message.set("entities", entities, add_to_output=True)

But I didn't implement it; it was implemented and validated here: Rasa forum. Then you just pass it to your NLU pipeline in the config.yml file.
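As a sketch of that last step (assuming the class above is saved as fuzzy_extractor.py on your Python path; the module name is illustrative, and older Rasa NLU versions reference custom components by module path):

language: en
pipeline:
  - name: WhitespaceTokenizer
  - name: "fuzzy_extractor.FuzzyExtractor"   # custom component, referenced by module path
  - name: CountVectorsFeaturizer
  - name: EmbeddingIntentClassifier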

It's a strange request that they ask you not to change the code or write custom components.

The approach you would have to take would be to use entity synonyms. A slight edit on a previous answer:

 ##intent: SEARCH_HOTEL
 - find good [hotel](place) for me in Mumbai 
 - looking for a [hotol](place:hotel) in Chennai
 - [hetel](place:hotel) in Berlin please

This way, even if the user enters a typo, the correct entity will be extracted. If you want this to be foolproof, I do not recommend hand-editing the intents. Use some kind of automated tool for generating the training data, e.g. one that generates misspelled words (typos), as sketched below.
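As an illustration of such a generator (the typo rules and templates here are hypothetical, not taken from the linked tool), a small script could emit synonym-annotated training lines:

import random

def typo_variants(word):
    """Simple typo variants: adjacent swaps, deletions, and vowel substitutions."""
    variants = set()
    for i in range(len(word) - 1):                 # adjacent swaps, e.g. "hotel" -> "hotle"
        variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    for i in range(len(word)):                     # deletions, e.g. "hotel" -> "htel"
        variants.add(word[:i] + word[i + 1:])
    for i, ch in enumerate(word):                  # vowel substitutions, e.g. "hotel" -> "hotol"
        if ch in "aeiou":
            variants.update(word[:i] + v + word[i + 1:] for v in "aeiou" if v != ch)
    variants.discard(word)
    return variants

templates = [
    "- find good [{typo}](place:{word}) for me in Mumbai",
    "- looking for a [{typo}](place:{word}) in Chennai",
    "- [{typo}](place:{word}) in Berlin please",
]

for typo in sorted(typo_variants("hotel"))[:10]:   # keep the sample small
    print(random.choice(templates).format(typo=typo, word="hotel"))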

First of all, add samples for the most common typos for your entities, as advised here.

Beyond this, you need a spellchecker.

I am not sure whether there is a single library that can be used in the pipeline, but if not, you need to create a custom component. Otherwise, dealing with only the training data is not feasible. You can't create samples for every typo. Using fuzzywuzzy is one option; it is generally slow and it doesn't solve all the issues. A universal encoder is another solution. There are more options for spell correction, but either way you will need to write code.
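For instance, a minimal sketch using the pyspellchecker library to correct user text before it reaches the NLU model (the library choice and the correct_text helper are assumptions, not part of the original answer):

from spellchecker import SpellChecker   # pip install pyspellchecker

spell = SpellChecker()

def correct_text(text):
    """Replace words the spellchecker does not recognise with its best guess."""
    words = text.split()
    misspelled = spell.unknown(words)
    corrected = []
    for w in words:
        if w in misspelled:
            corrected.append(spell.correction(w) or w)   # fall back to the original word
        else:
            corrected.append(w)
    return " ".join(corrected)

print(correct_text("find good hotol for me in Mumbai"))
# likely prints: find good hotel for me in Mumbai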
