
Python, compare strings in Counter and assign to closest match

I have a list of input text written by humans. This text is imported into Python and a Counter is generated from it. In the Counter, all the human inputs are listed and counted. At the end I obtain something like:

"Input 1" : 3,

"Input 2" : 1, ...

The problem I have is that sometimes these inputs have spelling mistakes, are missing a space between words, etc. How could I go through this list, compare each entry to some reference inputs, and assign to each reference row the total count of the correctly written inputs plus the counts of the most similar misspelled inputs? I know this falls within the NLP field, but I can't really find a way to do this with a Counter.

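Without applying any ML, my first attempt would be to use Levenshtein Distance. It gives you a concrete measure of similarity between strings and lets you make an educated guess about which "error-free" string a string with typos belongs to.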
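As a rough illustration of that idea, here is a minimal sketch in pure Python. The names `levenshtein`, `merge_counts`, `references`, and the `max_distance` threshold are assumptions made for this example and do not come from the question. The sample data is invented. Matching is case-insensitive, which is also an assumption.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def merge_counts(counts, references, max_distance=2):
    """Add each entry's count to its closest reference string.

    Entries farther than max_distance from every reference are
    returned separately instead of being forced into a bucket.
    """
    merged = Counter({ref: 0 for ref in references})
    unmatched = Counter()
    for text, n in counts.items():
        # On ties, min() keeps the first reference in the list.
        best = min(references,
                   key=lambda ref: levenshtein(text.lower(), ref.lower()))
        if levenshtein(text.lower(), best.lower()) <= max_distance:
            merged[best] += n
        else:
            unmatched[text] += n
    return merged, unmatched


# Hypothetical data in the shape described in the question.
raw = Counter({"Input 1": 3, "Input1": 2, "Inptu 1": 1,
               "Input 2": 1, "something else": 4})
references = ["Input 1", "Input 2"]

merged, unmatched = merge_counts(raw, references)
print(merged)
print(unmatched)
```

The threshold needs tuning for real data. When the reference strings are themselves only one or two edits apart, as "Input 1" and "Input 2" are here, a typo can be equally close to both, and this sketch simply picks whichever reference comes first in the list.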
