
What is the best way to compare strings to find matching words in Python?

I have two texts, text A and text B. Text B isn't an exact copy of text A: it has a lot of special characters which aren't in text A, but it is technically the same text. I need to compare the strings and map the words in text B to their counterparts in text A.

The text isn't in English, and can't easily be translated to English, so the examples below are just to demonstrate a few of the problems.

Some words in text A are not in text B, but all words in text B should be in text A:

text_a = "he experienced déjà vu"
text_b = ['he', 'experienced']

Some words in text B use different characters from text A, but are the same words:

text_a = "she owns & runs the cafe florae"
text_b = ['she', 'owns', 'and', 'runs', 'the', 'cefé', 'floræ']

Words in text B are generally in the right order, but not always:

text_a = "an uneasy alliance"
text_b = ['uneasy', 'alliance', 'an']

Some words in text B are made up of smaller components, which are also included in text B, and these smaller components are unnecessary:

text_a = "we should withdraw our claim"
text_b = ['we', 'should', 'with', 'draw', 'withdraw', 'our', 'claim']

Some words in text A are represented by two or more words in text B:

text_a = "they undercut their competitors"
text_b = ['they', 'under', 'cut', 'their', 'competitors']

What I want to do is replace the words in text A with their counterparts from text B. To do this I need to write a function to match words between the two texts.

I've tried writing a function that compares strings using the edit distance method from the nltk library in conjunction with a handful of regular expressions. This only does an OK job, so I looked at using sequence alignment techniques from libraries like biopython, but I can't get my head around them.
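My attempt looked roughly like the sketch below (simplified; the helper name closest_match is made up, and it assumes nltk is installed):

from nltk.metrics.distance import edit_distance

def closest_match(word, candidates):
    # Pick the candidate from text B with the smallest edit distance to `word`.
    # Ties are broken arbitrarily, which is exactly the problem described below.
    return min(candidates, key=lambda c: edit_distance(word, c))

text_a = "she owns & runs the cafe florae"
text_b = ['she', 'owns', 'and', 'runs', 'the', 'cefé', 'floræ']

matches = [[w, closest_match(w, text_b)] for w in text_a.split()]
# Short tokens like '&' tie with several candidates, so the result is unreliable.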

In particular, while using edit distance it's very difficult to match words like 'under' and 'cut' to 'undercut', while also avoiding errors in short strings. This is because in a sentence containing similar tokens, like 'to' and 'tu', these tokens have the same edit distance from something like 'tú', and would theoretically be equally valid candidates, though the obvious match here would be 'tu', not 'to'.
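To make the tie concrete (again using nltk's edit_distance):

from nltk.metrics.distance import edit_distance

# Both differ from 'tú' by a single substitution, so plain edit distance
# can't say which one is the intended match.
edit_distance('to', 'tú')  # 1
edit_distance('tu', 'tú')  # 1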

Is there any highly accurate way to match strings from text B in text A? I'd like to get an output like:

text_a = "the cafe florae undercut their competitors then withdrew their claim"
text_b = ['the', 'café', 'floræ', 'under', 'cut', 'their', 'competitors', 'then',
          'with', 'drew', 'withdrew', 'their', 'claim']

match_list = some_matchfunc(text_a, text_b)

print(match_list)

[['the', 'the'], ['cafe', 'café'], ['florae', 'floræ'], ['undercut', 'under'],
 ['undercut', 'cut'], ['their', 'their'], ['competitors', 'competitors'], ['then', 'then'],
 ['withdrew', 'withdrew'], ['their', 'their'], ['claim', 'claim']]

Ideally, this would also include the index for the beginning and end of each matched word in text A, to avoid confusion, like with the word "their" which occurs twice below:

[['the', [0, 3], 'the'], ['cafe', [4, 8], 'café'], ['florae', [9, 15], 'floræ'],
 ['undercut', [16, 24], 'under'], ['undercut', [16, 24], 'cut'], ['their', [25, 30], 'their'],
 ['competitors', [31, 42], 'competitors'], ['then', [43, 47], 'then'], ['withdrew', [48, 56], 'withdrew'],
 ['their', [57, 62], 'their'], ['claim', [63, 68], 'claim']]
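For reference, getting those start/end indices for the words in text A is straightforward with re.finditer (the function name word_spans is made up):

import re

def word_spans(text):
    # Return each word together with its [start, end) character span in the text.
    return [[m.group(), [m.start(), m.end()]] for m in re.finditer(r"\S+", text)]

text_a = "the cafe florae undercut their competitors then withdrew their claim"
word_spans(text_a)
# [['the', [0, 3]], ['cafe', [4, 8]], ['florae', [9, 15]], ['undercut', [16, 24]], ...]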

As mentioned above, the text isn't in English, and translating it to compare words using NLP techniques isn't really feasible, so it does need to be based on string comparison. I think there must be some method or library already in existence which employs a more efficient sequence alignment algorithm than I can come up with using RegEx and edit distance, but I can't find one.

Does anybody know of a highly accurate method for comparing strings to achieve this outcome?

The problem itself is very complex, and I would suggest combining a dictionary of suitable synonyms where possible, falling back to a sequence alignment approach otherwise. The implementations in biopython are probably not really suitable for this case (BLAST, for instance, relies on a scoring matrix that makes sense for nucleotide or amino acid sequences, not for real words). I suggest you have a look at SequenceMatcher, which could do the job. A very simple (albeit naive) solution is to do a pairwise alignment of all the candidates and pick the closest match. How well this works depends on the complexity of the alignment, e.g. whether gaps/replacements are required (imagine "they're" -> "they are").
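A minimal sketch of that naive pairwise approach (assuming difflib's SequenceMatcher; the helper name best_candidate is made up, and it simply picks the candidate with the highest similarity ratio):

from difflib import SequenceMatcher

def best_candidate(word, candidates):
    # Score every candidate from text B against `word` and keep the most similar one.
    return max(candidates, key=lambda c: SequenceMatcher(None, word, c).ratio())

text_a = "she owns & runs the cafe florae"
text_b = ['she', 'owns', 'and', 'runs', 'the', 'cefé', 'floræ']

# Letter-level similarity handles pairs like 'cafe'/'cefé' and 'florae'/'floræ'
# reasonably well, but '&' vs 'and' shares no characters at all, which is
# where the dictionary of substitutions mentioned below comes in.
[[w, best_candidate(w, text_b)] for w in text_a.split()]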

Bear in mind that in some cases many-to-many, one-to-many, and many-to-one substitutions will be required (you have some of these in your example already). This is not automatically solved by the sequence alignment, hence my suggestion to use a dictionary (a bidirectional one if you can afford it). If the synonym corpus is very big, I would even consider a database for such tasks.
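The many-to-one case ('under' + 'cut' -> 'undercut') can often be handled without a dictionary at all, simply by checking whether the concatenation of adjacent tokens occurs in text A; a rough sketch (the function name merge_adjacent is made up):

def merge_adjacent(tokens, vocabulary):
    # Merge adjacent tokens whose concatenation is a word that occurs in text A,
    # e.g. 'under' + 'cut' -> 'undercut'.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] in vocabulary:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

text_b = ['they', 'under', 'cut', 'their', 'competitors']
merge_adjacent(text_b, set("they undercut their competitors".split()))
# ['they', 'undercut', 'their', 'competitors']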

Also, some of your examples require word-level substitutions and some require letter-level substitutions. I suggest that you handle these separately. If you don't have to deal with typos, I would start with the bigger (word) scale, and then move on to the letter-level substitutions.
