
How can I match and print out elements of strings within a list that do not match?

My problem

I have two lists, 'predicted' and 'reference'. Each list contains strings, the first being the predicted elements output by my model and the second being the gold standard. I want to build an automatic error classifier, but I can't figure out how to compare each character within each string within each list. I can compare word-wise (code included below), but I want to compare character by character.

Below is the code for my word-wise comparer, along with the lists of data I'm working with. NB: outside of this toy example, these lists are about 3000 items long.

predicted = ['r * a k t\n', 'd * o u l\n', 'm * i s l\n', 'p * i . v @ p\n']
reference = ['r A k t\n', 'd * o u b\n', 'm * i s l\n', 'i * p . v @ t\n']

########### word-wise finder ##############
p = set(predicted)
r = set(reference)
errors = p - r

print(errors)

My code above gives me:

'r * a k t\n', 'd * o u l\n', 'p * i . v @ p\n'

My dream would be to have a returned list that looks like this:

['* a', 'l', 'p * i', 'p']

because I can then look at each element and classify the mistake that was made. Any advice is appreciated.

My best guess is that you are looking for a character-by-character diff of each pair of words.

Assuming that you're looking for a small difference and that the order of the characters matters, difflib (https://docs.python.org/3/library/difflib.html) provides a SequenceMatcher that implements a suitable algorithm. Its output is a little confusing.

import difflib
print(difflib.SequenceMatcher(a='r * a k t\n', b='r A k t\n').get_opcodes())
# printed: [('equal', 0, 2, 0, 2), ('replace', 2, 5, 2, 3), ('equal', 5, 10, 3, 8)]

This literally means that the characters at indices range(0, 2), i.e. [0, 1], are the same in both strings. That is, 'r ' matches.

Then the characters at indices range(2, 5), i.e. [2, 3, 4], in the first string have to be replaced by the characters at indices range(2, 3), i.e. [2], in the second string. So '* a' gets replaced with 'A'.

And then the characters at indices range(5, 10), i.e. [5, 6, 7, 8, 9], in the first string match the characters at indices range(3, 8), i.e. [3, 4, 5, 6, 7], in the second string. In other words, ' k t\n' matches.
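To make the opcodes easier to read, each index range can be turned back into the substring it refers to. The following is a small sketch of that idea; the helper name describe_opcodes is made up for illustration and is not part of difflib.

```python
import difflib

def describe_opcodes(a, b):
    """Pair each opcode tag with the substrings of a and b it covers.

    Hypothetical helper for illustration; only SequenceMatcher and
    get_opcodes() come from difflib.
    """
    matcher = difflib.SequenceMatcher(a=a, b=b)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]

decoded = describe_opcodes('r * a k t\n', 'r A k t\n')
for tag, left, right in decoded:
    # repr() keeps spaces and the trailing newline visible
    print(f'{tag:8} {left!r} -> {right!r}')
```

Each printed line shows the tag followed by the slice of the first string and the slice of the second string, so the 'replace' row reads '* a' -> 'A'.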

For the format that you seem to be looking for (stuff in the first list that is not in the second), it suffices to look only at the opcodes replace and delete. The other two opcodes are equal and insert.
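Putting this together, here is a minimal sketch that collects the replaced and deleted spans for every pair. It assumes the two lists are aligned, so that predicted[i] corresponds to reference[i]; the function name char_errors is hypothetical.

```python
import difflib

predicted = ['r * a k t\n', 'd * o u l\n', 'm * i s l\n', 'p * i . v @ p\n']
reference = ['r A k t\n', 'd * o u b\n', 'm * i s l\n', 'i * p . v @ t\n']

def char_errors(pred, ref):
    """Return the spans of pred that are tagged 'replace' or 'delete'.

    Hypothetical helper; it relies only on difflib.SequenceMatcher.
    """
    matcher = difflib.SequenceMatcher(a=pred, b=ref)
    return [pred[i1:i2]
            for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
            if tag in ('replace', 'delete')]

errors = []
# Assumption: the lists are index-aligned, so zip pairs each
# prediction with its own gold-standard string.
for pred, ref in zip(predicted, reference):
    errors.extend(char_errors(pred, ref))

print(errors)
```

On the toy data this yields '* a' and 'l' for the first two pairs and nothing for the third. For the last pair the matcher treats the shared ' * ' as equal, so it reports 'p', 'i' and 'p' as three separate spans rather than the single 'p * i' from the question; adjacent spans would need to be merged in a further step if that grouping matters.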
