How to match and print the non-matching string elements in lists?

Question

My problem

I have two lists, 'predicted' and 'reference'. Each list contains strings: the first holds the predicted elements output by my model, and the second holds the gold standard. I want to build an automatic error classifier, but I can't figure out how to compare each character within each string within each list. I can compare word-wise (code included below), but I want to look character by character.

Below is the code for my word-wise comparer, along with the lists of data I'm working with. NB: outside of this toy example, these lists are about 3000 items long.

predicted = ['r * a k t\n', 'd * o u l\n', 'm * i s l\n', 'p * i . v @ p\n']
reference = ['r A k t\n', 'd * o u b\n', 'm * i s l\n', 'i * p . v @ t\n']

########### word-wise finder ##############
p = set(predicted)
r = set(reference)
errors = p - r

print(errors)

My code above returns:

'r * a k t\n', 'd * o u l\n', 'p * i . v @ p\n'

My dream would be to have a returned list that looks like this:

['* a', 'l', 'p * i', 'p']

because I can then look at each element and classify the mistake that was made. Any advice is appreciated.

Answer 1

My best guess is that you are looking for a character-by-character diff of each pair of words.

Assuming that you're looking for a small difference and that the order of the characters matters, https://docs.python.org/3/library/difflib.html provides a SequenceMatcher that implements a suitable algorithm (it aims for matches that look right to people rather than a guaranteed minimal edit sequence). Its output is a little confusing.

import difflib
print(difflib.SequenceMatcher(a='r * a k t\n', b='r A k t\n').get_opcodes())
# printed: [('equal', 0, 2, 0, 2), ('replace', 2, 5, 2, 3), ('equal', 5, 10, 3, 8)]

This literally means that the characters at indices range(0, 2), i.e. [0, 1], are the same in both strings. That is, 'r ' matches.

Then the characters at indices range(2, 5), i.e. [2, 3, 4], in the first string have to be replaced by the characters at indices range(2, 3), i.e. [2], in the second string. So '* a' gets replaced with 'A'.

And then the characters at indices range(5, 10), i.e. [5, 6, 7, 8, 9], in the first string match the characters at indices range(3, 8), i.e. [3, 4, 5, 6, 7], in the second string. In other words, ' k t\n' matches.
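
To make the opcodes easier to read, one option is to pair each tag with the substrings it covers instead of the raw indices. This is only a sketch; the helper name describe_opcodes is made up for illustration and is not part of difflib.

```python
import difflib

def describe_opcodes(a, b):
    """Return (tag, a_slice, b_slice) triples for the diff of a against b.

    Hypothetical helper: it only slices the two strings with the index
    ranges that SequenceMatcher.get_opcodes() reports.
    """
    matcher = difflib.SequenceMatcher(a=a, b=b)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()]

# Show each operation with the text it refers to in both strings.
for tag, a_part, b_part in describe_opcodes('r * a k t\n', 'r A k t\n'):
    print(f"{tag:8} {a_part!r} -> {b_part!r}")
```

For the pair above, this lists 'r ' as equal, '* a' replaced by 'A', and ' k t\n' as equal.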

For the format that you seem to be looking for (stuff in the first list that is not in the second), it suffices to look only for the opcodes replace and delete. The other two opcodes are equal and insert.
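
A minimal sketch of that filtering, assuming the two lists are aligned so that predicted[i] corresponds to reference[i] (the set-based approach in the question loses that pairing). The function name char_errors is made up for illustration.

```python
import difflib

def char_errors(predicted, reference):
    """Collect the pieces of each predicted string that differ from its reference.

    Assumes predicted[i] and reference[i] describe the same item.
    """
    errors = []
    for pred, ref in zip(predicted, reference):
        matcher = difflib.SequenceMatcher(a=pred, b=ref)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # 'replace' and 'delete' mark text present in the prediction
            # that is changed or absent in the reference.
            if tag in ("replace", "delete"):
                errors.append(pred[i1:i2])
    return errors

predicted = ['r * a k t\n', 'd * o u l\n', 'm * i s l\n', 'p * i . v @ p\n']
reference = ['r A k t\n', 'd * o u b\n', 'm * i s l\n', 'i * p . v @ t\n']
print(char_errors(predicted, reference))
```

On the toy data this gives ['* a', 'l', 'p', 'i', 'p'] rather than the desired ['* a', 'l', 'p * i', 'p']. In the last pair the shared ' * ' is reported as equal, so 'p' and 'i' come out as separate replacements. Merging nearby spans would be a further step on top of this.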
