How can I match and print out elements of strings within a list that do not match?
I have two lists, 'predicted' and 'reference'. Each list contains strings, the first one being the predicted elements output by my model, and the latter being the gold standard.
I want to build an automatic error classifier, but can't figure out how to compare each character within each string within each list. I can compare word-wise (code included below), but I want to look character by character.
Below is the code for my word-wise comparer, along with the lists of data I'm working with. NB: outside of this toy example, these lists are about 3000 items long.
predicted = ['r * a k t\n', 'd * o u l\n', 'm * i s l\n', 'p * i . v @ p\n']
reference = ['r A k t\n', 'd * o u b\n', 'm * i s l\n', 'i * p . v @ t\n']
########### word-wise finder ##############
p = set(predicted)
r = set(reference)
errors = p - r  # predicted strings that appear nowhere in reference
print(errors)
My code above returns me:
'r * a k t\n', 'd * o u l\n', 'p * i . v @ p\n'
My dream would be to have a returned list that looks like this:
['* a', 'l', 'p * i', 'p']
because I can then look at each element and classify the mistake it's made. Any advice is appreciated.
My best guess is that you are looking for a character-by-character diff of each pair of words.
Assuming that you're looking for a minimal difference and the order of the characters matters, difflib (https://docs.python.org/3/library/difflib.html) provides a SequenceMatcher that implements the right kind of algorithm. Note that, per the difflib documentation, it does not guarantee a minimal edit sequence, but it tends to yield matches that look right to people. Its output is a little confusing.
import difflib
print(difflib.SequenceMatcher(a='r * a k t\n', b='r A k t\n').get_opcodes())
# printed: [('equal', 0, 2, 0, 2), ('replace', 2, 5, 2, 3), ('equal', 5, 10, 3, 8)]
Which literally means that the characters at indices range(0, 2), i.e. [0, 1], are the same in each string. (That is, 'r ' matches.)
Then the characters at range(2, 5), i.e. [2, 3, 4], in the first string have to be replaced by the characters at range(2, 3), i.e. [2], in the second string. So '* a' gets replaced with 'A'.
And then the characters at range(5, 10), i.e. [5, 6, 7, 8, 9], in the first string match the characters at range(3, 8), i.e. [3, 4, 5, 6, 7], in the second string. In other words, ' k t\n' matches.
For the format that you seem to be looking for (stuff in the first list not in the second), it suffices to look only at the opcodes replace and delete. The other two opcodes are equal and insert.
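Putting that together, a minimal sketch could look like the following. It assumes the two lists are aligned, so that predicted[i] is meant to be compared with reference[i]; the function name error_fragments is invented for this example.

```python
import difflib

def error_fragments(predicted, reference):
    """Collect the parts of each predicted string that differ from its reference.

    Assumes predicted[i] corresponds to reference[i]; error_fragments is a
    hypothetical helper name, not part of difflib.
    """
    fragments = []
    for pred, ref in zip(predicted, reference):
        matcher = difflib.SequenceMatcher(a=pred, b=ref)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # 'replace' and 'delete' are the opcodes that mark text which is
            # present in the prediction but not in the reference.
            if tag in ('replace', 'delete'):
                fragments.append(pred[i1:i2])
    return fragments

predicted = ['r * a k t\n', 'd * o u l\n', 'm * i s l\n', 'p * i . v @ p\n']
reference = ['r A k t\n', 'd * o u b\n', 'm * i s l\n', 'i * p . v @ t\n']

print(error_fragments(predicted, reference))
```

On the toy data this gives ['* a', 'l', 'p', 'i', 'p'] rather than the hoped-for ['* a', 'l', 'p * i', 'p']: the ' * ' between the swapped 'p' and 'i' is common to both strings, so the matcher reports the two swapped characters as separate replacements. If you want them grouped, you would need to merge fragments that are separated only by a short matching run, which is a design decision this sketch leaves open.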