Is it possible to make a confusion matrix from character pairs?
I am working on optical character recognition, and I will try to summarize my issue as best I can:
I have a reference sentence and a predicted sentence.
Using the Levenshtein editops function, I built a list of tuples. Each tuple contains: the operation type (insert, replace, delete), the affected character in the reference sequence, the corresponding character in the predicted sequence, and finally the number of times this change occurs across the whole reference sentence (in fact, the maximum number of occurrences of each error pair).
[(('insert', 'e', 'm'), 11), (('insert', 't', 'a'), 8), (('insert', 'r', 'o'), 5), (('replace', 'a', 'e'), 2), (('replace', 't', 'T'), 1), (('replace', 'r', 'R'), 1), (('replace', 'M', 'm'), 1), (('delete', ' ', 'a'), 1), (('replace', 'p', 'o'), 1), (('replace', 't', 'a'), 1), (('replace', 'e', 'e'), 1), (('replace', ' ', 'r'), 1), (('insert', ' ', 'd'), 1), (('replace', ' ', 'd'), 1), (('replace', 'i', 'e'), 1), (('replace', 'l', 's'), 1)]
Desired output example:
            Predicted
Reference     e    m    t    a    r   ...
e             1   11    0    0    0
m             0    0    0    0    0
t             0    0    0    8    0
a             2    0    0    0    0
r             0    0    0    0    0
...
or like this (without labels):
[[ 1 11  0  0  0]
 [ 0  0  0  0  0]
 [ 0  0  0  8  0]
 [ 2  0  0  0  0]
 [ 0  0  0  0  0]]
Note: in these example 'matrices', the value 0 is used by default when a character error pair is not encountered.
Any pointers on how to solve this? Thanks in advance.
I would do this with Counters:
operations = [(('insert', 'e', 'm'), 11), (('insert', 't', 'a'), 8), (('insert', 'r', 'o'), 5), (('replace', 'a', 'e'), 2), (('replace', 't', 'T'), 1), (('replace', 'r', 'R'), 1), (('replace', 'M', 'm'), 1), (('delete', ' ', 'a'), 1), (('replace', 'p', 'o'), 1), (('replace', 't', 'a'), 1), (('replace', 'e', 'e'), 1), (('replace', ' ', 'r'), 1), (('insert', ' ', 'd'), 1), (('replace', ' ', 'd'), 1), (('replace', 'i', 'e'), 1), (('replace', 'l', 's'), 1)]
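from collections import defaultdict, Counter

# Map each reference character to a Counter of predicted characters.
intermediary = defaultdict(Counter)
# Note: plain assignment means a later entry for the same pair overwrites an earlier one.
for (_, src, tgt), count in operations:
    intermediary[src][tgt] = count
# All characters seen on either side, sorted, label both rows and columns.
letters = sorted({key for inner in intermediary.values() for key in inner} | set(intermediary.keys()))
# Rows are reference characters, columns are predicted characters.
confusion_matrix = [[intermediary[src][tgt] for tgt in letters] for src in letters]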
The result looks something like this:
[[0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 11, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
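One thing to watch: in the sample data the pair ('t', 'a') appears under both 'insert' and 'replace', and with plain assignment only the last count is kept (hence the 1 in the last row above). If the counts should instead be summed across operation types, something along these lines might work. This is only a sketch on a shortened sample of the data; the names pair_counts, labels and matrix are made up for illustration, and whether summing is the right behaviour depends on what the matrix is meant to show.

```python
from collections import Counter

# Shortened sample of the (operation, reference_char, predicted_char) counts
# from the question; the full list works the same way.
operations = [
    (('insert', 'e', 'm'), 11),
    (('insert', 't', 'a'), 8),
    (('replace', 'a', 'e'), 2),
    (('replace', 't', 'a'), 1),
    (('replace', 'e', 'e'), 1),
]

# Sum the counts per (reference, predicted) pair, ignoring the operation type.
pair_counts = Counter()
for (_, src, tgt), count in operations:
    pair_counts[(src, tgt)] += count

# Every character seen on either side becomes a row and a column label.
labels = sorted({ch for pair in pair_counts for ch in pair})

# Rows are reference characters, columns are predicted characters.
# Looking up a missing pair in a Counter yields 0.
matrix = [[pair_counts[(src, tgt)] for tgt in labels] for src in labels]

# Print the matrix with row and column labels.
print('  ' + ' '.join(f'{ch:>3}' for ch in labels))
for ch, row in zip(labels, matrix):
    print(f'{ch} ' + ' '.join(f'{n:>3}' for n in row))
```

Here the ('t', 'a') cell ends up as 8 + 1 = 9 rather than 1, and the printed version carries the labels asked for in the question.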
For plotting, refer to the answers to this question:
import seaborn as sn
import pandas as pd
import matplotlib.pyplot as plt
df_cm = pd.DataFrame(confusion_matrix, letters, letters)
sn.set(font_scale=1.4) # for label size
sn.heatmap(df_cm, annot=True, annot_kws={"size": 16}) # font size
plt.show()