Is it possible to make a confusion matrix from character pairs?

In the context of optical character recognition (OCR), I will try to summarize my problem as best I can:

  1. I have a reference sentence and a predicted sentence.

  2. With the Levenshtein editops function, I built a list of tuples. Each tuple contains: the step type (insert, replace, delete), the character affected in the reference sequence, the character affected in the predicted sequence, and finally the number of times this change occurs across all the reference sentences (in fact, the maximum number of occurrences of each error pair):

[(('insert', 'e', 'm'), 11), (('insert', 't', 'a'), 8), (('insert', 'r', 'o'), 5), (('replace', 'a', 'e'), 2), (('replace', 't', 'T'), 1), (('replace', 'r', 'R'), 1), (('replace', 'M', 'm'), 1), (('delete', ' ', 'a'), 1), (('replace', 'p', 'o'), 1), (('replace', 't', 'a'), 1), (('replace', 'e', 'e'), 1), (('replace', ' ', 'r'), 1), (('insert', ' ', 'd'), 1), (('replace', ' ', 'd'), 1), (('replace', 'i', 'e'), 1), (('replace', 'l', 's'), 1)]
  1. Is it possible to build a sort of "confusion matrix" from these error pairs and their maximum numbers of occurrences in the list above? Like this:

Output example:

Predicted         e         m         t          a        r   ...continue
Reference
e                 1         11        0          0        0
m                 0         0         0          0        0
t                 0         0         0          8        0 
a                 2         0         0          0        0
r                 0         0         0          0        0
...continue                                         

or like this (without labels):

[[1         11        0          0        0
  0         0         0          0        0
  0         0         0          8        0 
  2         0         0          0        0
  0         0         0          0        0]]

Note: in these example matrices, a cell defaults to 0 when the corresponding character error pair was not encountered.

  2. Second, is it possible to visualize this matrix, for example with matplotlib or seaborn?

Any pointers on how to solve this? Thanks in advance.

I would do this with Counters:

operations = [(('insert', 'e', 'm'), 11), (('insert', 't', 'a'), 8), (('insert', 'r', 'o'), 5), (('replace', 'a', 'e'), 2), (('replace', 't', 'T'), 1), (('replace', 'r', 'R'), 1), (('replace', 'M', 'm'), 1), (('delete', ' ', 'a'), 1), (('replace', 'p', 'o'), 1), (('replace', 't', 'a'), 1), (('replace', 'e', 'e'), 1), (('replace', ' ', 'r'), 1), (('insert', ' ', 'd'), 1), (('replace', ' ', 'd'), 1), (('replace', 'i', 'e'), 1), (('replace', 'l', 's'), 1)]

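from collections import defaultdict, Counter

# Map each reference character to a Counter of predicted characters.
intermediary = defaultdict(Counter)
for (_, src, tgt), count in operations:
    # Plain assignment: a later entry for the same (src, tgt) pair overwrites an earlier one.
    intermediary[src][tgt] = count

# Every character that appears as a reference or predicted character, in sorted order.
letters = sorted({key for inner in intermediary.values() for key in inner} | set(intermediary.keys()))

# Rows are reference characters, columns are predicted characters.
confusion_matrix = [[intermediary[src][tgt] for tgt in letters] for src in letters]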

The result looks something like this:

[[0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 1, 0, 0, 11, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
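
Note that the list contains the same character pair under different step types, for example ('insert', 't', 'a') with 8 and ('replace', 't', 'a') with 1. If you would rather add those counts together than keep only one of them, a minimal sketch could look like the following. The helper name build_confusion and the shortened operations list are only for illustration, and whether summing is the right choice depends on what your counts mean.

```python
from collections import Counter, defaultdict

# Shortened copy of the question's list, kept small for illustration.
operations = [
    (("insert", "e", "m"), 11),
    (("insert", "t", "a"), 8),
    (("replace", "t", "a"), 1),
    (("insert", " ", "d"), 1),
    (("replace", " ", "d"), 1),
]


def build_confusion(ops):
    """Return (letters, matrix), summing counts per (reference, predicted) pair."""
    totals = defaultdict(Counter)
    for (_, src, tgt), count in ops:
        # += accumulates counts when a pair appears under several step types.
        totals[src][tgt] += count
    # All characters seen on either side, in sorted order.
    letters = sorted(set(totals) | {tgt for inner in totals.values() for tgt in inner})
    # Rows are reference characters, columns are predicted characters.
    matrix = [[totals[src][tgt] for tgt in letters] for src in letters]
    return letters, matrix


letters, matrix = build_confusion(operations)
```

With this variant the cell for reference 't' and predicted 'a' holds 8 + 1, whereas plain assignment keeps only the last value seen.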

For plotting, refer to the answers to this question:

import seaborn as sn
import pandas as pd
import matplotlib.pyplot as plt

df_cm = pd.DataFrame(confusion_matrix, letters, letters)
sn.set(font_scale=1.4) # for label size
sn.heatmap(df_cm, annot=True, annot_kws={"size": 16}) # font size

plt.show()
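
If a labelled text table like the one in the question is enough and you do not need a plot, a rough sketch using only the standard library could be the following. The function name format_confusion_table and the small letters/matrix values are made up for illustration; in practice you would pass in the letters and confusion_matrix built earlier.

```python
def format_confusion_table(letters, matrix, width=4):
    """Render the matrix as text: reference characters as rows, predicted as columns."""
    # repr() keeps characters such as a space visible in the labels.
    header = "ref\\pred".ljust(10) + "".join(repr(c).rjust(width) for c in letters)
    rows = [
        repr(src).ljust(10) + "".join(str(value).rjust(width) for value in row)
        for src, row in zip(letters, matrix)
    ]
    return "\n".join([header, *rows])


# Small illustrative input, not the full matrix from the answer.
letters = ["a", "e", "m"]
matrix = [[0, 2, 0], [0, 1, 11], [0, 0, 0]]
table = format_confusion_table(letters, matrix)
print(table)
```

Increase width if your counts have more than three digits, otherwise the columns will run together.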
