
Is it possible to make a confusion matrix from character pairs?

In the context of optical character recognition (OCR), I will try to summarize my problem as best I can:

  1. I have a reference sentence and a predicted sentence.

  2. Using the Levenshtein editops function, I built a list of tuples. Each tuple contains an edit, made up of the step type (insert, replace, delete), the affected character in the reference sequence and the affected character in the predicted sequence, followed by the number of times this edit occurs across the whole reference sentence (that is, how often this pair of errors recurs):

[(('insert', 'e', 'm'), 11), (('insert', 't', 'a'), 8), (('insert', 'r', 'o'), 5), (('replace', 'a', 'e'), 2), (('replace', 't', 'T'), 1), (('replace', 'r', 'R'), 1), (('replace', 'M', 'm'), 1), (('delete', ' ', 'a'), 1), (('replace', 'p', 'o'), 1), (('replace', 't', 'a'), 1), (('replace', 'e', 'e'), 1), (('replace', ' ', 'r'), 1), (('insert', ' ', 'd'), 1), (('replace', ' ', 'd'), 1), (('replace', 'i', 'e'), 1), (('replace', 'l', 's'), 1)]
  3. Is it possible to build a kind of "confusion matrix" from the list above, using these error pairs and their occurrence counts? Something like this:

Output example

Predicted         e         m         t          a        r   ...continue
Reference
e                 1         11        0          0        0
m                 0         0         0          0        0
t                 0         0         0          8        0 
a                 2         0         0          0        0
r                 0         0         0          0        0
...continue                                         

or like this (without labels):

[[1         11        0          0        0
  0         0         0          0        0
  0         0         0          8        0 
  2         0         0          0        0
  0         0         0          0        0]]

Note: in these example matrices, 0 is the default value for any character pair that never occurs as an error.

  4. As a second step, is it possible to visualize this matrix, for example with matplotlib or seaborn?

Any pointers on how to approach this? Thanks in advance.

I would do this with Counters:

operations = [(('insert', 'e', 'm'), 11), (('insert', 't', 'a'), 8), (('insert', 'r', 'o'), 5), (('replace', 'a', 'e'), 2), (('replace', 't', 'T'), 1), (('replace', 'r', 'R'), 1), (('replace', 'M', 'm'), 1), (('delete', ' ', 'a'), 1), (('replace', 'p', 'o'), 1), (('replace', 't', 'a'), 1), (('replace', 'e', 'e'), 1), (('replace', ' ', 'r'), 1), (('insert', ' ', 'd'), 1), (('replace', ' ', 'd'), 1), (('replace', 'i', 'e'), 1), (('replace', 'l', 's'), 1)]

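from collections import defaultdict, Counter

# Map each reference character to a Counter of predicted characters.
# Note: with plain assignment, a pair that occurs under several operation
# types (e.g. ('t', 'a') as both insert and replace) keeps only the last
# count seen; use += instead to sum the counts.
intermediary = defaultdict(Counter)
for (_, src, tgt), count in operations:
    intermediary[src][tgt] = count

# Every character seen as a source or a target, in sorted order.
letters = sorted({key for inner in intermediary.values() for key in inner} | set(intermediary.keys()))

# Rows are reference characters, columns are predicted characters.
confusion_matrix = [[intermediary[src][tgt] for tgt in letters] for src in letters]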

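The result looks like this (rows and columns follow the order of letters: ' ', 'M', 'R', 'T', 'a', 'd', 'e', 'i', 'l', 'm', 'o', 'p', 'r', 's', 't'):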

[[0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 1, 0, 0, 11, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
 [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
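If the same character pair can show up under more than one operation type, summing the counts may be closer to what the question asks for. The following is only a sketch under that assumption: it uses a shortened copy of the list, and the names pair_counts and format_table are hypothetical helpers, not part of any library.

```python
from collections import Counter, defaultdict

# Shortened copy of the list from the question, for illustration only.
operations = [
    (('insert', 'e', 'm'), 11),
    (('insert', 't', 'a'), 8),
    (('replace', 'a', 'e'), 2),
    (('replace', 't', 'a'), 1),
    (('replace', 'e', 'e'), 1),
]

# Sum the counts per (reference, predicted) pair, ignoring the operation type.
pair_counts = defaultdict(Counter)
for (_, ref_char, pred_char), count in operations:
    pair_counts[ref_char][pred_char] += count

# Every character seen on either side becomes both a row and a column.
letters = sorted(set(pair_counts) | {p for row in pair_counts.values() for p in row})

# Rows are reference characters, columns are predicted characters.
matrix = [[pair_counts[ref][pred] for pred in letters] for ref in letters]


def format_table(matrix, labels):
    """Return the matrix as text, with labels shown via repr so ' ' stays visible."""
    lines = ["     " + "".join(f"{label!r:>5}" for label in labels)]
    for label, row in zip(labels, matrix):
        lines.append(f"{label!r:>5}" + "".join(f"{value:>5}" for value in row))
    return "\n".join(lines)


print(format_table(matrix, letters))
```

In this shortened example the pair ('t', 'a') ends up with 8 + 1 = 9, because the insert and the replace entries are added together.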

For plotting, refer to the answers to this question:

import seaborn as sn
import pandas as pd
import matplotlib.pyplot as plt

# Rows (index) are reference characters, columns are predicted characters.
df_cm = pd.DataFrame(confusion_matrix, index=letters, columns=letters)
sn.set(font_scale=1.4)  # label size
sn.heatmap(df_cm, annot=True, annot_kws={"size": 16})  # annotation font size

plt.show()
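Since most cells of such a matrix are zero, it may help to drop the empty rows and columns before plotting. This is a minimal sketch, assuming a square list-of-lists matrix with one shared label list; trim_empty is a hypothetical helper and the small matrix below is made up for illustration.

```python
def trim_empty(matrix, labels):
    """Drop rows and columns that contain only zeros.

    Returns (trimmed_matrix, row_labels, col_labels).
    """
    row_idx = [i for i, row in enumerate(matrix) if any(row)]
    col_idx = [j for j in range(len(labels)) if any(row[j] for row in matrix)]
    trimmed = [[matrix[i][j] for j in col_idx] for i in row_idx]
    return trimmed, [labels[i] for i in row_idx], [labels[j] for j in col_idx]


# Tiny made-up example: row 'm' and column 't' contain only zeros.
labels = ['a', 'e', 'm', 't']
matrix = [
    [0, 2, 0, 0],
    [0, 1, 11, 0],
    [0, 0, 0, 0],
    [8, 0, 0, 0],
]

trimmed, row_labels, col_labels = trim_empty(matrix, labels)
print(row_labels, col_labels)
print(trimmed)
```

The trimmed result can then be passed on as pd.DataFrame(trimmed, index=row_labels, columns=col_labels) and plotted with the same heatmap call.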


 