In the context of optical character recognition, I will try to summarize my issue as best I can:
I have a reference sentence and a prediction sentence.
With the Levenshtein editops function, I built a list of tuples. Each tuple contains: an operation type (insert, replace, delete), the character affected in the reference sequence, the character affected in the prediction sequence, and finally the number of times this change occurs over the whole reference sentence (that is, the number of occurrences of each such error pair):
[(('insert', 'e', 'm'), 11), (('insert', 't', 'a'), 8), (('insert', 'r', 'o'), 5), (('replace', 'a', 'e'), 2), (('replace', 't', 'T'), 1), (('replace', 'r', 'R'), 1), (('replace', 'M', 'm'), 1), (('delete', ' ', 'a'), 1), (('replace', 'p', 'o'), 1), (('replace', 't', 'a'), 1), (('replace', 'e', 'e'), 1), (('replace', ' ', 'r'), 1), (('insert', ' ', 'd'), 1), (('replace', ' ', 'd'), 1), (('replace', 'i', 'e'), 1), (('replace', 'l', 's'), 1)]
Desired output example (rows are reference characters, columns are predicted characters):
             Predicted
Reference    e   m   t   a   r   ...continue
e            1  11   0   0   0
m            0   0   0   0   0
t            0   0   0   8   0
a            2   0   0   0   0
r            0   0   0   0   0
...continue
or like this (without labels):
[[1 11 0 0 0
0 0 0 0 0
0 0 0 8 0
2 0 0 0 0
0 0 0 0 0]]
Note: in these matrix examples, a cell defaults to 0 when the corresponding character error pair is never encountered.
Any pointers on how to solve this? Thanks in advance.
I would do this with Counters:
operations = [(('insert', 'e', 'm'), 11), (('insert', 't', 'a'), 8), (('insert', 'r', 'o'), 5), (('replace', 'a', 'e'), 2), (('replace', 't', 'T'), 1), (('replace', 'r', 'R'), 1), (('replace', 'M', 'm'), 1), (('delete', ' ', 'a'), 1), (('replace', 'p', 'o'), 1), (('replace', 't', 'a'), 1), (('replace', 'e', 'e'), 1), (('replace', ' ', 'r'), 1), (('insert', ' ', 'd'), 1), (('replace', ' ', 'd'), 1), (('replace', 'i', 'e'), 1), (('replace', 'l', 's'), 1)]
from collections import defaultdict, Counter
intermediary = defaultdict(Counter)
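# The operation type is ignored; only the (reference, predicted) pair is kept.
# Plain assignment keeps the last count seen when a pair occurs under several
# operation types (e.g. 'insert' t/a and 'replace' t/a); use += to sum them.
for (_, src, tgt), count in operations:
    intermediary[src][tgt] = count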
letters = sorted({key for inner in intermediary.values() for key in inner} | set(intermediary.keys()))
confusion_matrix = [[intermediary[src][tgt] for tgt in letters] for src in letters]
The result looks something like this:
[[0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 11, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
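Note that in the matrix above the t/a cell (last row) holds 1 rather than 8, because the loop assigns each count and ('replace', 't', 'a') comes after ('insert', 't', 'a') in the list. If you would rather add the counts of every operation type together, a minimal sketch could look like the following. It uses a shortened subset of the data, and the helper names `build_confusion` and `format_labelled` are made up for this illustration; they are not part of any library.

```python
from collections import Counter, defaultdict

# Shortened subset of the operations list, for illustration only.
operations = [
    (('insert', 't', 'a'), 8),
    (('replace', 't', 'a'), 1),
    (('insert', 'e', 'm'), 11),
    (('replace', 'a', 'e'), 2),
    (('replace', 'e', 'e'), 1),
]

def build_confusion(operations):
    """Return (labels, matrix), summing counts across operation types."""
    table = defaultdict(Counter)
    for (_, src, tgt), count in operations:
        table[src][tgt] += count  # accumulate instead of overwriting
    # Every character seen as a reference or as a prediction becomes a label.
    labels = sorted(set(table) | {tgt for row in table.values() for tgt in row})
    matrix = [[table[src][tgt] for tgt in labels] for src in labels]
    return labels, matrix

def format_labelled(labels, matrix):
    """Render the matrix as text: reference chars as rows, predicted as columns."""
    width = max(len(str(value)) for row in matrix for value in row) + 1
    lines = [" " + "".join(f"{label:>{width}}" for label in labels)]
    for label, row in zip(labels, matrix):
        lines.append(label + "".join(f"{value:>{width}}" for value in row))
    return "\n".join(lines)

labels, matrix = build_confusion(operations)
print(format_labelled(labels, matrix))
```

With this subset the t/a cell becomes 9 (8 inserts plus 1 replace). Whether summing across operation types is what you want depends on how you intend to read the matrix, so treat this as one possible convention rather than the only correct one.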
For plotting, refer to the answers to this question:
import seaborn as sn
import pandas as pd
import matplotlib.pyplot as plt
df_cm = pd.DataFrame(confusion_matrix, index=letters, columns=letters)  # rows: reference, columns: predicted
sn.set(font_scale=1.4) # for label size
sn.heatmap(df_cm, annot=True, annot_kws={"size": 16}) # font size
plt.show()