
String-matching algorithm for noisy text

Original post: 2014-11-02 01:21:37 · Tags: string, substring, ocr

I have used OCR (optical character recognition) to get text from images. The images are book covers. Because the images are so noisy, some characters are misrecognised, and some noise is recognised as characters.

Examples:

"w COMPUTER Nnwonxs i I " (Computer Networks)
"s.ll NEURAL NETWORKS C " (Neural Networks)
"1llllll INFRODUCIION ro PROBABILITY ti iitiiili My " (Introduction to Probability)

I built a dictionary of words, and I want to somehow match the recognised text against it. I tried LCS (longest common subsequence), but it's not very effective.

What is the best string-matching algorithm for this kind of problem? (Part of the string is just noise, and the important part of the string can also contain some misrecognised characters.)

2 answers

That's really a big question. The following is what I know about it. For more details, you can read some related papers.

For a single word, use the Hamming distance to measure the similarity between the word recognized by OCR and the words in your dictionary;

this step corrects words that OCR has recognized but that do not exist in the dictionary.

E.g.: if the OCR result is INFRODUCIION, which doesn't exist in your dictionary, you can find that its Hamming distance to the word 'INTRODUCTION' is 2. So 'INTRODUCTION' may have been misrecognized as 'INFRODUCIION'. However, the same OCR result may have the same Hamming distance to several different dictionary words.

E.g.: if the OCR result is CAY, you may find that CAR and CAT both have a Hamming distance of 1, so the choice between them is ambiguous.
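
The two examples above can be checked with a few lines of Python. This is only a minimal sketch: the function names and the tiny dictionary are made up for illustration, and note that the Hamming distance is defined only for strings of equal length, so dictionary words of a different length are simply skipped here.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def closest_by_hamming(word: str, dictionary: list[str]) -> list[str]:
    """Return every same-length dictionary word at the minimum distance."""
    candidates = [w for w in dictionary if len(w) == len(word)]
    if not candidates:
        return []
    best = min(hamming(word, w) for w in candidates)
    return [w for w in candidates if hamming(word, w) == best]


# Hypothetical dictionary, for illustration only.
words = ["INTRODUCTION", "PROBABILITY", "CAR", "CAT", "DOG"]
print(hamming("INFRODUCIION", "INTRODUCTION"))  # 2
print(closest_by_hamming("CAY", words))         # ['CAR', 'CAT']
```

Returning all words at the minimum distance, rather than a single winner, keeps the CAR/CAT tie visible so that a later step can resolve it.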

In this case, several things can be used for the analysis:

Still for a single word, the image difference between CAT and CAY is smaller than that between CAR and CAY. For this reason, CAT is the more probable word.
Then use the context to calculate another probability. If the whole sentence is 'I drove my new CAY this morning', then since people usually drive a CAR and not a CAT, CAY is more likely to be CAR than CAT.
For the frequency of the words used in similar articles, use TF-IDF.
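
As a rough illustration of the first two ideas, the sketch below multiplies a character-shape score by a context score to rank the tied candidates. Everything in it is an assumption: the `CONFUSION` and `BIGRAM_COUNT` tables, their numbers, and the function names are invented for the example; real values would have to come from measured OCR confusions and from counts over a text corpus.

```python
# All numbers below are made up for illustration only.

# Shape: how plausible it is that OCR read the true letter as the seen one,
# keyed as (true_letter, seen_letter).
CONFUSION = {("T", "Y"): 0.30, ("R", "Y"): 0.05}

# Context: how often a candidate follows the previous word.
BIGRAM_COUNT = {("NEW", "CAR"): 120, ("NEW", "CAT"): 15}


def shape_score(seen: str, candidate: str) -> float:
    """Product of confusion probabilities over the differing positions."""
    score = 1.0
    for s, c in zip(seen, candidate):
        if s != c:
            score *= CONFUSION.get((c, s), 0.01)  # small default for unknown pairs
    return score


def context_score(previous: str, candidate: str) -> float:
    """Relative frequency of the candidate after the previous word."""
    total = sum(n for (p, _), n in BIGRAM_COUNT.items() if p == previous)
    return BIGRAM_COUNT.get((previous, candidate), 0) / total if total else 0.0


def rank(seen: str, previous: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Candidates sorted by combined score, best first."""
    scored = [(c, shape_score(seen, c) * context_score(previous, c)) for c in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


for word, score in rank("CAY", "NEW", ["CAR", "CAT"]):
    print(word, round(score, 4))
```

With these invented numbers the shape score alone prefers CAT while the context outweighs it and puts CAR first; with different tables the ranking could easily flip, which is exactly why both signals need real data behind them.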

Are you saying you have a dictionary that defines all acceptable words?

If so, it should be fairly straightforward to take each word and find the closest match in your dictionary. Set a match threshold and discard the word if it does not reach the threshold.

I would experiment with the Soundex and Metaphone algorithms or the Levenshtein distance algorithm.
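
A minimal sketch of the closest-match-plus-threshold idea, using only the Levenshtein distance (Soundex and Metaphone are not shown). The dictionary, the `best_match` helper, and the `max_ratio` value of 0.34 are assumptions chosen for the example, not recommended settings; the OCR string is the third example from the question.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        previous = current
    return previous[-1]


def best_match(token: str, dictionary: list[str], max_ratio: float = 0.34):
    """Closest dictionary word, or None if it is too far from the token."""
    token = token.upper()
    word = min(dictionary, key=lambda w: levenshtein(token, w))
    distance = levenshtein(token, word)
    # Normalise by length so that short noise tokens are not accepted cheaply.
    if distance / max(len(token), len(word)) > max_ratio:
        return None
    return word


# Hypothetical dictionary, for illustration only.
dictionary = ["INTRODUCTION", "TO", "PROBABILITY", "COMPUTER", "NETWORKS", "NEURAL"]
ocr = "1llllll INFRODUCIION ro PROBABILITY ti iitiiili My"
matches = [best_match(t, dictionary) for t in ocr.split()]
print([m for m in matches if m is not None])  # ['INTRODUCTION', 'PROBABILITY']
```

Dividing the distance by the longer length makes the threshold stricter for short tokens: "ro" is one edit away from "TO", but that is half of the word, so it is discarded along with "ti". Whether that trade-off is acceptable depends on how much short words matter for identifying a title.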