Fuzzy matching a word on an OCR page

I have a static phrase that I am searching an OCR'd image for.

string KeywordToFind = "Account Number";

string OcrPageText = @"
GEORGIA
POWER

A SOUTHERN COMPANY

AecountNumber

122- 493

Pagel of2

Please Pay By
Jan 29,2014

Total Due
39.11
";

How can I find the word "AecountNumber" using my keyword "Account Number"?

I have tried variations of the Levenshtein distance algorithm (linked in the original post), with varied success. I've also tried regexes, but the OCR often converts the text differently each time, which renders a strict regex useless.

Suggestions? I can provide more code if the link doesn't give enough information. Also, thanks!

Why not try something mostly arbitrary, like this? While it would certainly match a lot more than just "Account Number", the chances of the start and end characters existing elsewhere in that order are pretty slim.

A.?c.?.?nt ?N.?[mn]b.?r

http://regex101.com/r/zV1yM2

It'll match things like:

Account Number
AccntNumbr
Aecnt Nunber
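
As a rough sketch of how that pattern could be applied from C#, the snippet below wraps it in a small helper. The class name OcrKeywordRegex and the method TryFindKeyword are made up for illustration and are not part of the original answer; only System.Text.RegularExpressions.Regex is a real API here.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper (names are illustrative, not from the original answer).
public static class OcrKeywordRegex
{
    // The loose pattern from the answer above. By default "." does not match
    // a newline, so a match cannot span two OCR lines.
    private static readonly Regex AccountNumberPattern =
        new Regex(@"A.?c.?.?nt ?N.?[mn]b.?r");

    // Returns true and the matched text when the pattern is found,
    // otherwise false and an empty string.
    public static bool TryFindKeyword(string ocrPageText, out string found)
    {
        Match match = AccountNumberPattern.Match(ocrPageText);
        found = match.Success ? match.Value : string.Empty;
        return match.Success;
    }
}
```

Run against the OCR text from the question, this should pick out "AecountNumber". The pattern is case-sensitive as written, so an OCR result such as "aecountnumber" would need RegexOptions.IgnoreCase.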

I answered my own question with the use of substrings. Posting in case others run into the same type of problem. A little unorthodox, but it works great for me.

// StaticText, PossibleString, ErrorAllowance and PossibleResult are assumed to be defined elsewhere.
decimal PossibleStringLength = PossibleString.Length; // Length of the string to search
decimal StaticTextLength = StaticText.Length; // Length of the text to search for
int TextLengthBuffer = (int)StaticTextLength - 1; // Start looking for the correct result with one character fewer than it should have.
int LowestLevenshteinNumber = 999999; // Initialize to an insanely high maximum
// Number of errors allowed for the given ErrorAllowance percentage (100m keeps the division in decimal)
decimal NumberOfErrorsAllowed = Math.Round(StaticTextLength * (ErrorAllowance / 100m), MidpointRounding.AwayFromZero);

// Look for the best match with one character fewer than it should have, then the correct number of characters,
// and last, with one more character. (This is because one letter can be recognized as
// two (W -> VV) and vice versa.)

for (int i = 0; i < 3; i++) 
{
    for (int e = TextLengthBuffer; e <= (int)PossibleStringLength; e++)
    {
        string possibleResult = (PossibleString.Substring((e - TextLengthBuffer), TextLengthBuffer));
        int lAllowance = (int)(Math.Round((possibleResult.Length - StaticTextLength) + (NumberOfErrorsAllowed), MidpointRounding.AwayFromZero));
        int lNumber = LevenshteinAlgorithm(StaticText, possibleResult);

        if (lNumber <= lAllowance && ((lNumber < LowestLevenshteinNumber) || (TextLengthBuffer == StaticText.Length && lNumber <= LowestLevenshteinNumber)))
        {
            PossibleResult = (new StaticTextResult { text = possibleResult, errors = lNumber });
            LowestLevenshteinNumber = lNumber;
        }
    }
    TextLengthBuffer++;
}




public static int LevenshteinAlgorithm(string s, string t) // Levenshtein Algorithm
{
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];

    if (n == 0)
    {
        return m;
    }

    if (m == 0)
    {
        return n;
    }

    // First column and first row: distance between a prefix and the empty string.
    for (int i = 0; i <= n; i++)
    {
        d[i, 0] = i;
    }

    for (int j = 0; j <= m; j++)
    {
        d[0, j] = j;
    }

    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= m; j++)
        {
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }
    return d[n, m];
}
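
For anyone who wants to try the idea without the surrounding code, here is a minimal self-contained sketch of the same sliding-window approach. It is a simplification, not the exact code above: it takes a fixed maxErrors instead of a percentage allowance, keeps the first best window on ties, and the names FuzzyKeywordSearch, Levenshtein and TryFindBestMatch are invented for this example.

```csharp
using System;

// Hypothetical, simplified version of the sliding-window search (names are illustrative).
public static class FuzzyKeywordSearch
{
    // Dynamic-programming Levenshtein distance using two rolling rows.
    public static int Levenshtein(string s, string t)
    {
        int[] previous = new int[t.Length + 1];
        int[] current = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) previous[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }
            int[] swap = previous; previous = current; current = swap;
        }
        return previous[t.Length];
    }

    // Slides windows of length keyword.Length - 1, keyword.Length and
    // keyword.Length + 1 over the text. Returns true and the closest window
    // when some window is within maxErrors edits of the keyword.
    public static bool TryFindBestMatch(string text, string keyword, int maxErrors,
                                        out string bestMatch, out int bestDistance)
    {
        bestMatch = string.Empty;
        bestDistance = int.MaxValue;

        for (int length = Math.Max(1, keyword.Length - 1); length <= keyword.Length + 1; length++)
        {
            for (int start = 0; start + length <= text.Length; start++)
            {
                string window = text.Substring(start, length);
                int distance = Levenshtein(keyword, window);
                if (distance <= maxErrors && distance < bestDistance)
                {
                    bestMatch = window;
                    bestDistance = distance;
                }
            }
        }
        return bestDistance != int.MaxValue;
    }
}
```

Calling TryFindBestMatch with the OCR text from the question, the keyword "Account Number" and maxErrors set to 3 should return "AecountNumber" at a distance of 2 (one substitution plus the missing space). Every window is compared in full, so the cost grows with text length times keyword length squared; that is fine for a single page but worth keeping in mind for large batches.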
