C# Looking for similar needle in haystack (for OCR)

I've been working on an OCR program that accepts a photo with text in it (in this specific case, a driver's license) as well as a first name and a last name as arguments.

Once the software has read the ID photo, I search the recognized text for the first and last name. Unfortunately, because the image quality can be pretty low, the OCR sometimes does not get the name quite right.

Is there a way I could look for a SIMILAR needle in a haystack? That is, look for any occurrences that are similar to the first/last name? For example:

Needle: campbell

Haystack: 
operaioxsllcence 
gcltdriver 
exries13NOV2020
carnpbeiljtttj
...

The string that would be close enough is "carnpbeil".

This is what I'm using now, and it only helps in very specific situations:

    // Returns true if needle occurs in haystack, either verbatim or after applying
    // a single common OCR confusion (for example "rn" read as "m") to the haystack.
    private bool SourceContains(string haystack, string needle)
    {
        return haystack.Contains(needle) ||
            haystack.Replace("l", "i").Contains(needle) ||
            haystack.Replace("i", "l").Contains(needle) ||
            haystack.Replace("0", "o").Contains(needle) ||
            haystack.Replace("o", "0").Contains(needle) ||
            haystack.Replace("j", "d").Contains(needle) ||
            haystack.Replace("d", "j").Contains(needle) ||
            haystack.Replace("i", "j").Contains(needle) ||
            haystack.Replace("j", "i").Contains(needle) ||
            haystack.Replace("e", "f").Contains(needle) ||
            haystack.Replace("f", "e").Contains(needle) ||
            haystack.Replace("r", "p").Contains(needle) ||
            haystack.Replace("p", "r").Contains(needle) ||
            haystack.Replace("s", "r").Contains(needle) ||
            haystack.Replace("r", "s").Contains(needle) ||
            haystack.Replace("r", "n").Contains(needle) ||
            haystack.Replace("n", "r").Contains(needle) ||
            haystack.Replace("k", "n").Contains(needle) ||
            haystack.Replace("n", "k").Contains(needle) ||
            haystack.Replace("h", "n").Contains(needle) ||
            haystack.Replace("n", "h").Contains(needle) ||
            haystack.Replace("k", "ll").Contains(needle) ||
            haystack.Replace("ll", "k").Contains(needle) ||
            haystack.Replace("ci", "d").Contains(needle) ||
            haystack.Replace("d", "ci").Contains(needle) ||
            haystack.Replace("cl", "d").Contains(needle) ||
            haystack.Replace("d", "cl").Contains(needle) ||
            haystack.Replace("m", "in").Contains(needle) ||
            haystack.Replace("in", "m").Contains(needle) ||
            haystack.Replace("rn", "m").Contains(needle) ||
            haystack.Replace("m", "rn").Contains(needle);
    }

For each word in the haystack, calculate the Levenshtein distance to the needle. The word with the smallest distance is the most likely match for your needle. Have a look at this question for implementations.
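
As a rough illustration of that idea, here is a minimal sketch in C#. The names `FuzzyMatch`, `Distance` and `ClosestWord` are made up for this example and are not taken from the linked implementations. One assumption goes beyond the suggestion above: OCR words often carry extra characters (as in "carnpbeiljtttj"), which inflates the whole-word distance, so the sketch also offers an "anywhere" mode that scores the needle against the best-matching substring of each word. This is the usual approximate substring variant of the same dynamic-programming table, where skipping text before and after the match costs nothing.

```csharp
using System;

public static class FuzzyMatch
{
    // Edit distance between pattern and text. When anywhere is true, the match may
    // start and end at any position in text, so the result is the smallest
    // Levenshtein distance between pattern and any substring of text.
    public static int Distance(string pattern, string text, bool anywhere)
    {
        var prev = new int[text.Length + 1];
        var curr = new int[text.Length + 1];

        // Row 0: an empty pattern. Skipping leading text is free only in "anywhere" mode.
        for (int j = 0; j <= text.Length; j++) prev[j] = anywhere ? 0 : j;

        for (int i = 1; i <= pattern.Length; i++)
        {
            curr[0] = i; // i pattern characters against empty text
            for (int j = 1; j <= text.Length; j++)
            {
                int substitution = prev[j - 1] + (pattern[i - 1] == text[j - 1] ? 0 : 1);
                int deletion = prev[j] + 1;      // pattern char has no counterpart in text
                int insertion = curr[j - 1] + 1; // text char has no counterpart in pattern
                curr[j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }
            var swap = prev; prev = curr; curr = swap;
        }

        if (!anywhere) return prev[text.Length];

        // Trailing text is free as well: take the best end position.
        int best = prev[0];
        for (int j = 1; j <= text.Length; j++) best = Math.Min(best, prev[j]);
        return best;
    }

    // Returns the whitespace-separated word of haystack that is closest to needle
    // (case-insensitive), or null if haystack contains no words.
    public static string ClosestWord(string haystack, string needle, out int distance)
    {
        string bestWord = null;
        distance = int.MaxValue;
        string target = needle.ToLowerInvariant();
        foreach (string word in haystack.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            int d = Distance(target, word.ToLowerInvariant(), true);
            if (d < distance) { distance = d; bestWord = word; }
        }
        return bestWord;
    }
}
```

With the haystack from the question, `FuzzyMatch.ClosestWord(text, "campbell", out int d)` should pick "carnpbeiljtttj" with a distance of 3 ("m" read as "rn" costs two edits, "l" read as "i" costs one). You would still need to choose a cutoff for accepting a match, for instance a fraction of the name's length; that threshold is a tuning decision and not something the distance itself provides.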
