I am working on an iOS Swift project that takes OCR data and then searches the text for key phrases. The OCR output looks like this:
INGREDIENTS WATER, BROWN SUGAR, RED RIPE
TOMATO CONCENTRATE, APPLE CIDERVINEGAR
W01CESTERSHlWSMJCE(WATERW4EGAR CORN
SYRUP, SALT, MOLASSE, SPICE, NATURAL FLAVOR
GARLIC POWDER, CARAMEL COLOR, ANCHOVIES
CFlSril,TAMARiN0), MOLASSES, LEMON JUICE,
ONION, HONEY, MODIFIED TAVIOCA STARCH,
When I search the string for "corn syrup", nothing is found. Searching for "corn" and "syrup" does produce positive results.
I have also tried
tesseract.recognizedText.stringByTrimmingCharactersInSet(NSCharacterSet.whitespaceAndNewlineCharacterSet())
to no avail.
Any thoughts on how to format this text for searching that would allow "corn syrup" to be identified? The qualifier is that only the exact phrase is useful - after all there are corn, corn starch, maple syrup, etc. as potential ingredients.
Thanks.
OK, here is the solution that worked:
textView.text = tesseract.recognizedText.stringByReplacingOccurrencesOfString("\n", withString: " ", options: NSStringCompareOptions.LiteralSearch, range: nil)
I thought the initial code was accomplishing the same task.
If you want to search for "corn syrup", you most likely need to replace all new lines with spaces (and then ideally check for double spaces and replace with single space).
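As a rough sketch of that idea, written against the current Swift Foundation API rather than the older one used in the question: split on all whitespace, drop the empty pieces, and join with single spaces, which handles both the newlines and any double spaces in one pass. The function names here are my own, not part of Tesseract or Foundation.

```swift
import Foundation

/// Collapses every run of whitespace (spaces, tabs, newlines) into one space.
/// Hypothetical helper name, used only for this illustration.
func normalizeWhitespace(_ text: String) -> String {
    return text
        .components(separatedBy: .whitespacesAndNewlines)
        .filter { !$0.isEmpty }          // drops the empty pieces left by "\n " or double spaces
        .joined(separator: " ")
}

/// Case-insensitive phrase search on the normalized text.
func containsPhrase(_ phrase: String, in ocrText: String) -> Bool {
    let haystack = normalizeWhitespace(ocrText)
    let needle = normalizeWhitespace(phrase)
    return haystack.range(of: needle, options: .caseInsensitive) != nil
}

// Two lines taken from the OCR output in the question.
let sample = "W01CESTERSHlWSMJCE(WATERW4EGAR CORN\nSYRUP, SALT, MOLASSE, SPICE, NATURAL FLAVOR"
print(containsPhrase("corn syrup", in: sample))   // prints "true"
```

Note that this is still a substring search, so it would also report a hit inside a longer ingredient name that happens to contain the phrase.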
The quality of the character recognition is not very good, and I think the text deserves more cleanup before being used for searching. You might, for example, split the text into an array of individual strings, trim spaces etc. from the beginning and end of each one, and perhaps use UITextChecker to help identify misspelled terms and fix them...
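Something along these lines might work for the splitting and trimming part. It is only a sketch: it assumes commas and parentheses separate the ingredients, the function name is made up for the example, and the UITextChecker step is left out because it needs UIKit.

```swift
import Foundation

/// Splits OCR text into cleaned-up, lowercased ingredient strings.
/// Assumes ingredients are delimited by commas or parentheses (an assumption about the label format).
func ingredientTokens(from ocrText: String) -> [String] {
    let separators = CharacterSet(charactersIn: ",()")
    return ocrText
        .components(separatedBy: separators)
        .map { (piece: String) -> String in
            // Trim the ends and collapse inner whitespace, including line breaks.
            piece
                .components(separatedBy: .whitespacesAndNewlines)
                .filter { !$0.isEmpty }
                .joined(separator: " ")
                .lowercased()
        }
        .filter { !$0.isEmpty }
}

let tokens = ingredientTokens(from: "ONION, HONEY,\nMODIFIED TAVIOCA STARCH, CORN\nSYRUP, SALT")
print(tokens.contains("corn syrup"))   // prints "true"
print(tokens.contains("corn"))         // prints "false"
```

Matching whole array elements gives the exact-phrase behaviour the question asks for, since "corn" alone no longer matches "corn syrup". One caveat: where the OCR drops a comma at the end of a line, two ingredients end up merged into one element.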
That's because "corn syrup", which is the string you're looking for, is not the same as "corn\nsyrup", which is what your wall of text contains.
You could try searching for "corn\nsyrup" or "corn \nsyrup" instead.
Notice in your picture how "corn\nsyrup" produces the same results that your wall of text is showing?
Also, your code to replace "\n" with " " might not be working because the text could be "corn\n syrup", in which case the replacement leaves two spaces between the words.
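If you would rather not guess which combination of spaces and line breaks sits between the two words, one option is to search with a regular expression that accepts any run of whitespace there. This is a sketch using modern Swift Foundation calls; the function name is invented for the example.

```swift
import Foundation

/// Returns true if the words of `phrase` appear in order in `text`,
/// separated by any amount of whitespace (spaces, tabs, or line breaks).
func containsPhraseIgnoringLineBreaks(_ phrase: String, in text: String) -> Bool {
    // Escape each word so punctuation in the phrase is not treated as regex syntax.
    let words = phrase
        .split(separator: " ")
        .map { NSRegularExpression.escapedPattern(for: String($0)) }
    guard !words.isEmpty else { return false }

    // "corn syrup" becomes the pattern corn\s+syrup
    let pattern = words.joined(separator: "\\s+")
    return text.range(of: pattern, options: [.regularExpression, .caseInsensitive]) != nil
}

print(containsPhraseIgnoringLineBreaks("corn syrup", in: "WATERW4EGAR CORN\nSYRUP, SALT"))   // prints "true"
print(containsPhraseIgnoringLineBreaks("corn syrup", in: "CORN \n SYRUP"))                  // prints "true"
```

This leaves the original text untouched, so it also covers the "corn\n syrup" case without a separate pass to remove double spaces.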