I am working on a Spring MVC application into which we are currently integrating OCR functionality. OCR engines tend to emit stray characters for misdetections and when there is an image in the background. After preprocessing the image we get reasonably good data, but some errors remain. We would like to post-process the output as follows
Code :
File imageFile = new File(fileLocation);
BufferedImage img = ImageIO.read(imageFile);
BufferedImage blackNWhite = new BufferedImage(img.getWidth(),img.getHeight(),BufferedImage.TYPE_BYTE_BINARY);
Graphics2D graphics = blackNWhite.createGraphics();
graphics.drawImage(img, 0, 0, null);
// Release the resources held by the graphics context once drawing is done
graphics.dispose();
String blackAndWhiteImage = zipLocation + new BigInteger(130, random).toString(32) + ".png";
File outputfile = new File(blackAndWhiteImage);
ImageIO.write(blackNWhite, "png", outputfile);
ITesseract instance = new Tesseract();
// Point to one folder above tessdata directory, must contain training data
instance.setDatapath("/usr/share/tesseract-ocr/");
// ISO 639-3 language code ("deu" = German)
instance.setLanguage("deu");
String result = instance.doOCR(outputfile);
//System.out.println(result);
// Strips every non-ASCII character (this also removes ä, ö, ü and ß)
result = result.replaceAll("\\P{ASCII}","");
System.out.println("Result is "+result);
return result;
Thank you.
Update
Stray characters left behind by the regex:
|
| '(°Ul")
_} °
=# '
( )
...................................__+_......_._._.__._._._+._._.
Ad. 1.
result = result.replaceAll("\\s[a-zA-ZöÖäÄüÜß]\\s", "");
Ad. 2.
result = result.replaceAll("[^a-zA-ZöÖäÄüÜß]", "");
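As a rough sketch of what the two replacements above do, assuming the pattern in Ad. 1 is meant to use the whitespace class `\s`, and keeping in mind that `replaceAll` returns a new string instead of modifying `result`. The class and method names are made up for illustration, and the "keep gap" variant is my assumption about what is probably wanted, not part of the answer. The umlauts are written as Unicode escapes only so that the file compiles regardless of its encoding.

```java
final class AnswerRegexDemo {
    // Letters treated as valid: a-z, A-Z, the German umlauts and sharp s,
    // written as Unicode escapes so the source file encoding does not matter.
    private static final String LETTERS =
            "a-zA-Z\u00f6\u00d6\u00e4\u00c4\u00fc\u00dc\u00df";

    // Ad. 1: removes a single letter together with the whitespace
    // character on each side of it.
    static String dropIsolatedLetters(String text) {
        return text.replaceAll("\\s[" + LETTERS + "]\\s", "");
    }

    // Hypothetical variant: same pattern, but one space is put back so the
    // neighbouring words are not glued together.
    static String dropIsolatedLettersKeepGap(String text) {
        return text.replaceAll("\\s[" + LETTERS + "]\\s", " ");
    }

    // Ad. 2: removes everything that is not one of the listed letters,
    // which includes digits, punctuation and whitespace.
    static String lettersOnly(String text) {
        return text.replaceAll("[^" + LETTERS + "]", "");
    }

    public static void main(String[] args) {
        System.out.println(dropIsolatedLetters("Haus x Baum"));
        System.out.println(dropIsolatedLettersKeepGap("Haus x Baum"));
        System.out.println(lettersOnly("Haus, 2 Baum!"));
    }
}
```

With an empty replacement, Ad. 1 turns `Haus x Baum` into `HausBaum`, while the variant yields `Haus Baum`. Ad. 2 is quite aggressive: it also drops spaces and digits, so `Haus, 2 Baum!` becomes `HausBaum`.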
This is the regex I finally used to solve the problem:
result = result.replaceAll("[^a-zA-Z0-9öÖäÄüÜß@\\s]", "");
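For reuse, the final regex could be wrapped in a small helper along these lines. This is only a sketch: `OcrCleanup` and `clean` are hypothetical names, and collapsing the leftover whitespace is an extra step I am assuming is useful, since removing characters tends to leave runs of blanks and empty lines behind. The umlauts are again written as Unicode escapes to avoid depending on the source file encoding.

```java
import java.util.regex.Pattern;

final class OcrCleanup {
    // Same character class as the regex above: anything that is not a
    // letter, digit, umlaut, sharp s, '@' or whitespace gets removed.
    private static final Pattern DISALLOWED = Pattern.compile(
            "[^a-zA-Z0-9\u00f6\u00d6\u00e4\u00c4\u00fc\u00dc\u00df@\\s]");

    // Assumed extra step (not in the original): runs of whitespace,
    // including line breaks, are collapsed to a single space.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    static String clean(String ocrText) {
        String kept = DISALLOWED.matcher(ocrText).replaceAll("");
        return WHITESPACE_RUN.matcher(kept).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Made-up sample in the style of the stray characters shown earlier
        String sample = "| '(\u00b0Ul\")\n_} \u00b0\nHallo Welt ...__+_";
        System.out.println(clean(sample));
    }
}
```

One side effect worth knowing about: `.` is not in the allowed set, so although `@` survives, an address such as `max@example.de` comes out as `max@examplede`. If dots matter, they would have to be added to the character class, at the cost of keeping the dotted leader lines as well.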
Thank you.