
Java: How to remove all characters in a String except a-z, digits and German characters

I am working on a Spring MVC application into which we are currently integrating OCR functionality. OCR engines tend to emit stray characters for misdetections and when there is an image in the background. After preprocessing the image, the data we get is considerably better, but some errors remain. We would like to post-process the output as follows:

  1. Remove all single characters from the output String.
  2. Remove all characters other than A-Z, a-z and the German characters, i.e. äöü, ÄÖÜ and ß.
  3. Spaces and digits should be left untouched.

Code:

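    // Imports needed by this snippet (ITesseract and Tesseract come from Tess4J)
    import java.awt.Graphics2D;
    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.math.BigInteger;
    import javax.imageio.ImageIO;
    import net.sourceforge.tess4j.ITesseract;
    import net.sourceforge.tess4j.Tesseract;

    File imageFile = new File(fileLocation);

    // Convert the image to pure black and white before running OCR
    BufferedImage img = ImageIO.read(imageFile);
    BufferedImage blackNWhite = new BufferedImage(img.getWidth(), img.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
    Graphics2D graphics = blackNWhite.createGraphics();
    graphics.drawImage(img, 0, 0, null);
    graphics.dispose();
    String blackAndWhiteImage = zipLocation + new BigInteger(130, random).toString(32) + ".png";
    File outputfile = new File(blackAndWhiteImage);
    ImageIO.write(blackNWhite, "png", outputfile);

    ITesseract instance = new Tesseract();
    // Point to one folder above tessdata directory, must contain training data
    instance.setDatapath("/usr/share/tesseract-ocr/");
    // ISO 639-3 language code
    instance.setLanguage("deu");
    String result = instance.doOCR(outputfile);
    //System.out.println(result);
    // Strip every non-ASCII character from the OCR output
    result = result.replaceAll("\\P{ASCII}", "");
    System.out.println("Result is " + result);
    return result;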

Thank you.

Update

Stray characters left behind by the regex:

 |
| '(°Ul") 
_} °
=# '
( )
...................................__+_......_._._.__._._._+._._.
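
A minimal sketch of why characters like these can survive: the `\\P{ASCII}` pattern used in the code above deletes only non-ASCII characters, so ASCII punctuation such as `|`, `_` or `#` is left alone, while the German letters are removed. The class name `AsciiFilterDemo` and the sample string are made up for illustration.

```java
public class AsciiFilterDemo {
    // Same replacement as in the question: delete every non-ASCII character.
    static String stripNonAscii(String text) {
        return text.replaceAll("\\P{ASCII}", "");
    }

    public static void main(String[] args) {
        // Hypothetical OCR fragment, not taken from a real scan.
        String sample = "Größe | 12_cm #";
        System.out.println(stripNonAscii(sample));
    }
}
```

For this input the method returns `Gre | 12_cm #`: the umlaut and the ß are gone, the punctuation is still there.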

Ad. 1.
result = result.replaceAll("\\s[a-zA-ZöÖäÄüÜß]\\s", "");
Ad. 2.
result = result.replaceAll("[^a-zA-ZöÖäÄüÜß]", "");
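
One thing to watch with the whitespace-delimited pattern in Ad. 1: it consumes the whitespace on both sides of the letter, so the neighbouring words end up glued together, and a single letter at the very start or end of the text is not matched. A possible variant, sketched here under the assumption that digits count as part of a word, uses lookarounds instead. The class name `SingleLetterFilter` and the space-squeezing step are my own additions, not part of the original answer.

```java
public class SingleLetterFilter {
    // Characters that count as part of a word (assumption: letters from the question plus digits).
    private static final String WORD = "a-zA-Z0-9öÖäÄüÜß";
    // A letter with no word character directly before or after it.
    private static final String LONE_LETTER =
            "(?<![" + WORD + "])[a-zA-ZöÖäÄüÜß](?![" + WORD + "])";

    static String removeLoneLetters(String text) {
        // Lookarounds consume no whitespace, so neighbouring words stay separated.
        String withoutLoneLetters = text.replaceAll(LONE_LETTER, "");
        // Optional: squeeze the runs of spaces left behind by the removal.
        return withoutLoneLetters.replaceAll(" {2,}", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(removeLoneLetters("Haus a Garten x 5"));
    }
}
```

With the sample input this gives `Haus Garten 5`. Lone digits are kept on purpose, since the question asks for digits to be left untouched. If the spaces must stay exactly as the OCR produced them, drop the second `replaceAll` and the `trim()`.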

This is the regex I finally used to solve this problem:

result = result.replaceAll("[^a-zA-Z0-9öÖäÄüÜß@\\s]", "");
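
For anyone reusing this, a small sketch of the same whitelist wrapped in a precompiled `Pattern`, which avoids recompiling the regex for every OCR result. The class name `OcrWhitelist`, the method name and the sample line are hypothetical; only the character class is taken from the line above.

```java
import java.util.regex.Pattern;

public class OcrWhitelist {
    // The whitelist from the post: anything NOT in this set is deleted.
    private static final Pattern DISALLOWED =
            Pattern.compile("[^a-zA-Z0-9öÖäÄüÜß@\\s]");

    static String keepAllowed(String text) {
        return DISALLOWED.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical OCR line mixing real text with the kind of noise shown in the update.
        String noisy = "Straße 12 |_} ° =# info@example.de ...__+_";
        System.out.println(keepAllowed(noisy));
    }
}
```

Note that `\\s` keeps line breaks and tabs as well as spaces, and that `@` is whitelisted but `.` is not, so an address such as `info@example.de` comes out as `info@examplede`. If e-mail addresses matter, the dot would have to be added to the character class too.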

Thank you.


 