
Using Tesseract from Tika: the result contains line breaks only

I am trying to parse a PNG file containing scanned text using Apache Tika and Tesseract for Windows.

Although running Tesseract from the command line recognises the text correctly, the content returned by Tika contains only line breaks ("\n").

This is my code:

import java.io.ByteArrayInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.sax.BodyContentHandler;

// document.getFileContent() returns the file's bytes
ByteArrayInputStream inputstream = new ByteArrayInputStream(document.getFileContent());
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE); // large write limit, to process long files
Metadata metadata = new Metadata();

// Point Tika at the local Tesseract installation
ParseContext parseContext = new ParseContext();
TesseractOCRConfig config = new TesseractOCRConfig();
config.setTesseractPath("C:\\Program Files (x86)\\Tesseract-OCR");
config.setTessdataPath("C:\\Program Files (x86)\\Tesseract-OCR\\tessdata");
config.setMaxFileSizeToOcr(Integer.MAX_VALUE);
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(Parser.class, parser);

parser.parse(inputstream, handler, metadata, parseContext);

String contentString = handler.toString();
System.out.println(contentString);

I debugged the code and found that TesseractOCRParser.doOcr() is supposed to start a process that executes a command like this:

tesseract C:\Users\admin\AppData\Local\Temp\apache-tika-6655676641285964446.tmp C:\Users\admin\AppData\Local\Temp\apache-tika-2151149415666715558.tmp -l eng -psm 1 txt

However, it looks like the process does not run. If I run the same command from another session, the text is recognised and written out.
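One way to check whether such a command really starts and what it reports is to launch it directly with the JDK's ProcessBuilder and capture its exit code and error output. The following is only a minimal sketch, not what Tika does internally: the class name TesseractProbe and its methods are hypothetical, the flags are copied from the command quoted above, and it assumes that tesseract is on the PATH.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for reproducing the OCR command outside Tika.
class TesseractProbe {

    // Builds the same argument list as the command quoted in the question.
    static List<String> buildCommand(String exe, String input, String outputBase, String lang) {
        return new ArrayList<>(Arrays.asList(
                exe, input, outputBase, "-l", lang, "-psm", "1", "txt"));
    }

    // Runs the command and collects stdout and stderr together,
    // so that any error message from the process ends up in the log.
    static int run(List<String> command, StringBuilder log)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout
        Process process = pb.start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                log.append(line).append(System.lineSeparator());
            }
        }
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: TesseractProbe <input image> <output base name>");
            return;
        }
        // Assumption: "tesseract" can be resolved via the PATH.
        List<String> command = buildCommand("tesseract", args[0], args[1], "eng");
        StringBuilder log = new StringBuilder();
        int exitCode = run(command, log);
        System.out.println("exit code: " + exitCode);
        System.out.print(log);
    }
}
```

A non-zero exit code, or an error message in the captured output, would suggest that the process starts but fails, rather than not running at all.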

I have found that the problem was in this line:

config.setTessdataPath("C:\\Program Files (x86)\\Tesseract-OCR\\tessdata");

Omit this line; the parser will then find the right path by itself.
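The post does not explain why leaving out the tessdata path helps. Before relying on the parser to locate the data by itself, it may be worth confirming that the language file is where one would expect it. The sketch below assumes, based on the paths in the question, that the data lives in a tessdata folder directly under the Tesseract install directory; the class name TessdataCheck and its method are hypothetical and use only the JDK.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-flight check; it is not part of Tika or Tesseract.
class TessdataCheck {

    // Returns true if <installDir>/tessdata/<lang>.traineddata is a regular file.
    static boolean hasLanguageData(Path installDir, String lang) {
        Path data = installDir.resolve("tessdata").resolve(lang + ".traineddata");
        return Files.isRegularFile(data);
    }

    public static void main(String[] args) {
        // The default is the install directory from the question; pass another as the first argument.
        Path installDir = Paths.get(
                args.length > 0 ? args[0] : "C:\\Program Files (x86)\\Tesseract-OCR");
        System.out.println("eng data found: " + hasLanguageData(installDir, "eng"));
    }
}
```

If the check fails, the install directory passed to setTesseractPath is probably not the one that contains the language data.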
