Using Tesseract from Tika: the result contains line breaks only

Question

I am trying to parse a PNG file containing scanned text using Apache Tika and Tesseract for Windows.

Although running Tesseract from the command line recognises the text correctly, the content returned by Tika contains only line breaks ("\n").

This is my code:

ByteArrayInputStream inputstream = new ByteArrayInputStream(document.getFileContent());
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE); // raise the write limit to process long files
Metadata metadata = new Metadata();

ParseContext parseContext = new ParseContext();
TesseractOCRConfig config = new TesseractOCRConfig();
config.setTesseractPath("C:\\Program Files (x86)\\Tesseract-OCR");
config.setTessdataPath("C:\\Program Files (x86)\\Tesseract-OCR\\tessdata");
config.setMaxFileSizeToOcr(Integer.MAX_VALUE);
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(Parser.class, parser);

parser.parse(inputstream, handler, metadata, parseContext);

String contentString = handler.toString();
System.out.println(contentString);

While debugging, I found that TesseractOCRParser.doOcr() is supposed to start a process that executes a command like this:

tesseract C:\Users\admin\AppData\Local\Temp\apache-tika-6655676641285964446.tmp C:\Users\admin\AppData\Local\Temp\apache-tika-2151149415666715558.tmp -l eng -psm 1 txt

However, it looks like the process does not run. If I run the same command from another session, the text is recognised correctly.
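One way to check whether the process really fails to start, or starts and then exits with an error, is to launch the same command from Java and print its exit code and output. The following is only a minimal sketch using the standard ProcessBuilder API, not Tika's own code. The class name TesseractProbe and its methods are hypothetical, and the arguments simply mirror the command shown above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika: runs Tesseract directly
// so that its exit code and messages become visible.
class TesseractProbe {

    // Builds the same argument list as the command seen in the debugger.
    static List<String> buildCommand(String tesseractDir, String input, String outputBase) {
        String exe = Paths.get(tesseractDir, "tesseract").toString();
        return Arrays.asList(exe, input, outputBase, "-l", "eng", "-psm", "1", "txt");
    }

    // Starts the process, echoes stdout and stderr, and returns the exit code.
    // An IOException here means the executable could not be started at all.
    static int run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout
        Process process = builder.start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println("[tesseract] " + line);
            }
        }
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 3) {
            System.out.println("usage: TesseractProbe <tesseractDir> <inputImage> <outputBase>");
            return;
        }
        int exitCode = run(buildCommand(args[0], args[1], args[2]));
        System.out.println("exit code: " + exitCode);
    }
}
```

A non-zero exit code or an error message in the output would suggest that the process does start but Tesseract itself rejects its configuration.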

Answer 1

I found that the problem was this line:

config.setTessdataPath("C:\\Program Files (x86)\\Tesseract-OCR\\tessdata");

Omit this line and the parser will find the right path on its own.
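Leaving the tessdata path unset relies on Tesseract locating its own language data. A quick sanity check is sketched below, assuming the usual layout in which language files are named <lang>.traineddata and sit in a tessdata folder under the install directory. The class TessdataCheck is hypothetical, and the default path is just the one from the question.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: checks that a language file exists where
// Tesseract is assumed to look for it by default.
class TessdataCheck {

    // True if <installDir>/tessdata/<lang>.traineddata is a regular file.
    static boolean hasLanguage(Path installDir, String lang) {
        return Files.isRegularFile(
                installDir.resolve("tessdata").resolve(lang + ".traineddata"));
    }

    public static void main(String[] args) {
        // Install directory taken from the question unless one is passed in.
        Path installDir = Paths.get(
                args.length > 0 ? args[0] : "C:\\Program Files (x86)\\Tesseract-OCR");
        System.out.println("eng language data found: " + hasLanguage(installDir, "eng"));
    }
}
```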
