Tesseract OCR not working in Java on Linux

Question

I deployed a WAR file to my server, with Java running on the backend. I'm trying to get Tesseract working from Java on CentOS, but it fails there, even though the same code works perfectly on my Windows localhost. The code I have is:

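import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.ocr.TesseractOCRParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

private void doOCR(File file) throws IOException, SAXException, TikaException // file is the image
{
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();

    TesseractOCRConfig config = new TesseractOCRConfig();
    // TESSERACT_PATH is C://Tesseract-ocr on Windows and /usr/local/bin on Linux
    config.setTesseractPath(TESSERACT_PATH);
    context.set(TesseractOCRConfig.class, config);

    TesseractOCRParser tessParser = new TesseractOCRParser();
    // try-with-resources closes the stream even if parsing throws
    try (InputStream stream = new FileInputStream(file)) {
        tessParser.parse(stream, handler, metadata, context);
    }
    System.out.println(handler.toString()); // prints the extracted text
}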

This code works on Windows but not on Linux. I can run Tesseract from the command line on the Linux server, though, and the output file contains the correct text; it just won't work when called from Java. Is there anything I am missing here? Thanks!

Answer 1

OK, I figured out my problem. On Linux, the Tesseract files were spread across several locations (some in etc/tomcat6, some in var/lib/tomcat6, and so on). On my Windows machine, all the files are stored in one folder (Tesseract-ocr). I had the path set to the tesseract executable on both machines, but I also needed all of the Tesseract data files to be in that same location. Moving them there fixed the problem.
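As a rough way to confirm the layout before calling the parser, here is a minimal sketch that checks whether a single folder holds both the executable and the data files. The class name TesseractLayoutCheck, the executable names (tesseract / tesseract.exe), and the tessdata subfolder with *.traineddata files are assumptions about a typical install, not something taken from the code above or from the Tika API, so adjust them to match your server.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports what is missing from a single-folder Tesseract install.
public class TesseractLayoutCheck {

    /** Returns a list of problems; an empty list means the folder looks complete. */
    public static List<String> findProblems(Path installDir) throws IOException {
        List<String> problems = new ArrayList<>();

        // Assumed executable names: "tesseract" on Linux, "tesseract.exe" on Windows.
        boolean hasExecutable = Files.isRegularFile(installDir.resolve("tesseract"))
                || Files.isRegularFile(installDir.resolve("tesseract.exe"));
        if (!hasExecutable) {
            problems.add("no tesseract executable in " + installDir);
        }

        // Assumed data layout: a "tessdata" subfolder holding *.traineddata files.
        Path tessdata = installDir.resolve("tessdata");
        if (!Files.isDirectory(tessdata)) {
            problems.add("no tessdata directory in " + installDir);
        } else {
            try (DirectoryStream<Path> models =
                         Files.newDirectoryStream(tessdata, "*.traineddata")) {
                if (!models.iterator().hasNext()) {
                    problems.add("no *.traineddata files in " + tessdata);
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the Linux path mentioned in the question; pass another path as an argument.
        Path installDir = Paths.get(args.length > 0 ? args[0] : "/usr/local/bin");
        List<String> problems = findProblems(installDir);
        if (problems.isEmpty()) {
            System.out.println("Layout looks complete: " + installDir);
        } else {
            for (String problem : problems) {
                System.out.println(problem);
            }
        }
    }
}
```

Running this as the same user that runs Tomcat is probably the most useful check, since file permissions can differ from your shell account. If your Tika version's TesseractOCRConfig offers a setTessdataPath method, pointing it at the data folder may be an alternative to moving files, but check the Javadoc for your release first.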

solution1
0 2015-07-23 12:32:22
