Python OCR Module in Linux?

Question

I want to find an easy-to-use OCR Python module for Linux. I found pytesser (http://code.google.com/p/pytesser/), but it contains a .exe executable file.

I tried changing the code to use Wine, and it works, but it's too slow and not a good idea.

Are there any Linux alternatives that are as easy to use?

Answer 1

You can just wrap tesseract in a function:

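import os
import subprocess
import tempfile

def ocr(path):
    # Reserve a unique base name; tesseract writes its output to <base>.txt.
    temp = tempfile.NamedTemporaryFile(delete=False)
    temp.close()

    # Run "tesseract <image> <output base>" and wait for it to finish.
    process = subprocess.Popen(['tesseract', path, temp.name], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    process.communicate()

    with open(temp.name + '.txt', 'r') as handle:
        contents = handle.read()

    # Remove both tesseract's output file and the placeholder file.
    os.remove(temp.name + '.txt')
    os.remove(temp.name)

    return contents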

If you want document segmentation and more advanced features, try OCRopus.
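The wrapper above ignores tesseract's exit status and always uses the default language. A minimal sketch of a variation follows, assuming the tesseract command-line tool is on the PATH. The names build_command and ocr_checked are made up for this example; -l is tesseract's language option.

```python
import os
import shutil
import subprocess
import tempfile


def build_command(image_path, output_base, lang=None):
    """Return the argument list for one run of the tesseract CLI."""
    command = ['tesseract', image_path, output_base]
    if lang is not None:
        # -l selects the language data to use, e.g. 'eng'.
        command += ['-l', lang]
    return command


def ocr_checked(image_path, lang=None):
    """Run tesseract and return the recognized text; raise if it fails."""
    # A private temporary directory avoids clashes on the output name.
    workdir = tempfile.mkdtemp()
    output_base = os.path.join(workdir, 'result')
    try:
        process = subprocess.Popen(
            build_command(image_path, output_base, lang),
            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        log, _ = process.communicate()
        if process.returncode != 0:
            raise RuntimeError('tesseract failed: %r' % log)
        # tesseract appends '.txt' to the output base name.
        with open(output_base + '.txt') as handle:
            return handle.read()
    finally:
        # Remove the directory and everything tesseract wrote into it.
        shutil.rmtree(workdir, ignore_errors=True)
```

Calling ocr_checked('scan.png', lang='eng') would then return the text or raise RuntimeError with tesseract's own log output; 'scan.png' is only a placeholder file name.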

Answer 2

In addition to Blender's answer, which simply runs the Tesseract executable, I would like to add that there are other OCR alternatives that can also be called as an external process.

ABBYY command-line OCR utility: http://ocr4linux.com/en:start

It is not free, so it is worth considering only if Tesseract's accuracy is not good enough for your task, if you need more sophisticated layout analysis, or if you need to export PDF, Word, and other file formats.

Update: here's a comparison of ABBYY and Tesseract accuracy: http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison

Disclaimer: I work for ABBYY

Answer 3

python-tesseract

http://code.google.com/p/python-tesseract

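# Legacy OpenCV 2.x bindings (the cv2.cv module was removed in OpenCV 3).
import cv2.cv as cv
import tesseract

api = tesseract.TessBaseAPI()
# Arguments: data path, language, OCR engine mode.
api.Init(".", "eng", tesseract.OEM_DEFAULT)
api.SetPageSegMode(tesseract.PSM_AUTO)

# Load the image as grayscale and hand it to the Tesseract API.
image = cv.LoadImage("eurotext.jpg", cv.CV_LOAD_IMAGE_GRAYSCALE)
tesseract.SetCvImage(image, api)
text = api.GetUTF8Text()     # recognized text
conf = api.MeanTextConf()    # mean confidence of the recognized text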

Answer 4

You should try the excellent scikits.learn library for machine learning. You can find two ready-to-run examples here and here.

Answer 5

You have a bunch of options here.

One way, as others pointed out, is to use Tesseract. There are a number of wrappers by now, so the best approach is a quick PyPI search. The most used ones these days are:

Another useful site for finding similar engines is alternative.to. A few Linux-based systems, according to them, are:

ABBYY
Tesseract
CuneiForm
OCRopus
GOCR
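To see which of these engines are already installed on a machine, something like the following sketch may help. It assumes Python 3.3 or later (for shutil.which), and the executable names in CANDIDATE_ENGINES are assumptions that can differ between distributions and versions. ABBYY and OCRopus are left out because their command names vary by release.

```python
import shutil

# Assumed executable names; adjust them for your distribution.
CANDIDATE_ENGINES = {
    'Tesseract': 'tesseract',
    'CuneiForm': 'cuneiform',
    'GOCR': 'gocr',
}


def find_installed_engines(candidates=CANDIDATE_ENGINES):
    """Map each engine name to the path of its executable, if found."""
    found = {}
    for name, executable in candidates.items():
        # shutil.which returns None when the command is not on the PATH.
        location = shutil.which(executable)
        if location is not None:
            found[name] = location
    return found


if __name__ == '__main__':
    for name, location in sorted(find_installed_engines().items()):
        print('%s: %s' % (name, location))
```

Any engine found this way can be driven through subprocess in the same manner as the Tesseract wrapper in Answer 1, although each tool has its own command-line arguments.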

5 answers

solution1
16 ACCEPTED 2011-04-27 05:56:37

solution2
11 2011-04-27 07:14:11

solution3
6 2012-08-13 18:06:45

solution4
1 2012-05-23 20:20:41

solution5
0 2014-11-20 17:09:01
