Tesseract OCR on AWS Lambda via virtualenv

Question

I have spent all week on this, so this is a bit of a Hail Mary.

I am trying to package Tesseract OCR for AWS Lambda running on Python (I am also using Pillow for image pre-processing, hence the choice of Python).

I understand how to deploy Python packages to AWS using virtualenv, but I cannot find a way to deploy the actual Tesseract OCR binary into the environment (e.g. /env/).

Doing pip install py-tesseract successfully deploys the Python wrapper into /env/, but the wrapper relies on a separate (local) install of Tesseract.
Doing pip install tesseract-ocr gets only so far before it fails with the error below, which I assume is due to a missing leptonica dependency. However, I have no idea how to package leptonica into /env/ (if that is even possible).

tesseract_ocr.cpp:264:10: fatal error: 'leptonica/allheaders.h' file not found
#include "leptonica/allheaders.h"

Downloading the 0.9.1 python-tesseract egg file from https://bitbucket.org/3togo/python-tesseract/downloads and running easy_install also fails while looking for dependencies:

Processing dependencies for python-tesseract==0.9.1
Searching for python-tesseract==0.9.1
Reading https://pypi.python.org/simple/python-tesseract/
Couldn't find index page for 'python-tesseract' (maybe misspelled?)
Scanning index of all packages (this may take a while)
Reading https://pypi.python.org/simple/
No local packages or download links found for python-tesseract==0.9.1

Any pointers would be greatly appreciated.

Answer 1

It is not working because these Python packages are only wrappers around tesseract. You have to compile tesseract on an Amazon Linux instance and copy the binaries and libraries into the zip file of the Lambda function.
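Because the wrappers only shell out to a tesseract executable, a quick way to see whether one is reachable is to search PATH for it. The following is a minimal sketch, not part of the original answer; the function names find_tesseract and describe_install are made up for illustration.

```python
import shutil


def find_tesseract(binary_name="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    # shutil.which searches the directories listed in PATH, much like a shell does.
    return shutil.which(binary_name)


def describe_install(binary_name="tesseract"):
    """Build a one-line status message about the executable."""
    path = find_tesseract(binary_name)
    if path is None:
        return "{} not found on PATH; the Python wrapper has nothing to call".format(binary_name)
    return "{} found at {}".format(binary_name, path)
```

Calling print(describe_install()) inside a fresh virtualenv or a Lambda function shows whether the wrapper has a binary to call.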

1) Start an EC2 instance with 64-bit Amazon Linux;

2) Install dependencies:

sudo yum install gcc gcc-c++ make
sudo yum install autoconf automake
sudo yum install libtool
sudo yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel

3) Compile and install leptonica:

cd ~
mkdir leptonica
cd leptonica
wget http://www.leptonica.com/source/leptonica-1.73.tar.gz
tar -zxvf leptonica-1.73.tar.gz
cd leptonica-1.73
./configure
make
sudo make install

4) Compile and install tesseract:

cd ~
mkdir tesseract
cd tesseract
wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
tar -zxvf 3.04.01.tar.gz
cd tesseract-3.04.01
./autogen.sh
./configure
make
sudo make install

5) Download language traineddata to tessdata:

cd /usr/local/share/tessdata
sudo wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata
export TESSDATA_PREFIX=/usr/local/share/
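
Once the language data is in place, the build can be sanity-checked from Python by asking the binary for its version. This is a rough sketch under assumptions: the helper names parse_version and installed_version are hypothetical, and the parser only expects a line of the form "tesseract <version>" somewhere in the output.

```python
import re
import subprocess


def parse_version(version_output):
    """Extract the version string from the text printed by `tesseract --version`."""
    match = re.search(r"^tesseract\s+(\S+)", version_output, re.MULTILINE)
    return match.group(1) if match else None


def installed_version(binary="tesseract"):
    """Run the binary and return its version, or None if it cannot be started."""
    try:
        # Some releases print the version on stderr, so both streams are merged.
        result = subprocess.run(
            [binary, "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except FileNotFoundError:
        return None
    return parse_version(result.stdout)
```

A None result from installed_version() suggests the binary is missing from PATH or printed something unexpected.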

At this point you should be able to use tesseract on this EC2 instance. To use tesseract in a Lambda function, you need to copy some files from this instance into the zip file you upload to Lambda. I'll post all the commands to build a zip file with all the files you need.

6) Zip everything you need to run tesseract on Lambda:

cd ~
mkdir tesseract-lambda
cd tesseract-lambda
cp /usr/local/bin/tesseract .
mkdir lib
cd lib
cp /usr/local/lib/libtesseract.so.3 .
cp /usr/local/lib/liblept.so.5 .
cp /usr/lib64/libpng12.so.0 .
cd ..

mkdir tessdata
cd tessdata
cp /usr/local/share/tessdata/eng.traineddata .
cd ..

# zip from inside the folder so that tesseract, lib/ and tessdata/ sit at the root of the archive
zip -r ~/tesseract-lambda.zip .
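
Before uploading, it can help to confirm that the archive really contains the files copied above. The sketch below is illustrative only: missing_entries is a hypothetical helper, and the required list simply mirrors the cp commands in this step. Depending on how zip was invoked, the files may sit at the archive root or under a top-level tesseract-lambda/ folder, which the prefix argument accounts for.

```python
import zipfile

# File names taken from the copy commands in step 6.
REQUIRED_ENTRIES = (
    "tesseract",
    "lib/libtesseract.so.3",
    "lib/liblept.so.5",
    "lib/libpng12.so.0",
    "tessdata/eng.traineddata",
)


def missing_entries(zip_path, required=REQUIRED_ENTRIES, prefix=""):
    """Return the required entries that are absent from the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        names = set(archive.namelist())
    return [name for name in required if prefix + name not in names]
```

If missing_entries("tesseract-lambda.zip") is empty, the files are at the root, which is where a handler that resolves paths relative to its own directory will look for them.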

The tesseract-lambda.zip file has everything Lambda needs to run tesseract. The last thing to do is add the Lambda function at the root of the zip file and upload it to Lambda. Here is an example that I have not tested, but it should work.

7) Create a file named main.py, write a Lambda function like the one below, and add it to the root of tesseract-lambda.zip:

import os
import subprocess
from urllib.parse import unquote_plus  # Python 3 location of unquote_plus

import boto3

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
LIB_DIR = os.path.join(SCRIPT_DIR, 'lib')

s3 = boto3.client('s3')

def lambda_handler(event, context):

    # Get the bucket and object from the event
    bucket = event['Records'][0]['s3']['bucket']['name']
    # Object keys arrive URL-encoded in S3 event notifications
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])

    try:
        print("Bucket: " + bucket)
        print("Key: " + key)

        imgfilepath = '/tmp/image.png'
        # tesseract appends ".txt" to the output base it is given
        outputbase = '/tmp/result'
        resultfilepath = outputbase + '.txt'
        exportfile = key + '.txt'

        print("Export: " + exportfile)

        s3.download_file(bucket, key, imgfilepath)

        command = 'LD_LIBRARY_PATH={} TESSDATA_PREFIX={} {}/tesseract {} {}'.format(
            LIB_DIR,
            SCRIPT_DIR,
            SCRIPT_DIR,
            imgfilepath,
            outputbase,
        )

        try:
            # Merge stderr into stdout so tesseract's messages end up in the log
            output = subprocess.check_output(command, shell=True, stderr=subprocess.STDOUT)
            print(output)
            s3.upload_file(resultfilepath, bucket, exportfile)
        except subprocess.CalledProcessError as e:
            print(e.output)

    except Exception as e:
        print(e)
        print('Error processing object {} from bucket {}.'.format(key, bucket))
        raise

When creating the AWS Lambda function in the AWS Console, upload the zip file and set the Handler to main.lambda_handler. This tells AWS Lambda to look for the main.py file inside the zip and to call the function lambda_handler.
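
To try the event parsing locally before wiring up an S3 trigger, a stripped-down event can be built by hand. This is only a sketch: make_s3_event and extract_bucket_and_key are hypothetical helpers, the bucket and key names are invented, and a real S3 notification carries many more fields than the two used here.

```python
from urllib.parse import unquote_plus


def make_s3_event(bucket, encoded_key):
    """Build the minimal part of an S3 notification that the handler reads."""
    return {
        "Records": [
            {"s3": {"bucket": {"name": bucket}, "object": {"key": encoded_key}}}
        ]
    }


def extract_bucket_and_key(event):
    """Return (bucket, key) with the URL-encoded key decoded."""
    record = event["Records"][0]["s3"]
    # Keys are URL-encoded in notifications, with spaces sent as "+".
    return record["bucket"]["name"], unquote_plus(record["object"]["key"])
```

Passing make_s3_event("my-bucket", "scans/page+one.png") to extract_bucket_and_key shows the decoded key that would be downloaded.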

IMPORTANT

From time to time things change in AWS Lambda's environment. For example, the current image for the Lambda environment is amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2 (it might not be this one when you read this answer). If tesseract starts to return a segmentation fault, run "ldd tesseract" inside the Lambda function and check the output to see which libraries are needed (currently libtesseract.so.3, liblept.so.5 and libpng12.so.0).
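
The ldd check can be scripted so that the missing libraries show up directly in the function's log. The following is a sketch under assumptions: missing_libraries and check_binary are hypothetical names, and the parser relies on ldd marking unresolved libraries with "=> not found".

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the shared libraries that ldd reports as unresolved."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # Lines look like "<tab>libfoo.so.1 => not found".
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(binary_path, lib_dir):
    """Run ldd on the binary with lib_dir on the library search path."""
    result = subprocess.run(
        ["ldd", binary_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        env={"LD_LIBRARY_PATH": lib_dir},
    )
    return missing_libraries(result.stdout)
```

Any name returned by check_binary is a candidate for copying into the lib folder of the zip file.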

Thanks for the comment, SergioArcos.

Answer 2

Adaptations for Tesseract 4:

Tesseract offers many improvements in version 4, thanks to a neural network. I've tried it with some scans and the improvements are quite substantial. Plus, the whole package was 25% smaller in my case. The planned release date of version 4 is the first half of 2018.

The build steps are similar to those for Tesseract 3, with some tweaks, which is why I wanted to share them in full. I also made a GitHub repo with ready-made binary files (most of it is based on Jose's post above, which was very helpful), plus a blog post on how to use it as a processing step after a Raspberry Pi 3 powered scanner step.

To compile the Tesseract 4 binaries, follow these steps on a fresh 64-bit AWS AMI instance:

Compile leptonica

cd ~
sudo yum install clang -y
sudo yum install libpng-devel libtiff-devel zlib-devel libwebp-devel libjpeg-turbo-devel -y
wget https://github.com/DanBloomberg/leptonica/releases/download/1.75.1/leptonica-1.75.1.tar.gz
tar -xzvf leptonica-1.75.1.tar.gz
cd leptonica-1.75.1
./configure && make && sudo make install

Compile autoconf-archive

Unfortunately, for the past few weeks tesseract has required autoconf-archive, which is not available for Amazon AMIs, so you need to compile it yourself:

cd ~
wget http://mirror.switch.ch/ftp/mirror/gnu/autoconf-archive/autoconf-archive-2017.09.28.tar.xz
tar -xvf autoconf-archive-2017.09.28.tar.xz
cd autoconf-archive-2017.09.28
./configure && make && sudo make install
sudo cp m4/* /usr/share/aclocal/

Compile tesseract

cd ~
sudo yum install git-core libtool pkgconfig -y
git clone --depth 1  https://github.com/tesseract-ocr/tesseract.git tesseract-ocr
cd tesseract-ocr
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
./autogen.sh
./configure
make
sudo make install

Get all needed files and zip

cd ~
mkdir tesseract-standalone
cd tesseract-standalone
cp /usr/local/bin/tesseract .
mkdir lib
cp /usr/local/lib/libtesseract.so.4 lib/
cp /usr/local/lib/liblept.so.5 lib/
cp /usr/lib64/libjpeg.so.62 lib/
cp /usr/lib64/libwebp.so.4 lib/
cp /usr/lib64/libstdc++.so.6 lib/
mkdir tessdata
cd tessdata
wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/osd.traineddata
wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata
# additionally any other language you want to use, e.g. `deu` for Deutsch
mkdir configs
cp /usr/local/share/tessdata/configs/pdf configs/
cp /usr/local/share/tessdata/pdf.ttf .
cd ..
zip -r ~/tesseract-standalone.zip *
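
A handler can then call the bundled binary with the library and language-data paths resolved against the unpacked zip. The sketch below only builds the command and environment; build_invocation is a hypothetical helper, the directory layout is the one produced by the commands above, and passing --tessdata-dir rather than setting TESSDATA_PREFIX is a choice made here, not something the answer prescribes.

```python
import os


def build_invocation(base_dir, image_path, output_base, lang="eng", configs=()):
    """Return (argv, env) for running the bundled tesseract binary.

    base_dir is the folder holding tesseract, lib/ and tessdata/.
    configs may name config files such as "pdf" (copied into tessdata/configs).
    """
    argv = [
        os.path.join(base_dir, "tesseract"),
        image_path,
        output_base,  # tesseract adds the file extension itself
        "-l", lang,
        "--tessdata-dir", os.path.join(base_dir, "tessdata"),
    ]
    argv.extend(configs)

    env = dict(os.environ)
    lib_dir = os.path.join(base_dir, "lib")
    existing = env.get("LD_LIBRARY_PATH")
    # Put the bundled libraries first so they win over system copies.
    env["LD_LIBRARY_PATH"] = lib_dir if not existing else lib_dir + os.pathsep + existing
    return argv, env
```

The pair can be handed to subprocess.run(argv, env=env); with configs=("pdf",) the run should produce a PDF next to the output base instead of a text file.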

Answer 3

Generate zip files using shell scripts that compile Tesseract 4 for Python 3.7

I struggled with this issue for a few days, trying to get Tesseract 4 to work in a Python 3.7 Lambda function. Finally I found this article and GitHub repo, which describe how to generate zip files for tesseract, pytesseract, opencv, and pillow using shell scripts that build the necessary .zip files with Docker images on EC2! The process takes less than 20 minutes using these steps and is reliably reproducible.

Summarized Steps:

Start an Amazon Linux EC2 instance (a t2.micro will do just fine)

sudo yum update
sudo yum install git-core -y
sudo yum install docker -y
sudo service docker start
sudo usermod -a -G docker ec2-user #allows ec2-user to call docker

After running the 5th command you will need to log out and log back in for the change to take effect.

git clone https://github.com/amtam0/lambda-tesseract-api.git
cd lambda-tesseract-api/
bash build_tesseract4.sh #takes a few minutes
bash build_py37_pkgs.sh

This will generate .zip files for tesseract, pytesseract, pillow, and opencv. To use them with Lambda you need to complete two more steps.

Create Lambda layers, one for each zip file, and attach the layers to your Lambda function.
Create an environment variable. Key: PYTHONPATH and Value: /opt/

(Note: you will probably need to increase your Memory allocation and Timeout)
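
These two steps can also be scripted with boto3 instead of the console. This is a hedged sketch rather than the article's method: publish_layers and attach_layers are hypothetical helpers, the layer names are simply derived from the zip file names, and the memory and timeout defaults are placeholders to adjust. Uploading zip bytes directly is subject to Lambda's request size limit, so large archives may need to go through S3 instead, and the environment update shown here replaces any variables already set on the function.

```python
import os


def publish_layers(lambda_client, zip_paths, runtime="python3.7"):
    """Publish one layer per zip file and return the layer version ARNs."""
    arns = []
    for zip_path in zip_paths:
        # Assumption: name each layer after its zip file, e.g. tesseract.zip -> tesseract.
        layer_name = os.path.splitext(os.path.basename(zip_path))[0]
        with open(zip_path, "rb") as handle:
            response = lambda_client.publish_layer_version(
                LayerName=layer_name,
                Content={"ZipFile": handle.read()},
                CompatibleRuntimes=[runtime],
            )
        arns.append(response["LayerVersionArn"])
    return arns


def attach_layers(lambda_client, function_name, layer_arns, memory_mb=1024, timeout_s=60):
    """Attach the layers, set PYTHONPATH, and raise memory and timeout."""
    return lambda_client.update_function_configuration(
        FunctionName=function_name,
        Layers=layer_arns,
        Environment={"Variables": {"PYTHONPATH": "/opt/"}},
        MemorySize=memory_mb,
        Timeout=timeout_s,
    )
```

With lambda_client = boto3.client("lambda"), the two calls would be publish_layers(lambda_client, [...]) followed by attach_layers(lambda_client, "my-function", arns), where the zip paths and function name are your own.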

At this point you are all set to upload your code and start using Tesseract on AWS Lambda! Refer back to the Medium article for a test script.

Answer 4

Check this Medium article on how to set up Tesseract 4.0.0 in Lambda using Docker. It also shows how to convert Python packages into layers.

Answer 5

Note that wget http://www.leptonica.com/source/leptonica-1.73.tar.gz does not work. They've moved to leptonica.org, so use wget http://www.leptonica.org/source/leptonica-1.83.0.tar.gz

Answers (5)

Answer 1: score 51, accepted, 2016-03-01 13:58:02
Answer 2: score 8, 2018-02-09 08:11:13
Answer 3: score 4, 2020-03-09 20:19:23
Answer 4: score 3, 2020-02-29 16:49:38
Answer 5: score 0
