How to define tesseract_cmd to use Tesseract-OCR in AWS Lambda Functions
I am using AWS to process images and extract text with Tesseract and Python. In my backend, I uploaded the pytesseract library and the Tesseract-OCR folder. Locally it works very well, and I don't even need to change tesseract_cmd for it to find tesseract.exe.
When I upload this folder to AWS Lambda, it returns a TesseractNotFound error saying that tesseract is not installed or it's not in your PATH. I already tried changing tesseract_cmd, but that did not solve it.
My folder structure is /opt/python/lib/python3.7/site-packages, and inside site-packages I have my libraries (Pillow, pytesseract, Tesseract-OCR). I also tried creating a new Lambda function using two other suggested approaches, but neither worked.
I think I can solve it using environment variables, but I have no idea how to do it.
If someone knows a better way that works, I will accept that as an answer too.
Solving this error took several steps, but in the end it works. As noted in the comments, AWS Lambda runs in a Linux environment, so the libraries have to be compiled for Linux. I don't have a Linux machine, so I followed these steps:
You can skip step 1 by downloading the files here.
1 - (If you don't have a Linux machine) I started an EC2 instance with the Amazon Linux AMI; a basic instance is enough. On the instance, run:
sudo yum update
sudo yum install git-core -y
sudo yum install docker -y
sudo service docker start
sudo usermod -a -G docker ec2-user #It will allow ec2-user to call docker
After the last command, disconnect from your EC2 instance and reconnect so that the new docker group membership takes effect. Then run:
git clone https://github.com/amtam0/lambda-tesseract-api.git
cd lambda-tesseract-api/
bash build_tesseract4.sh #It will take some time
bash build_py37_pkgs.sh
When the scripts finish, the lambda-tesseract-api folder will contain the zipped files you need. In my case, I created a GitHub repository, uploaded all the files there, and then downloaded them to my computer to create my Lambda layers.
2 - After downloading the files, upload the zip files to Lambda Layers one by one (open-cv, Pillow, tesseract, pytesseract), and then attach the layers to your Lambda function to run tesseract.
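Before uploading, it can help to check what each zip will look like once Lambda extracts it under /opt. The sketch below uses only the standard library. The function names are hypothetical, and it assumes the usual layer convention that Python packages sit under a top-level python/ directory (the Python runtimes look for packages in /opt/python).

```python
import zipfile


def top_level_entries(zip_path):
    """Return the sorted top-level names found inside a layer zip."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted({name.split("/", 1)[0] for name in zf.namelist() if name})


def looks_like_python_layer(zip_path):
    """True if the zip has a top-level entry named 'python'."""
    return "python" in top_level_entries(zip_path)
```

If a zip that is meant to carry Python packages does not pass this check, the packages will probably not be importable from the layer without extra path configuration.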
This is the lambda_handler function you will create to make tesseract work. (oem, psm and lang are tesseract parameters; you can learn more here.)
import base64
import pytesseract


def ocr(img, oem=None, psm=None, lang=None):
    # Only pass the options that were provided, so a missing value
    # is not sent to tesseract as the literal string "None".
    options = []
    if oem is not None:
        options.append('--oem {}'.format(oem))
    if psm is not None:
        options.append('--psm {}'.format(psm))
    if lang is not None:
        options.append('-l {}'.format(lang))
    config = ' '.join(options)
    ocr_text = pytesseract.image_to_string(img, config=config)
    return ocr_text


def lambda_handler(event, context):
    # Extract content from the JSON body
    body_image64 = event['image64']
    oem = event["tess-params"]["oem"]
    psm = event["tess-params"]["psm"]
    lang = event["tess-params"]["lang"]
    # Decode the input image and save it to /tmp
    with open("/tmp/saved_img.png", "wb") as f:
        f.write(base64.b64decode(body_image64))
    # Run OCR on the saved image
    ocr_text = ocr("/tmp/saved_img.png", oem=oem, psm=psm, lang=lang)
    # Return the result data in JSON format
    return {
        "ocr": ocr_text,
    }
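To try the handler, you need an event in the shape it reads: a base64 string under image64 and the three options under tess-params. A minimal sketch for building one is shown here. The helper name build_event and the default values (oem=1, psm=3, lang="eng") are illustrative assumptions, not part of the original answer.

```python
import base64
import json


def build_event(image_bytes, oem=1, psm=3, lang="eng"):
    """Build an event dict with the keys the handler reads."""
    return {
        # base64 text is plain ASCII, so it is safe to embed in JSON
        "image64": base64.b64encode(image_bytes).decode("ascii"),
        "tess-params": {"oem": oem, "psm": psm, "lang": lang},
    }


def event_as_json(image_bytes, **params):
    """Serialize the event, e.g. to paste into a Lambda test event."""
    return json.dumps(build_event(image_bytes, **params))
```

For example, calling event_as_json with the bytes read from a local image file gives a JSON string that can be used as the test payload.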
You will also need to set one environment variable on the Lambda function. The key is PYTHONPATH and the value is /opt/
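If the TesseractNotFound error persists, it can be useful to check where the binary actually is and to set tesseract_cmd explicitly, which is what the question asks about. The following is a minimal sketch, not the original author's method. The candidate path /opt/bin/tesseract is an assumption about how the tesseract layer is laid out, and the helper names are hypothetical; pytesseract.pytesseract.tesseract_cmd is the attribute pytesseract itself exposes for this.

```python
import os
import shutil


def find_tesseract(candidates=("/opt/bin/tesseract",)):
    """Return a path to an executable tesseract binary, or None."""
    found = shutil.which("tesseract")  # searches the directories on PATH
    if found:
        return found
    # Fall back to explicit locations (assumed layer layout)
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


def configure_pytesseract(candidates=("/opt/bin/tesseract",)):
    """Point pytesseract at the binary when both are available."""
    path = find_tesseract(candidates)
    if path is None:
        return None
    try:
        import pytesseract
    except ImportError:
        return path  # binary found, but pytesseract is not importable
    pytesseract.pytesseract.tesseract_cmd = path
    return path
```

Calling configure_pytesseract() at the top of the handler module and logging its return value shows whether the layer's binary is visible to the function.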
References:
https://medium.com/analytics-vidhya/build-tesseract-serverless-api-using-aws-lambda-and-docker-in-minutes-dd97a79b589b
Tesseract OCR on AWS Lambda via virtualenv (Alex Albracht's answer)