
How to define tesseract_cmd to use Tesseract-OCR in AWS Lambda Functions

I am using AWS to process images and extract text with Tesseract and Python. In my backend, I uploaded the pytesseract library and the Tesseract-OCR folder. Locally it works very well, and I don't even need to change tesseract_cmd to find tesseract.exe. When I upload this folder to AWS Lambda, it raises a TesseractNotFound error saying that tesseract is not installed or it's not in your PATH. I already tried changing tesseract_cmd, but I could not solve it.

My folder structure is /opt/python/lib/python3.7/site-packages, and inside site-packages I have my libraries (Pillow, pytesseract, Tesseract-OCR). I also tried creating a new Lambda function using two other options, but neither worked. I think I can solve it using environment variables, but I have no idea how to do it.

(image: error)

(image: my folder structure)

If someone knows a better way to do this that works, I will accept that as an answer too.

To solve this error I had to do a number of things, but in the end it works. As was pointed out in the comments, AWS Lambda runs in a Linux environment, so you need to compile the libraries for Linux. In my case, I did not have a Linux machine to do it, so I followed these steps:

You can skip step 1 by just downloading the files here.

1 - (If you don't have a Linux machine) I started an EC2 instance with the Amazon Linux AMI; a basic instance works very well.

sudo yum update
sudo yum install git-core -y
sudo yum install docker -y
sudo service docker start
sudo usermod -a -G docker ec2-user  # allows ec2-user to run docker without sudo

After the last command has run, you need to start a new session on your EC2 instance (just disconnect and reconnect) so that the group change takes effect.

git clone https://github.com/amtam0/lambda-tesseract-api.git
cd lambda-tesseract-api/
bash build_tesseract4.sh #It will take some time
bash build_py37_pkgs.sh

After that, you will have a folder (lambda-tesseract-api) containing zipped packages with all the files you need. In my case, I created a GitHub repository, uploaded all the files there, and then downloaded them to my computer to create my Lambda Layers.
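
Before uploading, it can help to check what each zip will place under /opt once it is attached as a layer. The following is a minimal sketch using only the standard library; the helper name top_level_entries and the file name in the usage note are illustrative assumptions, not part of the original answer.

```python
import zipfile


def top_level_entries(zip_path):
    """Return the sorted top-level names found inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        # Each member name looks like "bin/tesseract"; keep only the first path component.
        return sorted({name.split("/", 1)[0] for name in archive.namelist() if name})
```

For example, top_level_entries("tesseract.zip") lists the directories that would appear directly under /opt, which is where Lambda unpacks layer contents.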

2 - After downloading the files, upload the zip files to your Layers one by one (open-cv, Pillow, tesseract, pytesseract), and then use the layers in your Lambda Function to run tesseract.
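
The upload can also be scripted instead of done through the console. This is a minimal sketch, assuming boto3 is installed and AWS credentials are configured. The helper names layer_request and publish_layers, the default runtime, and the choice to name each layer after its zip file are illustrative assumptions. Direct ZipFile uploads are size-limited, so larger archives may need to go through S3 instead.

```python
from pathlib import Path


def layer_request(zip_path, runtime="python3.7"):
    """Build the keyword arguments for publish_layer_version from a zip file."""
    path = Path(zip_path)
    return {
        "LayerName": path.stem,  # assumption: layer is named after the zip file
        "Content": {"ZipFile": path.read_bytes()},
        "CompatibleRuntimes": [runtime],
    }


def publish_layers(zip_paths):
    """Publish each zip as a Lambda layer and return the new layer version ARNs."""
    import boto3  # imported lazily so layer_request can be used without boto3

    client = boto3.client("lambda")
    arns = []
    for zip_path in zip_paths:
        response = client.publish_layer_version(**layer_request(zip_path))
        arns.append(response["LayerVersionArn"])
    return arns
```

The returned ARNs are what you attach to the Lambda Function's layer configuration.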

This is the lambda_handler function you will create to make tesseract work. (oem, psm and lang are tesseract parameters; you can learn more here.)

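import base64
import pytesseract


def ocr(img, oem=None, psm=None, lang=None):
    # Build the tesseract config string from the parameters that were provided,
    # so that a missing value is not passed to tesseract as the string "None"
    options = []
    if oem is not None:
        options.append('--oem {}'.format(oem))
    if psm is not None:
        options.append('--psm {}'.format(psm))
    if lang is not None:
        options.append('-l {}'.format(lang))
    return pytesseract.image_to_string(img, config=' '.join(options))


def lambda_handler(event, context):
    # Extract content from the JSON body
    body_image64 = event['image64']
    oem = event["tess-params"]["oem"]
    psm = event["tess-params"]["psm"]
    lang = event["tess-params"]["lang"]

    # Decode the input image and save it to /tmp (the writable directory in Lambda)
    with open("/tmp/saved_img.png", "wb") as f:
        f.write(base64.b64decode(body_image64))

    # Run OCR on the saved image
    ocr_text = ocr("/tmp/saved_img.png", oem=oem, psm=psm, lang=lang)

    # Return the result data in JSON format
    return {
        "ocr": ocr_text,
    }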
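
To try the handler, you need an event in the shape it reads: a base64 string under image64 and the three values under tess-params. The sketch below builds such an event. The helper name build_event and the default values (oem=1, psm=3, lang="eng") are only illustrative assumptions.

```python
import base64
import json


def build_event(image_bytes, oem=1, psm=3, lang="eng"):
    """Build a test event matching the keys the handler above reads."""
    return {
        # base64 output is bytes, so decode it to a str to keep the event JSON-serializable
        "image64": base64.b64encode(image_bytes).decode("ascii"),
        "tess-params": {"oem": oem, "psm": psm, "lang": lang},
    }


def event_as_json(image_bytes, **params):
    """Serialize the event, e.g. to paste it into a Lambda console test."""
    return json.dumps(build_event(image_bytes, **params))
```

Reading an image file in binary mode and passing its bytes to event_as_json gives a payload that can be used as a test event.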

You will also need to set one environment variable. The key will be PYTHONPATH and the value will be /opt/
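
If pytesseract still cannot find the binary, tesseract_cmd can be set explicitly, which is what the question originally asked about. This is a hedged sketch, not the method used in the answer above. The TESSERACT_CMD variable name and the /opt/bin/tesseract candidate path are assumptions (layers are unpacked under /opt, but the exact location depends on how the zip was built). The only pytesseract name used is pytesseract.pytesseract.tesseract_cmd.

```python
import os
import shutil


def _is_executable(path):
    return os.path.isfile(path) and os.access(path, os.X_OK)


def resolve_tesseract_cmd(env_var="TESSERACT_CMD", candidates=("/opt/bin/tesseract",)):
    """Return the path to a tesseract executable, or None if none is found."""
    # 1. Explicit override through a (hypothetical) environment variable.
    override = os.environ.get(env_var)
    if override and _is_executable(override):
        return override
    # 2. Whatever is already reachable through PATH.
    found = shutil.which("tesseract")
    if found:
        return found
    # 3. Assumed layer locations.
    for path in candidates:
        if _is_executable(path):
            return path
    return None


def configure_pytesseract():
    """Point pytesseract at the resolved binary and return its path."""
    import pytesseract  # imported lazily so the lookup can be used on its own

    cmd = resolve_tesseract_cmd()
    if cmd is None:
        raise RuntimeError("tesseract binary not found")
    pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

Calling configure_pytesseract() once at module load, before the first image_to_string call, keeps the lookup out of the per-request path.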

Reference:

https://medium.com/analytics-vidhya/build-tesseract-serverless-api-using-aws-lambda-and-docker-in-minutes-dd97a79b589b

Tesseract OCR on AWS Lambda via virtualenv (Alex Albracht's answer)
