How to install Poppler to be used on AWS Lambda

I have to run pdf2image in my Python Lambda function on AWS, but it requires poppler and poppler-utils to be installed on the machine.

I have searched in many different places for how to do that, but could not find anything, or anyone who has done it with Lambda functions.

Does anyone know how to generate the poppler binaries, put them in my Lambda package, and tell Lambda to use them?

Thank you all.

AWS Lambda runs in an execution environment that includes a set of software and libraries. If something you need is not included, you have to build it and ship it with your function. Check the link below for more info: https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html

For poppler, follow these steps to create your own binary: https://github.com/skylander86/lambda-text-extractor/blob/master/BuildingBinaries.md

Straightforward Build Instructions for Poppler on Lambda using Docker

In order to put Poppler on Lambda, we will build a zipped folder containing poppler and add it as a layer. Follow these steps on an EC2 instance running Amazon Linux 2 (a t2.micro is plenty).

  1. Set up the machine

Install Docker on the EC2 machine. Instructions here. Then create a folder for the binaries:

mkdir -p poppler_binaries
  2. Create a Dockerfile

Use this link or copy/paste from below.

FROM ubuntu:18.04

# Install poppler-utils and its runtime libraries, plus locate to find them
RUN apt-get update && apt-get install -y locate \
                       libopenjp2-7 \
                       poppler-utils

# Collect the binaries and the shared libraries they need into one folder
RUN rm -rf /poppler_binaries;  mkdir /poppler_binaries;
RUN updatedb
RUN cp $(locate libpoppler.so) /poppler_binaries/.
RUN cp $(which pdftoppm) /poppler_binaries/.
RUN cp $(which pdfinfo) /poppler_binaries/.
RUN cp $(which pdftocairo) /poppler_binaries/.
RUN cp $(locate libjpeg.so.8 ) /poppler_binaries/.
RUN cp $(locate libopenjp2.so.7 ) /poppler_binaries/.
RUN cp $(locate libpng16.so.16 ) /poppler_binaries/.
RUN cp $(locate libz.so.1 ) /poppler_binaries/.
  3. Build the Docker image and create a zip file

Running the commands below will produce a zip file, poppler.zip, in the directory you run them from.

docker build -t poppler-build .
# Run the container
docker run -d --name poppler-build-cont poppler-build sleep 20
# Copy the binaries out of the container
sudo docker cp poppler-build-cont:/poppler_binaries .
# Cleaning up
docker kill poppler-build-cont
docker rm poppler-build-cont
docker image rm poppler-build
# Zip the folder contents into poppler.zip in the parent directory
cd poppler_binaries
zip -r9 ../poppler.zip .
cd ..
  4. Make and add your Lambda Layer

Download your zip file or upload it to S3. Head to the Lambda Console page to create a Layer and then add it to your function. Information about layers here.

  5. Add an environment variable to Lambda

To avoid adding unnecessary folder structure to the zip, as described here, we will add an environment variable that points to our dependency:

PYTHONPATH: /opt/

And voilà! You now have a working Lambda function with Poppler!

Note: Credit to these two articles, which helped me piece this together.

Warning: do not try to add pdf2image to the same layer. I am not sure why, but when they are in the same layer, pdf2image cannot find poppler.
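
If pdf2image still cannot find Poppler, it can help to look at what the layer actually unpacked. Below is a minimal diagnostic sketch using only the standard library. The function name `list_layer_files` is hypothetical, and the handler assumes the layer contents end up under /opt.

```python
import os


def list_layer_files(root="/opt"):
    """Return the sorted paths (relative to root) of every file under root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, root))
    return sorted(found)


def lambda_handler(event, context):
    # Temporary debugging handler: report what the layer placed under /opt.
    return {"layer_files": list_layer_files("/opt")}
```

Invoking the function once with this handler shows whether the binaries sit directly under /opt or in a subfolder, which tells you which path to hand to pdf2image.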

Hi @Alex Albracht, thanks for the compiled easy instructions! They helped a lot. But I really struggled to get the Lambda function to find the poppler path, so I'll try to add to that here to make it clear.

The binary files should go in a zip file with the structure poppler.zip -> bin/poppler, where the poppler folder contains the binary files. This zip file can then be uploaded as a layer in AWS Lambda.
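
The layout described above can also be produced from Python instead of arranging folders by hand. This is only a sketch: `build_layer_zip` is a hypothetical helper, and it assumes the binaries were already copied into a local folder such as poppler_binaries.

```python
import os
import zipfile


def build_layer_zip(binaries_dir, zip_path, prefix="bin/poppler"):
    """Zip every regular file in binaries_dir under prefix/ inside the archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(binaries_dir)):
            src = os.path.join(binaries_dir, name)
            if os.path.isfile(src):
                # arcname sets the path stored in the zip, independent of
                # where the file lives on disk.
                zf.write(src, arcname=f"{prefix}/{name}")
    return zip_path


if __name__ == "__main__":
    # Assumed folder name; adjust to wherever the binaries were copied.
    if os.path.isdir("poppler_binaries"):
        build_layer_zip("poppler_binaries", "poppler.zip")
```

Because the layer is unpacked under /opt, files stored as bin/poppler/... in the archive end up in /opt/bin/poppler at runtime.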

For pdf2image to work, it needs the poppler path. Pass it in the Lambda function as "/opt/bin/poppler", since that is where the layer folder ends up.

For example:

poppler_path = "/opt/bin/poppler"
pages = convert_from_path(PDF_file, 500, poppler_path=poppler_path)
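
Put together, a handler built around that call might look like the sketch below. The `pdf_path` event field and the `resolve_poppler_path` helper are assumptions made for illustration, and the PDF is assumed to be on local disk already (for example under /tmp).

```python
import os

# Where the bin/poppler folder of the layer is unpacked at runtime.
POPPLER_PATH = "/opt/bin/poppler"


def resolve_poppler_path(candidate=POPPLER_PATH):
    """Return candidate if it is an existing directory, otherwise None.

    With poppler_path=None, pdf2image looks for the tools on PATH instead.
    """
    return candidate if os.path.isdir(candidate) else None


def lambda_handler(event, context):
    # Imported here so the module still loads where pdf2image is missing.
    from pdf2image import convert_from_path

    pdf_file = event["pdf_path"]  # hypothetical event field
    pages = convert_from_path(pdf_file, 500, poppler_path=resolve_poppler_path())
    return {"page_count": len(pages)}
```

The second positional argument is the DPI, so 500 gives large images; a lower value uses less memory and time if the function runs close to its limits.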

My approach was to use the Amazon Linux 2 image as a base to ensure maximum compatibility with the Lambda environment, compile openjpeg and poppler in the container build, and build a zip containing the binaries and libraries needed, which can then be used as a layer.

This enables you to write your code in its own Lambda, which pulls in the poppler dependencies as a layer, simplifying build and deployment.

The contents of the layer will be unpacked into /opt/. This means the contents are automatically available, because by default in the Lambda environment:

  • $PATH is /usr/local/bin:/usr/bin/:/bin:/opt/bin
  • $LD_LIBRARY_PATH is /lib64:/usr/lib64:$LAMBDA_RUNTIME_DIR:$LAMBDA_RUNTIME_DIR/lib:$LAMBDA_TASK_ROOT:$LAMBDA_TASK_ROOT/lib:/opt/lib
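
One way to confirm that the search paths above really pick up the layer is to resolve the tools from inside the function. The sketch below uses `shutil.which` from the standard library; the helper name `find_poppler_tools` and the choice of tools to look for are assumptions.

```python
import os
import shutil


def find_poppler_tools(tools=("pdftoppm", "pdfinfo"), path=None):
    """Map each tool name to its resolved location, or None if not found.

    path=None makes shutil.which search the PATH environment variable.
    """
    return {tool: shutil.which(tool, path=path) for tool in tools}


def lambda_handler(event, context):
    # Temporary debugging handler: report the search paths and what they find.
    return {
        "PATH": os.environ.get("PATH", ""),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", ""),
        "tools": find_poppler_tools(),
    }
```

If the tools resolve to /opt/bin, pdf2image should be able to find them without an explicit poppler_path argument.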

Dockerfile:

# https://www.petewilcock.com/using-poppler-pdftotext-and-other-custom-binaries-on-aws-lambda/

ARG POPPLER_VERSION="21.10.0"
ARG POPPLER_DATA_VERSION="0.4.11"
ARG OPENJPEG_VERSION="2.4.0"


FROM amazonlinux:2

ARG POPPLER_VERSION
ARG POPPLER_DATA_VERSION
ARG OPENJPEG_VERSION

WORKDIR /root

RUN yum update -y
RUN yum install -y \
   cmake \
   cmake3 \
   fontconfig-devel \
   gcc \
   gcc-c++ \
   gzip \
   libjpeg-devel \
   libpng-devel \
   libtiff-devel \
   make \
   tar \
   xz \
   zip

RUN curl -o poppler.tar.xz https://poppler.freedesktop.org/poppler-${POPPLER_VERSION}.tar.xz
RUN tar xf poppler.tar.xz
RUN curl -o poppler-data.tar.gz https://poppler.freedesktop.org/poppler-data-${POPPLER_DATA_VERSION}.tar.gz
RUN tar xf poppler-data.tar.gz
RUN curl -o openjpeg.tar.gz https://codeload.github.com/uclouvain/openjpeg/tar.gz/refs/tags/v${OPENJPEG_VERSION}
RUN tar xf openjpeg.tar.gz

WORKDIR poppler-data-${POPPLER_DATA_VERSION}
RUN make install

WORKDIR /root
RUN mkdir openjpeg-${OPENJPEG_VERSION}/build
WORKDIR openjpeg-${OPENJPEG_VERSION}/build
RUN cmake .. -DCMAKE_BUILD_TYPE=Release
RUN make
RUN make install

WORKDIR /root
RUN mkdir poppler-${POPPLER_VERSION}/build
WORKDIR poppler-${POPPLER_VERSION}/build
RUN cmake3 .. -DCMAKE_BUILD_TYPE=release -DBUILD_GTK_TESTS=OFF -DBUILD_QT5_TESTS=OFF -DBUILD_QT6_TESTS=OFF \
    -DBUILD_CPP_TESTS=OFF -DBUILD_MANUAL_TESTS=OFF -DENABLE_BOOST=OFF -DENABLE_CPP=OFF -DENABLE_GLIB=OFF \
    -DENABLE_GOBJECT_INTROSPECTION=OFF -DENABLE_GTK_DOC=OFF -DENABLE_QT5=OFF -DENABLE_QT6=OFF \
    -DENABLE_LIBOPENJPEG=openjpeg2 -DENABLE_CMS=none  -DBUILD_SHARED_LIBS=OFF
RUN make
RUN make install


WORKDIR /root
RUN mkdir -p package/{lib,bin,share}
RUN cp -d /usr/lib64/libexpat* package/lib
RUN cp -d /usr/lib64/libfontconfig* package/lib
RUN cp -d /usr/lib64/libfreetype* package/lib
RUN cp -d /usr/lib64/libjbig* package/lib
RUN cp -d /usr/lib64/libjpeg* package/lib
RUN cp -d /usr/lib64/libpng* package/lib
RUN cp -d /usr/lib64/libtiff* package/lib
RUN cp -d /usr/lib64/libuuid* package/lib
RUN cp -d /usr/lib64/libz* package/lib
RUN cp -rd /usr/local/lib/* package/lib
RUN cp -rd /usr/local/lib64/* package/lib
RUN cp -d /usr/local/bin/* package/bin
RUN cp -rd /usr/local/share/poppler package/share

WORKDIR package
RUN zip -r9 ../package.zip *

And to run:

docker build -t poppler .
docker run --name poppler -d -t poppler cat
docker cp poppler:/root/package.zip .

Then upload package.zip as a layer using the console or the AWS CLI.
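
As a third option besides the console and the CLI, the layer can be published from Python with boto3. This is a sketch under assumptions: the layer name "poppler" and both helper names are made up for illustration, boto3 and AWS credentials must be available, and the archive must be small enough for a direct upload (a larger one would have to go through S3 instead).

```python
def build_publish_request(zip_path, layer_name="poppler"):
    """Build the keyword arguments for Lambda's PublishLayerVersion call."""
    with open(zip_path, "rb") as f:
        content = f.read()
    return {
        "LayerName": layer_name,
        "Description": "Poppler binaries and shared libraries",
        "Content": {"ZipFile": content},
    }


def publish_layer(zip_path, layer_name="poppler"):
    # Imported lazily so the helper above works without boto3 installed.
    import boto3

    client = boto3.client("lambda")
    response = client.publish_layer_version(
        **build_publish_request(zip_path, layer_name)
    )
    return response["LayerVersionArn"]
```

Calling `publish_layer("package.zip")` returns the ARN of the new layer version, which is the value to attach to the function that uses Poppler.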
