
How to use AWS Lambda to convert PDF files to .txt with Python

I need to automate the conversion of many PDF files to text files using AWS Lambda with Python 3.7.

I've successfully converted PDF files using poppler/pdftotext, tika, and PyPDF2 on my own machine. However, tika either times out or needs a Java instance running on a host machine, which I'm not sure how to set up. pdftotext needs poppler, and all the solutions for running that on Lambda seem to be outdated, or I'm just not familiar enough with binaries to make sense of them. PyPDF2 seems the most promising, but testing it throws an error.

The code and error I'm getting for PyPDF2 is as follows:

pdf_file = open(s3.Bucket(my_bucket).download_file('test.pdf','test.pdf'),'rb')

  "errorMessage": "[Errno 30] Read-only file system: 'test.pdf.3F925aC8'",
  "errorType": "OSError",



And if I try to reference the file directly by URL:
pdf_file = open('https://s3.amazonaws.com/' + my_bucket + '/test.pdf', 'rb')

  "errorMessage": "[Errno 2] No such file or directory: 'https://s3.amazonaws.com/my_bucket/test.pdf'",
  "errorType": "FileNotFoundError",

As the error states, you are trying to write to a read-only filesystem. The download_file method tries to save the file to the relative path 'test.pdf', and that write fails. Try using download_fileobj together with an in-memory binary buffer (e.g. io.BytesIO) instead, then feed that stream to PyPDF2.

Example:

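import io
from PyPDF2 import PdfFileReader
[...]

# Use a binary buffer: download_fileobj writes bytes, which io.StringIO rejects
pdf_stream = io.BytesIO()
object.download_fileobj(pdf_stream)
pdf_stream.seek(0)  # rewind to the start before handing the buffer to the reader
pdf_obj = PdfFileReader(pdf_stream)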

[...]
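For reference, the in-memory approach could be fleshed out roughly as follows. This is only a sketch under some assumptions: the event keys "bucket" and "key" and the helper names are hypothetical, PyPDF2 is assumed to be a pre-3.0 release (the PdfFileReader API used in the question) bundled with the function, and the PDF is assumed to be small enough to fit in the function's memory.

```python
import io


def download_pdf_to_memory(s3_client, bucket, key):
    """Download an S3 object into an in-memory binary buffer."""
    buffer = io.BytesIO()  # binary buffer; io.StringIO would reject bytes
    s3_client.download_fileobj(bucket, key, buffer)
    buffer.seek(0)  # rewind so a reader starts at the first byte
    return buffer


def pdf_stream_to_text(pdf_stream):
    """Extract text from every page using PyPDF2's legacy (pre-3.0) API."""
    # Imported lazily; assumes PyPDF2 < 3.0 is packaged with the function.
    from PyPDF2 import PdfFileReader

    reader = PdfFileReader(pdf_stream)
    return "\n".join(
        reader.getPage(i).extractText() for i in range(reader.getNumPages())
    )


def lambda_handler(event, context):
    # Hypothetical event shape: {"bucket": "...", "key": "..."}
    import boto3

    s3_client = boto3.client("s3")
    stream = download_pdf_to_memory(s3_client, event["bucket"], event["key"])
    return {"text": pdf_stream_to_text(stream)}
```

Nothing here touches the filesystem, so the read-only error cannot occur. Passing the client in as an argument also makes the download step easy to try out locally with a stand-in object.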

AWS Lambda only lets you write to the /tmp folder, so you should download the file and put it there.
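A minimal sketch of the /tmp approach, assuming a boto3 S3 client is passed in; the helper names tmp_path_for and download_pdf_to_tmp are hypothetical and only serve the illustration:

```python
import os


def tmp_path_for(key, tmp_dir="/tmp"):
    """Map an S3 key to a path inside Lambda's writable /tmp directory."""
    return os.path.join(tmp_dir, os.path.basename(key))


def download_pdf_to_tmp(s3_client, bucket, key, tmp_dir="/tmp"):
    """Download the object to /tmp and return the local path."""
    local_path = tmp_path_for(key, tmp_dir)
    # download_file returns None, so keep track of the path ourselves.
    s3_client.download_file(bucket, key, local_path)
    return local_path
```

The returned path can then be opened with open(local_path, 'rb'). Note that download_file returns None, so the pattern in the question, which passes its return value straight to open(), would fail even on a writable filesystem.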

