How to pass “pdf” & “html” files as “events” in AWS Lambda via API Gateway?

I am trying to pass a "pdf" or "html" file directly into a Lambda function, but I don't understand the format in which it should be received.

E.g., I was able to understand how to pass "image" files to Lambda functions using the following code. But how do I send a pdf or html file?

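import base64

import pytesseract
from PIL import Image


def write_to_file(save_path, data):
    # The request body arrives as base64 text; decode it before writing
    with open(save_path, "wb") as f:
        f.write(base64.b64decode(data))


def ocr(img):
    # The language is selected with `lang`; `config` is for extra Tesseract flags
    ocr_text = pytesseract.image_to_string(img, lang="eng")
    return ocr_text


def lambda_handler(event, context=None):
    write_to_file("/tmp/photo.jpg", event["body"])
    im = Image.open("/tmp/photo.jpg")
    try:
        ocr_text = ocr(im)
    except Exception as e:
        print(e)
        # Fall back to an empty result so the return below never hits an undefined name
        ocr_text = ""

    # Return the result data in json format
    return {
        "statusCode": 200,
        "body": ocr_text
    }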

Edit: I am trying to pass the "pdf" or "html" directly through API gateway (binary) and not through S3.

You can use API Gateway content type conversions.

You can refer to this documentation
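
To make the conversion concrete: with a Lambda proxy integration, API Gateway base64-encodes a request body whose content type matches a configured binary media type and sets `isBase64Encoded` to true in the event. A minimal sketch of reading the body in either case follows; the helper name `get_body_bytes` is made up for illustration and is not part of any AWS library.

```python
import base64


def get_body_bytes(event):
    """Return the request body as raw bytes (hypothetical helper)."""
    body = event.get("body") or ""
    if event.get("isBase64Encoded"):
        # Binary payloads arrive as base64 text
        return base64.b64decode(body)
    # Plain-text payloads arrive as an ordinary string
    return body.encode("utf-8")
```

Checking the flag rather than always decoding lets the same handler work both for real uploads and for plain-text test events from the console.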

Thanks. After a lot of online searching and trial and error, I was able to find the answer for html files. A similar approach should work for pdf as well.

import base64
import json

from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta','table', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def lambda_handler(event, context):
    # This will work for testing purpose only
    #soup = BeautifulSoup(event["body"], "html.parser")

    # This will work when you actually upload files
    file_upload = base64.b64decode(event["body"])
    soup = BeautifulSoup(file_upload, "html.parser")
    print(soup)
    # find_all(string=True) returns every text node (findAll and text= are older spellings)
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    full_text = " ".join(t.strip() for t in visible_texts)

    return {
        "statusCode": 200,
        "body": json.dumps(full_text)
    }

Additionally, in API Gateway you need to make the following 2 changes:

  1. Add */* in Binary Media Types
  2. Under Method Response - Add "Content-Type" = "application/html"
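
For the pdf case the same decoding step applies. A minimal sketch follows, assuming the two API Gateway settings above are in place so the body arrives base64-encoded. The handler name `pdf_handler`, the file name `upload.pdf`, and the `%PDF-` signature check are illustrative choices, not part of the original answer. Extracting text from the saved file would need a PDF library, which is not shown here.

```python
import base64
import json
import os
import tempfile


def pdf_handler(event, context=None):
    # Decode the base64 body that API Gateway passes for binary media types
    data = base64.b64decode(event["body"])

    # PDF files normally start with the "%PDF-" signature; reject anything else early
    if not data.startswith(b"%PDF-"):
        return {"statusCode": 400, "body": json.dumps("not a PDF file")}

    # Write to the temp directory (typically /tmp on Lambda)
    save_path = os.path.join(tempfile.gettempdir(), "upload.pdf")
    with open(save_path, "wb") as f:
        f.write(data)

    return {"statusCode": 200, "body": json.dumps({"bytes": len(data)})}
```

The saved file can then be handed to whatever PDF processing step is needed, in the same way the image example passes `/tmp/photo.jpg` to the OCR function.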
