I am trying to pass a PDF or HTML file directly to a Lambda function, but I don't understand the format in which it will be received.
For example, I worked out how to pass image files to a Lambda function using the code below. How do I send a PDF or HTML file in the same way?
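import base64

import pytesseract
from PIL import Image

def write_to_file(save_path, data):
    # The uploaded file arrives as a base64-encoded string
    with open(save_path, "wb") as f:
        f.write(base64.b64decode(data))

def ocr(img):
    # The language is passed via "lang"; "config" is for extra Tesseract flags
    ocr_text = pytesseract.image_to_string(img, lang="eng")
    return ocr_text

def lambda_handler(event, context=None):
    write_to_file("/tmp/photo.jpg", event["body"])
    im = Image.open("/tmp/photo.jpg")
    try:
        ocr_text = ocr(im)
    except Exception as e:
        print(e)
        ocr_text = ""  # avoid a NameError in the return below
    # Return the result data in JSON format
    return {
        "statusCode": 200,
        "body": ocr_text
    }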
Edit: I am trying to pass the PDF or HTML file directly through API Gateway (as binary), not through S3.
You can use API Gateway's content type conversions.
Refer to the API Gateway documentation on content type conversions.
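As a rough illustration of what the function sees after such a conversion, here is a minimal sketch. It assumes a Lambda proxy integration, where the event carries an "isBase64Encoded" flag alongside "body"; the helper name body_bytes and the two sample events are made up for local testing and are not real API Gateway output.

```python
import base64


def body_bytes(event):
    """Return the request body as raw bytes.

    Assumes a Lambda proxy integration event, in which API Gateway sets
    "isBase64Encoded" to True when it has base64-encoded a binary payload.
    """
    body = event.get("body") or ""
    if event.get("isBase64Encoded"):
        return base64.b64decode(body)
    # Text payloads are passed through as an ordinary string
    return body.encode("utf-8")


# Hand-built events for local testing (hypothetical, not captured from AWS)
binary_event = {
    "body": base64.b64encode(b"<p>hi</p>").decode("ascii"),
    "isBase64Encoded": True,
}
text_event = {"body": "<p>hi</p>", "isBase64Encoded": False}
```

Calling body_bytes on either sample event yields the same bytes, so the rest of the handler does not need to care how the payload was delivered.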
Thanks. After a lot of searching and trial and error, I found the answer for HTML files. A similar approach should work for PDF as well.
import json
import base64

from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    # Skip text that belongs to non-visible or unwanted parent tags
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', 'table', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def lambda_handler(event, context):
    # This works for testing purposes only (plain-text body)
    # soup = BeautifulSoup(event["body"], "html.parser")
    # This works when you actually upload files (base64-encoded body)
    file_upload = base64.b64decode(event["body"])
    soup = BeautifulSoup(file_upload, "html.parser")
    print(soup)
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    full_text = " ".join(t.strip() for t in visible_texts)
    return {
        "statusCode": 200,
        "body": json.dumps(full_text)
    }
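For PDF, I have not tested it, but a minimal sketch along the same lines might look like the following. It assumes the body arrives base64-encoded exactly as in the HTML handler above; the file name upload.pdf and the header check are my own additions for illustration, and the actual text extraction is left to whatever PDF library you use.

```python
import base64
import json
import os
import tempfile

PDF_MAGIC = b"%PDF-"  # well-formed PDF files begin with this header


def lambda_handler(event, context=None):
    # Assumes the body is base64-encoded, as in the HTML handler above
    pdf_bytes = base64.b64decode(event["body"])
    if not pdf_bytes.startswith(PDF_MAGIC):
        return {"statusCode": 400, "body": json.dumps("Body is not a PDF")}

    # /tmp is the writable directory in Lambda; gettempdir() resolves to it there
    save_path = os.path.join(tempfile.gettempdir(), "upload.pdf")
    with open(save_path, "wb") as f:
        f.write(pdf_bytes)

    # A PDF library of your choice could now open save_path
    return {
        "statusCode": 200,
        "body": json.dumps({"saved_to": save_path, "size": len(pdf_bytes)}),
    }
```

To try it locally, build an event by hand with the file contents base64-encoded in "body" and call lambda_handler(event) directly.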
Additionally, in API Gateway you need to make the following two changes: