
Import pyspark in AWS Lambda function

I created an ETL job in AWS Glue that writes an ORC file with only one row (it indicates whether two other files have the same row count).

Now, in my pipeline, I created an AWS Lambda function that tries to read that ORC file and check whether the row counts of both tables are equal (the ORC file, stored in S3, has a value column that is 1 if the counts differ and 0 if they don't).

In my first attempt I tried to use pandas, but Lambda gave me the error:

Unable to import module 'lambda_function': No module named

Now I'm trying to import the pyspark context (a SparkSession) and use df = spark.read.orc(), but it gives me the same error:

Unable to import module 'lambda_function': No module named 'pyspark'

What do you think? How could I instantiate a SparkSession in my Lambda function, or read the ORC file in some other way?

Thank you very much!

Unable to import module 'lambda_function': No module named 'pyspark'

This means Lambda cannot find the dependency. Upload a dependency zip file as a Layer for your Lambda.

How to create a Layer?
https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html
Or you can use a Layer built by other contributors:
https://github.com/keithrozario/Klayers
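
Note that pyspark itself is usually far too large to fit in a Lambda layer, so for a one-row check like this a lighter reader is a better fit. Below is a minimal sketch, assuming pyarrow is made available through a layer (either one you build yourself or a prebuilt one, if available) and using hypothetical bucket and key names; boto3 ships with the Lambda Python runtime.

import io

import boto3
import pyarrow.orc as orc  # assumes pyarrow is provided by a Lambda layer

s3 = boto3.client("s3")  # boto3 is bundled with the Lambda Python runtime


def lambda_handler(event, context):
    # Hypothetical bucket and key; replace with the location your Glue job writes to
    obj = s3.get_object(Bucket="my-etl-bucket", Key="checks/row_count_check.orc")
    buffer = io.BytesIO(obj["Body"].read())

    # Read the single-row ORC file into an Arrow table
    table = orc.ORCFile(buffer).read()

    # Per the question, 'value' is 1 when the row counts differ and 0 when they match
    value = table.column("value")[0].as_py()
    return {"counts_match": value == 0}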
