AWS Glue Pyspark Transformation Filter API not working
I am new to AWS Glue and Python. I am trying to apply Filter.apply to the DynamicFrame datasource0 for filter3frame. The job run fails, and the logs show that the filter_sex function is not defined. The exact error is: "NameError: filter_sex is not defined". Can anyone tell me what I am doing wrong?
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "amssurvey", table_name = "amssurvey", transformation_ctx = "datasource0")
filter1frame = Filter.apply(frame=datasource0, f=lambda x:x['citizen'] in ["US"])
filter2frame = Filter.apply(frame=datasource0, f=lambda x:x['count'] > 50)
filter3frame = Filter.apply(frame=datasource0, f=filter_sex(datasource0))
filter1_op = glueContext.write_dynamic_frame.from_options(frame = filter1frame, connection_type = "s3", connection_options = {"path": "s3://asgqatestautomation3/SourceFiles/filter1_op"}, format = "csv", transformation_ctx = "filter1_op")
filter2_op = glueContext.write_dynamic_frame.from_options(frame = filter2frame, connection_type = "s3", connection_options = {"path": "s3://asgqatestautomation3/SourceFiles/filter2_op"}, format = "csv", transformation_ctx = "filter2_op")
filter3_op = glueContext.write_dynamic_frame.from_options(frame = filter3frame, connection_type = "s3", connection_options = {"path": "s3://asgqatestautomation3/SourceFiles/filter3_op"}, format = "csv", transformation_ctx = "filter3_op")
job.commit()
def filter_sex(item):
    if item['sex'] == 'Male':
        return True
    else:
        return False
Instead of defining a function, why don't you try the code below:
filter3frame = Filter.apply(frame=datasource0, f=lambda x: x['sex'] == 'Male')
Regarding the error: filter_sex should be defined before it is used.
I fixed it.
As @QuickSilver said, every function must be defined before it is used. In addition, the Filter call on the dynamic frame must look like the line below; the filter_sex function is passed by name, without calling it with arguments.
filter3frame = Filter.apply(frame=datasource0, f=filter_sex)
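To make that distinction concrete, here is a minimal sketch (only the 'sex' field comes from the original code; the commented-out line restates the original mistake):

# Correct: pass the function object itself; Glue applies it to each record
# and keeps the records for which it returns True.
filter3frame = Filter.apply(frame=datasource0, f=filter_sex)

# Original mistake: this calls filter_sex right away with the whole DynamicFrame,
# so the name must already exist at this point (hence the NameError), and even
# then it would pass the call's result rather than the function itself.
# filter3frame = Filter.apply(frame=datasource0, f=filter_sex(datasource0))

# Equivalent inline form using a lambda:
filter3frame = Filter.apply(frame=datasource0, f=lambda record: record['sex'] == 'Male')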
So the final working code is as follows:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
def filter_sex(item):
    if item['sex'] == 'Male':
        return True
    else:
        return False
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "amssurvey", table_name = "amssurvey", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "amssurvey", table_name = "amssurvey", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("nomber", "long", "nomber", "long"), ("type", "string", "type", "string"), ("sex", "string", "sex", "string"), ("citizen", "string", "citizen", "string"), ("count", "long", "count", "long"), ("countstate", "long", "countstate", "long")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
filter1frame = Filter.apply(frame=datasource0, f=lambda x:x['citizen'] in ["US"])
filter2frame = Filter.apply(frame=datasource0, f=lambda x:x['count'] > 50)
filter3frame = Filter.apply(frame=datasource0, f=filter_sex)
filter1_op = glueContext.write_dynamic_frame.from_options(frame = filter1frame, connection_type = "s3", connection_options = {"path": "s3://asgqatestautomation3/SourceFiles/filter1_op"}, format = "csv", transformation_ctx = "filter1_op")
filter2_op = glueContext.write_dynamic_frame.from_options(frame = filter2frame, connection_type = "s3", connection_options = {"path": "s3://asgqatestautomation3/SourceFiles/filter2_op"}, format = "csv", transformation_ctx = "filter2_op")
filter3_op = glueContext.write_dynamic_frame.from_options(frame = filter3frame, connection_type = "s3", connection_options = {"path": "s3://asgqatestautomation3/SourceFiles/filter3_op"}, format = "csv", transformation_ctx = "filter3_op")
job.commit()
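Not part of the original answer, but a quick way to sanity-check the filters before writing is to count the records in each frame (count() and toDF() are standard DynamicFrame methods; the labels in the print statements are illustrative):

# Optional check: how many records each filter kept.
print("US citizens:", filter1frame.count())
print("count > 50:", filter2frame.count())
print("Male records:", filter3frame.count())
# Inspect a few filtered rows as a Spark DataFrame.
filter3frame.toDF().show(5)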