AWS Glue PySpark Filter transform API not working

Question

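I am new to AWS Glue and Python. I am trying to apply the Filter.apply function to the DynamicFrame datasource0 to produce filter3frame. The job run fails, and the log shows that the filter_sex function is not defined. The exact error is "NameError: name 'filter_sex' is not defined". Can anyone tell me what I am doing wrong?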

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "amssurvey", table_name = "amssurvey", transformation_ctx = "datasource0")


filter1frame = Filter.apply(frame=datasource0, f=lambda x:x['citizen'] in ["US"])

filter2frame = Filter.apply(frame=datasource0, f=lambda x:x['count'] > 50)

filter3frame = Filter.apply(frame=datasource0, f=filter_sex(datasource0))







filter1_op = glueContext.write_dynamic_frame.from_options(frame = filter1frame, connection_type = "s3", connection_options = {"path": "s3://asgqatestautomation3/SourceFiles/filter1_op"}, format = "csv", transformation_ctx = "filter1_op")
filter2_op = glueContext.write_dynamic_frame.from_options(frame = filter2frame, connection_type = "s3", connection_options = {"path": "s3://asgqatestautomation3/SourceFiles/filter2_op"}, format = "csv", transformation_ctx = "filter2_op")
filter3_op = glueContext.write_dynamic_frame.from_options(frame = filter3frame, connection_type = "s3", connection_options = {"path": "s3://asgqatestautomation3/SourceFiles/filter3_op"}, format = "csv", transformation_ctx = "filter3_op")
job.commit()



def filter_sex(item):
    if item['sex'] == 'Male':
        return True
    else:
        return False

Answer 1

Instead of defining a function, why not try the code below?

filter3frame = Filter.apply(frame=datasource0, f=lambda x: x['sex'] == 'Male')

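As for the NameError: filter_sex must be defined before the line that uses it, because Python runs the script from top to bottom.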
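The ordering problem can be reproduced without Glue. Below is a minimal sketch in plain Python; the records list is made-up sample data rather than the asker's table, and the one-line body of filter_sex is a shortened equivalent of the if/else in the question.

```python
# Plain-Python sketch of the ordering problem; no Glue libraries are needed.
# `records` is made-up sample data, not the asker's catalog table.
records = [{"sex": "Male"}, {"sex": "Female"}]

error_message = None
try:
    # The def statement further down has not run yet, so the name
    # filter_sex is still unbound here and the lookup raises NameError.
    filter_sex(records[0])
except NameError as err:
    error_message = str(err)

def filter_sex(item):
    # Shortened equivalent of the if/else version in the question.
    return item["sex"] == "Male"

# The same kind of call works once the def statement has been executed.
males = [r for r in records if filter_sex(r)]
```

Moving the def above its first use, as in the accepted answer, removes the error.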

Answer 2

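I fixed it.

As @QuickSilver said, every function must be defined before it is used. In addition, the Filter.apply call must be written as shown below: pass filter_sex itself, with no parentheses or arguments, so that Filter.apply can call it on each record.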

filter3frame = Filter.apply(frame=datasource0, f=filter_sex)
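The difference between f=filter_sex and f=filter_sex(datasource0) can be sketched in plain Python. Here keep_matching is a hypothetical stand-in for a filter transform, not part of the Glue API, and records is made-up sample data. The exact error raised inside a real Glue job may differ from the one shown here.

```python
def keep_matching(records, f):
    """Hypothetical stand-in for a filter transform (not the Glue API).

    Calls f once per record and keeps the records for which it returns True.
    """
    return [r for r in records if f(r)]

def filter_sex(item):
    return item["sex"] == "Male"

# Made-up sample data, not the asker's table.
records = [
    {"sex": "Male", "count": 60},
    {"sex": "Female", "count": 75},
]

# Correct: pass the function object; keep_matching calls it per record.
males = keep_matching(records, f=filter_sex)

# Incorrect: filter_sex(records) runs immediately on the whole collection,
# before keep_matching is even entered. With a list this fails because a
# list cannot be indexed with the string "sex".
call_error = None
try:
    keep_matching(records, f=filter_sex(records))
except TypeError as err:
    call_error = err
```

In the second call the predicate receives the entire collection instead of one record, which is why the argument must be the bare function name.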

So the final working code is as follows:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

def filter_sex(item):
    if item['sex'] == 'Male':
        return True
    else:
        return False



## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "amssurvey", table_name = "amssurvey", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "amssurvey", table_name = "amssurvey", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("nomber", "long", "nomber", "long"), ("type", "string", "type", "string"), ("sex", "string", "sex", "string"), ("citizen", "string", "citizen", "string"), ("count", "long", "count", "long"), ("countstate", "long", "countstate", "long")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]


filter1frame = Filter.apply(frame=datasource0, f=lambda x:x['citizen'] in ["US"])

filter2frame = Filter.apply(frame=datasource0, f=lambda x:x['count'] > 50)

filter3frame = Filter.apply(frame=datasource0, f=filter_sex)







filter1_op = glueContext.write_dynamic_frame.from_options(frame = filter1frame, connection_type = "s3", connection_options = {"path": "s3://asgqatestautomation3/SourceFiles/filter1_op"}, format = "csv", transformation_ctx = "filter1_op")
filter2_op = glueContext.write_dynamic_frame.from_options(frame = filter2frame, connection_type = "s3", connection_options = {"path": "s3://asgqatestautomation3/SourceFiles/filter2_op"}, format = "csv", transformation_ctx = "filter2_op")
filter3_op = glueContext.write_dynamic_frame.from_options(frame = filter3frame, connection_type = "s3", connection_options = {"path": "s3://asgqatestautomation3/SourceFiles/filter3_op"}, format = "csv", transformation_ctx = "filter3_op")
job.commit()

Answer 1
1 2020-04-27 08:21:50

Answer 2
1 Accepted 2020-04-27 09:10:14
