Changing the datatype of a specific column of a DynamicFrame in AWS Glue
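Greetings, experts,
I have run into a problem and need a solution. Please help me with it.
I have a dynamic frame created from XML files stored in S3.
The frame has a nested field "ReceiptNumber", and the schema of the dynamic frame is as follows: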
root
|-- Receipt: struct
| |-- Front: struct
| | |-- FrontNumber: string
| | |-- CountryorTerritoryCode: string
| | |-- TaxId: string
| |-- ReceiptAmount: double
| |-- ReceiptCurrencyCode: string
| |-- ReceiptDateCCYYMMDD: int
| |-- ReceiptNumber: double
| |-- TaxVarianceAmount: double
| |-- TransferDetails: array
| | |-- element: struct
| | | |-- BillCategoryCode: string
| | | |-- BillCategoryDetailCode: string
| | | |-- Porting: array
| | | | |-- element: struct
| | | | | |-- AddressDetails: struct
| | | | | | |-- ConsigneeAddress: struct
| | | | | | | |-- Address: struct
| | | | | | | | |-- AddressText2: string
| | | | | | | | |-- CityName: string
| | | | | | | | |-- CountryorTerritoryCode: string
| | | | | | | | |-- PostalCode: string
| | | | | | | | |-- StateCode: string
| | | | | | | | |-- StreetAddress: string
| | | | | | | |-- Addressee: struct
| | | | | | | | |-- Name: string
| | | | | | | |-- Attention: struct
| | | | | | | | |-- Name: string
| | | | | | |-- SenderAddress: struct
| | | | | | | |-- Address: struct
| | | | | | | | |-- CityName: string
| | | | | | | | |-- CountryorTerritoryCode: string
| | | | | | | | |-- PostalCode: string
| | | | | | | | |-- StateCode: string
| | | | | | | | |-- StreetAddress: string
| | | | | | | |-- Addressee: struct
| | | | | | | | |-- Name: string
| | | | | | | |-- Attention: struct
| | | | | | | | |-- Name: string
| | | | | | |-- ThirdPartyAddress: struct
| | | | | | | |-- Address: struct
| | | | | | | | |-- CityName: string
| | | | | | | | |-- CountryorTerritoryCode: string
| | | | | | | | |-- PostalCode: string
| | | | | | | | |-- StreetAddress: string
| | | | | | | |-- Addressee: struct
| | | | | | | | |-- Name: string
| | | | | | | |-- Attention: struct
| | | | | | | | |-- Name: string
| | | | | |-- BillOptionCode: string
| | | | | |-- LeadPortingNumber: string
| | | | | |-- Package: array
| | | | | | |-- element: struct
| | | | | | | |-- BillDetails: struct
| | | | | | | | |-- Bill: array
| | | | | | | | | |-- element: struct
| | | | | | | | | | |-- BillInformation: array
| | | | | | | | | | | |-- element: struct
| | | | | | | | | | | | |-- BasisCurrencyCode: string
| | | | | | | | | | | | |-- BasisValue: double
| | | | | | | | | | | | |-- BilldUnitQuantity: int
| | | | | | | | | | | | |-- CurrencyCode: string
| | | | | | | | | | | | |-- DescriptionCode: string
| | | | | | | | | | | | |-- DescriptionOfBills: string
| | | | | | | | | | | | |-- ExemptionAmount: double
| | | | | | | | | | | | |-- IncentiveAmount: double
| | | | | | | | | | | | |-- NetAmount: double
| | | | | | | | | | | | |-- TaxIndicator: double
| | | | | | | | | | |-- ClassificationCode: string
| | | | | | | |-- ContainerType: string
| | | | | | | |-- MiscellaneousDetails: struct
| | | | | | | | |-- MiscellaneousLineItems: struct
| | | | | | | | | |-- LineItem: struct
| | | | | | | | | | |-- LineNumber: int
| | | | | | | | | | |-- LineText: string
| | | | | | | |-- PackageBillableKeyedDimensions: struct
| | | | | | | | |-- Height: double
| | | | | | | | |-- Length: double
| | | | | | | | |-- Width: double
| | | | | | | |-- PackageDimension: struct
| | | | | | | | |-- Height: double
| | | | | | | | |-- Length: double
| | | | | | | | |-- UnitOfMeasure: string
| | | | | | | | |-- Width: double
| | | | | | | |-- PackageKeyedDimensions: struct
| | | | | | | | |-- Height: double
| | | | | | | | |-- Length: double
| | | | | | | | |-- UnitOfMeasure: string
| | | | | | | | |-- Width: double
| | | | | | | |-- PackageQuantity: struct
| | | | | | | | |-- ActualQuantity: struct
| | | | | | | | | |-- Quantity: int
| | | | | | | |-- PackageWeight: struct
| | | | | | | | |-- ActualWeight: struct
| | | | | | | | | |-- UnitOfMeasure: string
| | | | | | | | | |-- Weight: double
| | | | | | | | |-- BilledWeight: struct
| | | | | | | | | |-- UnitOfMeasure: string
| | | | | | | | | |-- Weight: double
| | | | | | | | |-- BilledWeightType: double
| | | | | | | |-- TrackingNumber: string
| | | | | | | |-- Zone: int
| | | | | |-- PayerRoleCd: int
| | | | | |-- PickUpRecordNumber: long
| | | | | |-- PortingReferences: struct
| | | | | | |-- Reference: array
| | | | | | | |-- element: struct
| | | | | | | | |-- ReferenceNumber: string
| | | | | | | | |-- Sequence: int
| | | | | |-- TransferDateCCYYMMDD: int
| |-- TypeCode: string
| |-- TypeDetailCode: double
Before writing the dynamic frame, I want to change the field "ReceiptNumber" to type string, as shown below:
....
....
| |-- ReceiptCurrencyCode: string
| |-- ReceiptDateCCYYMMDD: int
| |-- ReceiptNumber: string
| |-- TaxVarianceAmount: double
....
....
Can this be done with apply_mapping?
Is there an alternative solution?
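In the end, I was able to solve it with a slightly different approach.
To recap, I have a Glue ETL job written as a Python script.
It processes XML files. After an XML file is processed, its schema is the one shown above in the question.
I wanted to change the type of one of its nodes, "ReceiptNumber", from double (its type in the schema above) to string.
First, I created a dynamic frame from the S3 file as usual: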
d0 = glueContext.create_dynamic_frame.from_options( connection_type = "s3", connection_options={"paths": [s3_path]}, format = "xml", format_options={"rowTag": "ReceiptDetails"}, transformation_ctx = "d0")
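Then I converted the dynamic frame into a PySpark DataFrame, as shown below:
df = d0.toDF()
Next, I used a function from a linked post that shows how to modify nested struct fields and their types.
Using that function, I created a new_schema and converted the result into a new dynamic frame, as shown below.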
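from pyspark.sql.functions import from_json, to_json
from awsglue.dynamicframe import DynamicFrame
# Serialize the Receipt struct to JSON, then parse it back using the modified schema
df = df.withColumn("Receipt_json", to_json("Receipt")).drop("Receipt")
df = df.withColumn("Receipt", from_json("Receipt_json", new_schema)).drop("Receipt_json")
d0 = DynamicFrame.fromDF(df, glueContext, "d0")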
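The answer does not show how new_schema was built, so the following is only a minimal sketch of one way to do it. It assumes the schema of the Receipt column has been exported as a plain dict in the layout produced by PySpark's `StructType.jsonValue()`; the helper name `with_field_type` and the shortened sample schema are hypothetical and used for illustration only.

```python
import copy
from typing import Any, Dict, List


def with_field_type(schema: Dict[str, Any], path: List[str], new_type: str) -> Dict[str, Any]:
    """Return a copy of a Spark-style schema dict with the field at `path` retyped."""
    result = copy.deepcopy(schema)  # leave the caller's schema untouched
    node = result
    for depth, name in enumerate(path):
        if not isinstance(node, dict) or node.get("type") != "struct":
            raise TypeError("cannot descend into a non-struct at " + ".".join(path[:depth]))
        field = next((f for f in node["fields"] if f["name"] == name), None)
        if field is None:
            raise KeyError("field not found: " + ".".join(path[:depth + 1]))
        if depth == len(path) - 1:
            field["type"] = new_type  # last path element: replace its type
        else:
            node = field["type"]  # intermediate element: descend into the nested struct
    return result


# Shortened, hypothetical stand-in for the schema of the Receipt struct.
receipt_schema = {
    "type": "struct",
    "fields": [
        {"name": "Front", "type": {"type": "struct", "fields": [
            {"name": "TaxId", "type": "string", "nullable": True, "metadata": {}},
        ]}, "nullable": True, "metadata": {}},
        {"name": "ReceiptNumber", "type": "double", "nullable": True, "metadata": {}},
    ],
}

new_schema_json = with_field_type(receipt_schema, ["ReceiptNumber"], "string")

# In a Glue job the dict could come from and go back to PySpark roughly like this:
#   receipt_schema = df.schema["Receipt"].dataType.jsonValue()
#   new_schema = StructType.fromJson(new_schema_json)
```

Working on the JSON form keeps the helper independent of a running Spark session, so it can be checked locally before it is used inside the job.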
From the new dynamic frame with the modified field "ReceiptNumber" (changed from double to string), I created a JSON schema as shown below.
import json
receiptSchema = d0.schema()
withReceiptSchema = json.dumps(receiptSchema.jsonValue())
Finally, I created the dynamic frame again using the new schema, as shown below, and wrote it out to a JSON file.
d0 = glueContext.create_dynamic_frame.from_options( connection_type = "s3", connection_options={"paths": [s3_path]}, format = "xml", format_options={"withSchema": withReceiptSchema, "rowTag": "ReceiptDetails"}, transformation_ctx = "d0")
# write the data read with the schema above to a JSON file
glueContext.write_dynamic_frame.from_options(frame = d0, connection_type = "s3", connection_options = {"path": s3_write_path}, format = "json")
I hope this answer helps anyone who runs into this kind of error or obstacle while working with AWS Glue.
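I had a similar problem where I had to add, remove, and change the types of many columns. In my case, I ended up using the Map transform, which applies a function to every record of a DynamicFrame.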
import datetime
import time
from typing import Any, Dict
from awsglue.transforms import Map

inputDyf = glueContext.create_dynamic_frame_from_options(
    ...
)

def mapping(record: Dict[str, Any]):
    # Add a timestamp column and cast an existing column to int
    record["UpdatedAt"] = int(time.mktime(datetime.date.today().timetuple()))
    record["SomeVal"] = int(record["SomeVal"])
    # ... put, del and other dict operations
    return record

mapped_dyF = Map.apply(frame=inputDyf, f=mapping)
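Applied to the nested field from the question, the same idea might look like the sketch below. It assumes each record can be treated like a nested dict, as in the mapping function above; the function name `receipt_number_to_string` and the sample record are hypothetical, and dropping the trailing ".0" from whole-number doubles is a design choice, not something the question asks for.

```python
from typing import Any, Dict


def receipt_number_to_string(record: Dict[str, Any]) -> Dict[str, Any]:
    """Convert Receipt.ReceiptNumber to a string, leaving missing values alone."""
    receipt = record["Receipt"]
    value = receipt["ReceiptNumber"]
    if value is None:
        return record
    # A double such as 12345.0 would otherwise become "12345.0";
    # convert whole numbers to int first so the result is "12345".
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    receipt["ReceiptNumber"] = str(value)
    return record


# Local check on a plain dict standing in for one record.
sample = receipt_number_to_string(
    {"Receipt": {"ReceiptNumber": 12345.0, "ReceiptCurrencyCode": "USD"}}
)

# In the Glue job it would be passed to the Map transform:
#   mapped_dyF = Map.apply(frame=inputDyf, f=receipt_number_to_string)
```

Because the function only touches plain Python values, it can be unit-tested without a Glue or Spark environment.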
You can also specify the schema on the create_dynamic_frame_from_options method when using the XML format:
import json
from awsglue.gluetypes import Field, StringType, StructType

schema = StructType([
    Field("name", StringType()),
])
datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://xml_bucket/someprefix"]},
    format="xml",
    format_options={"withSchema": json.dumps(schema.jsonValue())},
    transformation_ctx=""
)
# or directly as a string
datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://xml_bucket/someprefix"]},
    format="xml",
    format_options={
        "withSchema": """
        {
            "dataType": "struct",
            "properties": {},
            "fields": [
                {
                    "name": "name",
                    "container": {
                        "dataType": "string",
                        "properties": {}
                    },
                    "properties": {}
                }
            ]
        }
        """},
    transformation_ctx=""
)
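If the schema JSON already exists (for example, taken from a first read of the file), it may be simpler to edit that JSON than to write the whole schema by hand. The sketch below retypes one nested field in a dict that uses the "dataType" / "container" / "fields" layout shown above. The helper name `retype_glue_field` and the sample schema are hypothetical, and the assumption that nested structs repeat the same layout inside "container" is extrapolated from the flat example.

```python
import copy
import json
from typing import Any, Dict, List


def retype_glue_field(schema: Dict[str, Any], path: List[str], new_type: str) -> Dict[str, Any]:
    """Return a copy of a Glue-style schema dict with the field at `path` retyped."""
    result = copy.deepcopy(schema)  # do not mutate the input
    node = result
    for depth, name in enumerate(path):
        if node.get("dataType") != "struct":
            raise TypeError("cannot descend into a non-struct at " + ".".join(path[:depth]))
        field = next((f for f in node.get("fields", []) if f["name"] == name), None)
        if field is None:
            raise KeyError("field not found: " + ".".join(path[:depth + 1]))
        if depth == len(path) - 1:
            # Replace the container with a primitive type of the requested name.
            field["container"] = {"dataType": new_type, "properties": {}}
        else:
            node = field["container"]  # descend into the nested struct
    return result


# Shortened, hypothetical schema: only the field of interest is kept.
glue_schema = {
    "dataType": "struct",
    "properties": {},
    "fields": [{
        "name": "Receipt",
        "properties": {},
        "container": {
            "dataType": "struct",
            "properties": {},
            "fields": [{
                "name": "ReceiptNumber",
                "container": {"dataType": "double", "properties": {}},
                "properties": {},
            }],
        },
    }],
}

with_schema = json.dumps(retype_glue_field(glue_schema, ["Receipt", "ReceiptNumber"], "string"))
# with_schema can then be passed as format_options={"withSchema": with_schema, ...}
```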