Changing the datatype of a specific column of a DynamicFrame in AWS Glue
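Greetings to all experts,
I have run into a problem and need a solution. Please help me with this.
I have a dynamic frame created from XML files stored in S3.
The frame has a nested field "ReceiptNumber", and the dynamic frame's schema is as follows: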
root
|-- Receipt: struct
| |-- Front: struct
| | |-- FrontNumber: string
| | |-- CountryorTerritoryCode: string
| | |-- TaxId: string
| |-- ReceiptAmount: double
| |-- ReceiptCurrencyCode: string
| |-- ReceiptDateCCYYMMDD: int
| |-- ReceiptNumber: double
| |-- TaxVarianceAmount: double
| |-- TransferDetails: array
| | |-- element: struct
| | | |-- BillCategoryCode: string
| | | |-- BillCategoryDetailCode: string
| | | |-- Porting: array
| | | | |-- element: struct
| | | | | |-- AddressDetails: struct
| | | | | | |-- ConsigneeAddress: struct
| | | | | | | |-- Address: struct
| | | | | | | | |-- AddressText2: string
| | | | | | | | |-- CityName: string
| | | | | | | | |-- CountryorTerritoryCode: string
| | | | | | | | |-- PostalCode: string
| | | | | | | | |-- StateCode: string
| | | | | | | | |-- StreetAddress: string
| | | | | | | |-- Addressee: struct
| | | | | | | | |-- Name: string
| | | | | | | |-- Attention: struct
| | | | | | | | |-- Name: string
| | | | | | |-- SenderAddress: struct
| | | | | | | |-- Address: struct
| | | | | | | | |-- CityName: string
| | | | | | | | |-- CountryorTerritoryCode: string
| | | | | | | | |-- PostalCode: string
| | | | | | | | |-- StateCode: string
| | | | | | | | |-- StreetAddress: string
| | | | | | | |-- Addressee: struct
| | | | | | | | |-- Name: string
| | | | | | | |-- Attention: struct
| | | | | | | | |-- Name: string
| | | | | | |-- ThirdPartyAddress: struct
| | | | | | | |-- Address: struct
| | | | | | | | |-- CityName: string
| | | | | | | | |-- CountryorTerritoryCode: string
| | | | | | | | |-- PostalCode: string
| | | | | | | | |-- StreetAddress: string
| | | | | | | |-- Addressee: struct
| | | | | | | | |-- Name: string
| | | | | | | |-- Attention: struct
| | | | | | | | |-- Name: string
| | | | | |-- BillOptionCode: string
| | | | | |-- LeadPortingNumber: string
| | | | | |-- Package: array
| | | | | | |-- element: struct
| | | | | | | |-- BillDetails: struct
| | | | | | | | |-- Bill: array
| | | | | | | | | |-- element: struct
| | | | | | | | | | |-- BillInformation: array
| | | | | | | | | | | |-- element: struct
| | | | | | | | | | | | |-- BasisCurrencyCode: string
| | | | | | | | | | | | |-- BasisValue: double
| | | | | | | | | | | | |-- BilldUnitQuantity: int
| | | | | | | | | | | | |-- CurrencyCode: string
| | | | | | | | | | | | |-- DescriptionCode: string
| | | | | | | | | | | | |-- DescriptionOfBills: string
| | | | | | | | | | | | |-- ExemptionAmount: double
| | | | | | | | | | | | |-- IncentiveAmount: double
| | | | | | | | | | | | |-- NetAmount: double
| | | | | | | | | | | | |-- TaxIndicator: double
| | | | | | | | | | |-- ClassificationCode: string
| | | | | | | |-- ContainerType: string
| | | | | | | |-- MiscellaneousDetails: struct
| | | | | | | | |-- MiscellaneousLineItems: struct
| | | | | | | | | |-- LineItem: struct
| | | | | | | | | | |-- LineNumber: int
| | | | | | | | | | |-- LineText: string
| | | | | | | |-- PackageBillableKeyedDimensions: struct
| | | | | | | | |-- Height: double
| | | | | | | | |-- Length: double
| | | | | | | | |-- Width: double
| | | | | | | |-- PackageDimension: struct
| | | | | | | | |-- Height: double
| | | | | | | | |-- Length: double
| | | | | | | | |-- UnitOfMeasure: string
| | | | | | | | |-- Width: double
| | | | | | | |-- PackageKeyedDimensions: struct
| | | | | | | | |-- Height: double
| | | | | | | | |-- Length: double
| | | | | | | | |-- UnitOfMeasure: string
| | | | | | | | |-- Width: double
| | | | | | | |-- PackageQuantity: struct
| | | | | | | | |-- ActualQuantity: struct
| | | | | | | | | |-- Quantity: int
| | | | | | | |-- PackageWeight: struct
| | | | | | | | |-- ActualWeight: struct
| | | | | | | | | |-- UnitOfMeasure: string
| | | | | | | | | |-- Weight: double
| | | | | | | | |-- BilledWeight: struct
| | | | | | | | | |-- UnitOfMeasure: string
| | | | | | | | | |-- Weight: double
| | | | | | | | |-- BilledWeightType: double
| | | | | | | |-- TrackingNumber: string
| | | | | | | |-- Zone: int
| | | | | |-- PayerRoleCd: int
| | | | | |-- PickUpRecordNumber: long
| | | | | |-- PortingReferences: struct
| | | | | | |-- Reference: array
| | | | | | | |-- element: struct
| | | | | | | | |-- ReferenceNumber: string
| | | | | | | | |-- Sequence: int
| | | | | |-- TransferDateCCYYMMDD: int
| |-- TypeCode: string
| |-- TypeDetailCode: double
Before writing the dynamic frame, I want to change the field "ReceiptNumber" to a string type, as shown below:
....
....
| |-- ReceiptCurrencyCode: string
| |-- ReceiptDateCCYYMMDD: int
| |-- ReceiptNumber: string
| |-- TaxVarianceAmount: double
....
....
Can this be done with apply_mapping?
Is there an alternative solution?
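In the end, I was able to solve it with a slightly different approach.
To recap, I have a Glue ETL job written as a Python script.
It processes XML files. After an XML file is processed, its schema is as shown above in my question.
I wanted to change the type of one of its nodes, "ReceiptNumber", from a numeric type (shown as double in the schema above) to string.
First, I created a dynamic frame from the S3 files as usual: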
d0 = glueContext.create_dynamic_frame.from_options( connection_type = "s3", connection_options={"paths": [s3_path]}, format = "xml", format_options={"rowTag": "ReceiptDetails"}, transformation_ctx = "d0")
Then I converted the dynamic frame into a PySpark DataFrame:
df = d0.toDF()
Then I used a function, taken from a linked post, that modifies a nested struct field and its type.
With that function I built a new_schema and used it to create a new dynamic frame, as follows:
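from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import from_json, to_json
# Serialize the Receipt struct to JSON, then parse it back using the modified schema
df = df.withColumn("Receipt_json", to_json("Receipt")).drop("Receipt")
df = df.withColumn("Receipt", from_json("Receipt_json", new_schema)).drop("Receipt_json")
d0 = DynamicFrame.fromDF(df, glueContext, "d0")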
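The function that produces new_schema is not shown in this answer. As a minimal sketch of one way it might look, the helper below retypes a single nested field in the dict form of a Spark schema (the layout returned by StructType.jsonValue()). The name set_field_type and the two-field sample schema are hypothetical, and the sketch only walks through struct fields, not arrays.

```python
import copy


def set_field_type(schema, path, new_type):
    """Return a copy of a Spark schema dict with the field at `path` retyped.

    `schema` is assumed to use the layout of StructType.jsonValue():
    {"type": "struct", "fields": [{"name": ..., "type": ..., ...}, ...]}
    """
    result = copy.deepcopy(schema)  # leave the caller's schema untouched
    node = result
    for depth, name in enumerate(path):
        field = next((f for f in node["fields"] if f["name"] == name), None)
        if field is None:
            raise KeyError("field %r not found" % name)
        if depth == len(path) - 1:
            field["type"] = new_type  # last path element: swap the type
        else:
            node = field["type"]  # descend into the nested struct
            if not isinstance(node, dict) or node.get("type") != "struct":
                raise TypeError("field %r is not a struct" % name)
    return result


# Hypothetical, shortened version of the Receipt struct from the question
receipt = {
    "type": "struct",
    "fields": [
        {"name": "ReceiptDateCCYYMMDD", "type": "integer", "nullable": True, "metadata": {}},
        {"name": "ReceiptNumber", "type": "double", "nullable": True, "metadata": {}},
    ],
}
retyped = set_field_type(receipt, ["ReceiptNumber"], "string")
```

In a Glue job, the input dict could come from df.schema["Receipt"].dataType.jsonValue(), and the result could be turned back into a schema object with StructType.fromJson(retyped) from pyspark.sql.types before passing it to from_json.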
From the new dynamic frame, in which "ReceiptNumber" is now a string, I created a JSON schema as follows:
import json
receiptSchema = d0.schema()
withReceiptSchema = json.dumps(receiptSchema.jsonValue())
Finally, I created the dynamic frame again using the new schema and wrote it out to a JSON file, as follows:
d0 = glueContext.create_dynamic_frame.from_options( connection_type = "s3", connection_options={"paths": [s3_path]}, format = "xml", format_options={"withSchema": withReceiptSchema, "rowTag": "ReceiptDetails"}, transformation_ctx = "d0")
# write the data read with the schema above to a JSON file
glueContext.write_dynamic_frame.from_options(frame = d0, connection_type = "s3", connection_options = {"path": s3_write_path}, format = "json")
I hope this answer helps anyone who runs into this kind of error or obstacle while working with AWS Glue.
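I had a similar problem where I had to add, remove, and change the types of many columns. In my case I ended up using the Map transform, which applies a function to every record of a DynamicFrame.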
import datetime
import time
from typing import Any, Dict
from awsglue.transforms import Map

inputDyf = glueContext.create_dynamic_frame_from_options(
    ...
)

def mapping(record: Dict[str, Any]):
    # Stamp each record with today's date as a Unix timestamp
    record["UpdatedAt"] = int(time.mktime(datetime.date.today().timetuple()))
    record["SomeVal"] = int(record["SomeVal"])
    # ... put, del and other dict operations
    return record

mapped_dyF = Map.apply(frame=inputDyf, f=mapping)
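Applied to the original question, the same Map approach might look like the sketch below. It assumes each record, and the nested Receipt struct inside it, can be read and written like a dict. The function name receipt_number_to_string and the sample record are hypothetical, and dropping the trailing ".0" from whole-number doubles is a design choice, not something the question requires.

```python
from typing import Any, Dict


def receipt_number_to_string(record: Dict[str, Any]) -> Dict[str, Any]:
    """Convert Receipt.ReceiptNumber to a string, leaving other fields as-is."""
    receipt = record.get("Receipt")
    if receipt is not None:
        value = receipt.get("ReceiptNumber")
        if isinstance(value, float) and value.is_integer():
            # Whole-number doubles such as 1234567.0 become "1234567"
            receipt["ReceiptNumber"] = str(int(value))
        elif value is not None:
            receipt["ReceiptNumber"] = str(value)
    return record


# Hypothetical sample record, used only to show the conversion locally
sample = {"Receipt": {"ReceiptNumber": 1234567.0, "ReceiptCurrencyCode": "USD"}}
converted = receipt_number_to_string(sample)
```

Inside a Glue job the function would be passed to the transform in the same way as above, for example Map.apply(frame=inputDyf, f=receipt_number_to_string).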
You can also specify the schema on the create_dynamic_frame_from_options method when using the XML format:
import json
from awsglue.gluetypes import Field, StringType, StructType

schema = StructType([
    Field("name", StringType()),
])
datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://xml_bucket/someprefix"]},
    format="xml",
    format_options={"withSchema": json.dumps(schema.jsonValue())},
    transformation_ctx="",
)
# or directly as a string
datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://xml_bucket/someprefix"]},
    format="xml",
    format_options={
        "withSchema": """
        {
            "dataType": "struct",
            "properties": {},
            "fields": [
                {
                    "name": "name",
                    "container": {
                        "dataType": "string",
                        "properties": {}
                    },
                    "properties": {}
                }
            ]
        }
        """},
    transformation_ctx="",
)
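For the nested case in the question, the same JSON layout can be generated instead of written by hand. This is only a sketch: the helper names glue_scalar, glue_field and glue_struct are hypothetical, it assumes nested structs repeat the "dataType" / "fields" / "container" layout shown above, and it lists just three of the Receipt fields. How Glue treats fields left out of the schema is not confirmed here.

```python
import json


def glue_scalar(type_name):
    """Schema node for a scalar type such as "string" or "int"."""
    return {"dataType": type_name, "properties": {}}


def glue_field(name, container):
    """Named field wrapping a scalar or struct node."""
    return {"name": name, "container": container, "properties": {}}


def glue_struct(fields):
    """Struct node holding a list of fields."""
    return {"dataType": "struct", "properties": {}, "fields": fields}


# Partial schema: Receipt.ReceiptNumber is declared as a string up front
receipt_schema = glue_struct([
    glue_field("Receipt", glue_struct([
        glue_field("ReceiptCurrencyCode", glue_scalar("string")),
        glue_field("ReceiptDateCCYYMMDD", glue_scalar("int")),
        glue_field("ReceiptNumber", glue_scalar("string")),
    ])),
])
with_schema = json.dumps(receipt_schema)
```

The resulting with_schema string is what would go in format_options={"withSchema": with_schema, ...} when reading the XML.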