
Changing the data type of a specific column of a DynamicFrame in AWS Glue

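Greetings to all experts,

I've run into a problem and need a solution. Please help me with it.

I have a dynamic frame created from an XML file stored in S3.

The frame has a nested field "ReceiptNumber", and the schema of the dynamic frame is as follows: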

root
|-- Receipt: struct
|    |-- Front: struct
|    |    |-- FrontNumber: string
|    |    |-- CountryorTerritoryCode: string
|    |    |-- TaxId: string
|    |-- ReceiptAmount: double
|    |-- ReceiptCurrencyCode: string
|    |-- ReceiptDateCCYYMMDD: int
|    |-- ReceiptNumber: double
|    |-- TaxVarianceAmount: double
|    |-- TransferDetails: array
|    |    |-- element: struct
|    |    |    |-- BillCategoryCode: string
|    |    |    |-- BillCategoryDetailCode: string
|    |    |    |-- Porting: array
|    |    |    |    |-- element: struct
|    |    |    |    |    |-- AddressDetails: struct
|    |    |    |    |    |    |-- ConsigneeAddress: struct
|    |    |    |    |    |    |    |-- Address: struct
|    |    |    |    |    |    |    |    |-- AddressText2: string
|    |    |    |    |    |    |    |    |-- CityName: string
|    |    |    |    |    |    |    |    |-- CountryorTerritoryCode: string
|    |    |    |    |    |    |    |    |-- PostalCode: string
|    |    |    |    |    |    |    |    |-- StateCode: string
|    |    |    |    |    |    |    |    |-- StreetAddress: string
|    |    |    |    |    |    |    |-- Addressee: struct
|    |    |    |    |    |    |    |    |-- Name: string
|    |    |    |    |    |    |    |-- Attention: struct
|    |    |    |    |    |    |    |    |-- Name: string
|    |    |    |    |    |    |-- SenderAddress: struct
|    |    |    |    |    |    |    |-- Address: struct
|    |    |    |    |    |    |    |    |-- CityName: string
|    |    |    |    |    |    |    |    |-- CountryorTerritoryCode: string
|    |    |    |    |    |    |    |    |-- PostalCode: string
|    |    |    |    |    |    |    |    |-- StateCode: string
|    |    |    |    |    |    |    |    |-- StreetAddress: string
|    |    |    |    |    |    |    |-- Addressee: struct
|    |    |    |    |    |    |    |    |-- Name: string
|    |    |    |    |    |    |    |-- Attention: struct
|    |    |    |    |    |    |    |    |-- Name: string
|    |    |    |    |    |    |-- ThirdPartyAddress: struct
|    |    |    |    |    |    |    |-- Address: struct
|    |    |    |    |    |    |    |    |-- CityName: string
|    |    |    |    |    |    |    |    |-- CountryorTerritoryCode: string
|    |    |    |    |    |    |    |    |-- PostalCode: string
|    |    |    |    |    |    |    |    |-- StreetAddress: string
|    |    |    |    |    |    |    |-- Addressee: struct
|    |    |    |    |    |    |    |    |-- Name: string
|    |    |    |    |    |    |    |-- Attention: struct
|    |    |    |    |    |    |    |    |-- Name: string
|    |    |    |    |    |-- BillOptionCode: string
|    |    |    |    |    |-- LeadPortingNumber: string
|    |    |    |    |    |-- Package: array
|    |    |    |    |    |    |-- element: struct
|    |    |    |    |    |    |    |-- BillDetails: struct
|    |    |    |    |    |    |    |    |-- Bill: array
|    |    |    |    |    |    |    |    |    |-- element: struct
|    |    |    |    |    |    |    |    |    |    |-- BillInformation: array
|    |    |    |    |    |    |    |    |    |    |    |-- element: struct
|    |    |    |    |    |    |    |    |    |    |    |    |-- BasisCurrencyCode: string
|    |    |    |    |    |    |    |    |    |    |    |    |-- BasisValue: double
|    |    |    |    |    |    |    |    |    |    |    |    |-- BilldUnitQuantity: int
|    |    |    |    |    |    |    |    |    |    |    |    |-- CurrencyCode: string
|    |    |    |    |    |    |    |    |    |    |    |    |-- DescriptionCode: string
|    |    |    |    |    |    |    |    |    |    |    |    |-- DescriptionOfBills: string
|    |    |    |    |    |    |    |    |    |    |    |    |-- ExemptionAmount: double
|    |    |    |    |    |    |    |    |    |    |    |    |-- IncentiveAmount: double
|    |    |    |    |    |    |    |    |    |    |    |    |-- NetAmount: double
|    |    |    |    |    |    |    |    |    |    |    |    |-- TaxIndicator: double
|    |    |    |    |    |    |    |    |    |    |-- ClassificationCode: string
|    |    |    |    |    |    |    |-- ContainerType: string
|    |    |    |    |    |    |    |-- MiscellaneousDetails: struct
|    |    |    |    |    |    |    |    |-- MiscellaneousLineItems: struct
|    |    |    |    |    |    |    |    |    |-- LineItem: struct
|    |    |    |    |    |    |    |    |    |    |-- LineNumber: int
|    |    |    |    |    |    |    |    |    |    |-- LineText: string
|    |    |    |    |    |    |    |-- PackageBillableKeyedDimensions: struct
|    |    |    |    |    |    |    |    |-- Height: double
|    |    |    |    |    |    |    |    |-- Length: double
|    |    |    |    |    |    |    |    |-- Width: double
|    |    |    |    |    |    |    |-- PackageDimension: struct
|    |    |    |    |    |    |    |    |-- Height: double
|    |    |    |    |    |    |    |    |-- Length: double
|    |    |    |    |    |    |    |    |-- UnitOfMeasure: string
|    |    |    |    |    |    |    |    |-- Width: double
|    |    |    |    |    |    |    |-- PackageKeyedDimensions: struct
|    |    |    |    |    |    |    |    |-- Height: double
|    |    |    |    |    |    |    |    |-- Length: double
|    |    |    |    |    |    |    |    |-- UnitOfMeasure: string
|    |    |    |    |    |    |    |    |-- Width: double
|    |    |    |    |    |    |    |-- PackageQuantity: struct
|    |    |    |    |    |    |    |    |-- ActualQuantity: struct
|    |    |    |    |    |    |    |    |    |-- Quantity: int
|    |    |    |    |    |    |    |-- PackageWeight: struct
|    |    |    |    |    |    |    |    |-- ActualWeight: struct
|    |    |    |    |    |    |    |    |    |-- UnitOfMeasure: string
|    |    |    |    |    |    |    |    |    |-- Weight: double
|    |    |    |    |    |    |    |    |-- BilledWeight: struct
|    |    |    |    |    |    |    |    |    |-- UnitOfMeasure: string
|    |    |    |    |    |    |    |    |    |-- Weight: double
|    |    |    |    |    |    |    |    |-- BilledWeightType: double
|    |    |    |    |    |    |    |-- TrackingNumber: string
|    |    |    |    |    |    |    |-- Zone: int
|    |    |    |    |    |-- PayerRoleCd: int
|    |    |    |    |    |-- PickUpRecordNumber: long
|    |    |    |    |    |-- PortingReferences: struct
|    |    |    |    |    |    |-- Reference: array
|    |    |    |    |    |    |    |-- element: struct
|    |    |    |    |    |    |    |    |-- ReferenceNumber: string
|    |    |    |    |    |    |    |    |-- Sequence: int
|    |    |    |    |    |-- TransferDateCCYYMMDD: int
|    |-- TypeCode: string
|    |-- TypeDetailCode: double

Before writing the dynamic frame out, I want to change the field "ReceiptNumber" to string type, as shown below:

....
....
|    |-- ReceiptCurrencyCode: string
|    |-- ReceiptDateCCYYMMDD: int
|    |-- ReceiptNumber: string
|    |-- TaxVarianceAmount: double
....
....

Can this be done with apply_mapping?

Is there an alternative solution?

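In the end, I was able to solve it with a slightly different approach.

To recap, I have a Glue job of type ETL, written as a Python script.

It is responsible for processing XML files. After an XML file is processed, its schema is the one shown in my question above.

I wanted to change the type of one of its nodes, "ReceiptNumber", from the inferred numeric type (double in the schema above) to string.

First, I created a dynamic frame from the S3 file as usual: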

d0  = glueContext.create_dynamic_frame.from_options( connection_type = "s3", connection_options={"paths": [s3_path]}, format = "xml", format_options={"rowTag": "ReceiptDetails"}, transformation_ctx = "d0")

Then I converted the dynamic frame into a PySpark DataFrame, as shown below:

df = d0.toDF()

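Next, I used the function from the answer linked below, which shows how to modify a nested struct field and its type.

Pyspark: How to Modify a Nested Struct Field

With that function I built a new_schema, re-parsed the Receipt column with it, and converted the result into a new dynamic frame, as shown below.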

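from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import from_json, to_json
# round-trip the Receipt struct through JSON so it is re-parsed with new_schema
df = df.withColumn("Receipt_json", to_json("Receipt")).drop("Receipt")
df = df.withColumn("Receipt", from_json("Receipt_json", new_schema)).drop("Receipt_json")
d0 = DynamicFrame.fromDF(df, glueContext, "d0")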
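The helper from the linked answer is not reproduced here. As a minimal sketch of the same idea, assuming you start from the schema in the JSON form that `df.schema["Receipt"].dataType.jsonValue()` returns, the hypothetical function `retype_field` below walks the nested dict and swaps the type of one field. The sample `receipt_schema` is a cut-down stand-in for the real one. Turning the result back into `new_schema` needs PySpark, so that step appears only as a comment.

```python
import copy
from typing import Any, Dict, List


def retype_field(schema_json: Dict[str, Any], path: List[str], new_type: str) -> Dict[str, Any]:
    """Return a copy of a Spark struct schema (JSON form) with one field retyped.

    `path` lists the field names from the top of `schema_json` down to the target.
    This is an illustrative helper, not part of PySpark or AWS Glue.
    """
    result = copy.deepcopy(schema_json)
    node = result
    for depth, name in enumerate(path):
        # Step through array wrappers so a path can cross arrays of structs.
        while isinstance(node, dict) and node.get("type") == "array":
            node = node["elementType"]
        if not isinstance(node, dict) or node.get("type") != "struct":
            raise KeyError(f"cannot descend into {name!r}: parent is not a struct")
        field = next((f for f in node["fields"] if f["name"] == name), None)
        if field is None:
            raise KeyError(f"field {name!r} not found")
        if depth == len(path) - 1:
            field["type"] = new_type  # retype the target field
        else:
            node = field["type"]  # descend into the nested type
    return result


# Cut-down stand-in for the Receipt struct from the question.
receipt_schema = {
    "type": "struct",
    "fields": [
        {"name": "ReceiptDateCCYYMMDD", "type": "integer", "nullable": True, "metadata": {}},
        {"name": "ReceiptNumber", "type": "double", "nullable": True, "metadata": {}},
    ],
}

new_schema_json = retype_field(receipt_schema, ["ReceiptNumber"], "string")
# With PySpark available:
#   from pyspark.sql.types import StructType
#   new_schema = StructType.fromJson(new_schema_json)
```

Because the function returns a deep copy, the original schema dict is left untouched, which makes it easy to compare the two while debugging.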

From the new dynamic frame, in which the field "ReceiptNumber" is now a string, I created a JSON schema as shown below.

import json
receiptSchema = d0.schema()
withReceiptSchema = json.dumps(receiptSchema.jsonValue())

Finally, I created the dynamic frame again, this time passing the new schema, and wrote it out as a JSON file, as shown below.

d0  = glueContext.create_dynamic_frame.from_options( connection_type = "s3", connection_options={"paths": [s3_path]}, format = "xml", format_options={"withSchema": withReceiptSchema, "rowTag": "ReceiptDetails"}, transformation_ctx = "d0")

# write the data read with the schema above to a JSON file
glueContext.write_dynamic_frame.from_options(frame = d0, connection_type = "s3", connection_options = {"path": s3_write_path}, format = "json")

I hope this answer helps anyone who runs into this kind of error or obstacle while working with AWS Glue.

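I had a similar problem where I had to add, remove, and change the types of many columns. In my case I ended up using the Map transform, which applies a function to every record of a DynamicFrame.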

inputDyf = glueContext.create_dynamic_frame_from_options(
    ...
)


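import datetime
import time
from typing import Any, Dict
from awsglue.transforms import Map

def mapping(record: Dict[str, Any]):
    # stamp each record with today's date as a Unix timestamp (seconds)
    record["UpdatedAt"] = int(time.mktime(datetime.date.today().timetuple()))
    # cast an existing field to int
    record["SomeVal"] = int(record["SomeVal"])

    # ... put, del and other dict operations
    return record

mapped_dyF = Map.apply(frame=inputDyf, f=mapping)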
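Applied to the question's schema, a Map function could cast the nested field directly. The following is a minimal sketch, assuming each record behaves like a nested dict with a "Receipt" struct; `receipt_number_to_string` is a hypothetical name, and the code has only been reasoned through against plain dicts, not against live Glue records.

```python
from typing import Any, Dict


def receipt_number_to_string(record: Dict[str, Any]) -> Dict[str, Any]:
    """Cast Receipt.ReceiptNumber to a string, leaving other fields alone."""
    receipt = record["Receipt"] if "Receipt" in record else None
    if receipt is None or "ReceiptNumber" not in receipt:
        return record  # nothing to convert in this record
    number = receipt["ReceiptNumber"]
    if number is None:
        return record
    # XML inference produced a double here, so drop a trailing ".0"
    # on whole numbers before converting to text.
    if isinstance(number, float) and number.is_integer():
        number = int(number)
    receipt["ReceiptNumber"] = str(number)
    return record


sample = {"Receipt": {"ReceiptNumber": 12345.0, "TypeCode": "A"}}
converted = receipt_number_to_string(sample)
```

In a Glue job this would be passed as `Map.apply(frame=inputDyf, f=receipt_number_to_string)`. Whether a whole-number double should be rendered without its decimal part is a design choice; remove that branch if the original text form matters.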

You can also specify a schema on the create_dynamic_frame_from_options method when using the XML format.

import json
from awsglue.gluetypes import Field, StringType, StructType

schema = StructType([
  Field("name", StringType()),
])

datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://xml_bucket/someprefix"]},
    format="xml", 
    format_options={"withSchema": json.dumps(schema.jsonValue())},
    transformation_ctx = ""
)

# or directly as a string
datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://xml_bucket/someprefix"]},
    format="xml", 
    format_options={
        "withSchema": """
{
  "dataType": "struct",
  "properties": {},
  "fields": [
    {
      "name": "name",
      "container": {
        "dataType": "string",
        "properties": {}
      },
      "properties": {}
    }
  ]
}
    """},
    transformation_ctx = ""
)
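Writing that JSON by hand gets tedious for more than a few fields. Here is a minimal sketch of a generator for it, assuming the layout in the string above (`dataType`, `container`, `properties`) and covering only flat, primitive-typed fields; `glue_schema_json` is a hypothetical helper, not part of the AWS Glue library, and nested structs or arrays are out of scope.

```python
import json
from typing import List, Tuple


def glue_schema_json(fields: List[Tuple[str, str]]) -> str:
    """Build a schema JSON string for a flat list of (name, dataType) pairs.

    Mirrors the layout of the hand-written "withSchema" string above.
    """
    schema = {
        "dataType": "struct",
        "properties": {},
        "fields": [
            {
                "name": name,
                "container": {"dataType": data_type, "properties": {}},
                "properties": {},
            }
            for name, data_type in fields
        ],
    }
    return json.dumps(schema)


# Reproduces the single-field schema from the example above.
with_schema = glue_schema_json([("name", "string")])
```

The resulting string could then be passed as `format_options={"withSchema": with_schema}`.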
