Changing the data type of a specific column of a DynamicFrame in AWS Glue

Greetings all experts,

I've run into a problem and need a solution. Please help me with this.

I have a dynamic frame created from an XML file stored in S3.

The frame has a nested field 'ReceiptNumber', and the dynamic frame's schema is shown below:

root
|-- Receipt: struct
|    |-- Front: struct
|    |    |-- FrontNumber: string
|    |    |-- CountryorTerritoryCode: string
|    |    |-- TaxId: string
|    |-- ReceiptAmount: double
|    |-- ReceiptCurrencyCode: string
|    |-- ReceiptDateCCYYMMDD: int
|    |-- ReceiptNumber: double
|    |-- TaxVarianceAmount: double
|    |-- TransferDetails: array
|    |    |-- element: struct
|    |    |    |-- BillCategoryCode: string
|    |    |    |-- BillCategoryDetailCode: string
|    |    |    |-- Porting: array
|    |    |    |    |-- element: struct
|    |    |    |    |    |-- AddressDetails: struct
|    |    |    |    |    |    |-- ConsigneeAddress: struct
|    |    |    |    |    |    |    |-- Address: struct
|    |    |    |    |    |    |    |    |-- AddressText2: string
|    |    |    |    |    |    |    |    |-- CityName: string
|    |    |    |    |    |    |    |    |-- CountryorTerritoryCode: string
|    |    |    |    |    |    |    |    |-- PostalCode: string
|    |    |    |    |    |    |    |    |-- StateCode: string
|    |    |    |    |    |    |    |    |-- StreetAddress: string
|    |    |    |    |    |    |    |-- Addressee: struct
|    |    |    |    |    |    |    |    |-- Name: string
|    |    |    |    |    |    |    |-- Attention: struct
|    |    |    |    |    |    |    |    |-- Name: string
|    |    |    |    |    |    |-- SenderAddress: struct
|    |    |    |    |    |    |    |-- Address: struct
|    |    |    |    |    |    |    |    |-- CityName: string
|    |    |    |    |    |    |    |    |-- CountryorTerritoryCode: string
|    |    |    |    |    |    |    |    |-- PostalCode: string
|    |    |    |    |    |    |    |    |-- StateCode: string
|    |    |    |    |    |    |    |    |-- StreetAddress: string
|    |    |    |    |    |    |    |-- Addressee: struct
|    |    |    |    |    |    |    |    |-- Name: string
|    |    |    |    |    |    |    |-- Attention: struct
|    |    |    |    |    |    |    |    |-- Name: string
|    |    |    |    |    |    |-- ThirdPartyAddress: struct
|    |    |    |    |    |    |    |-- Address: struct
|    |    |    |    |    |    |    |    |-- CityName: string
|    |    |    |    |    |    |    |    |-- CountryorTerritoryCode: string
|    |    |    |    |    |    |    |    |-- PostalCode: string
|    |    |    |    |    |    |    |    |-- StreetAddress: string
|    |    |    |    |    |    |    |-- Addressee: struct
|    |    |    |    |    |    |    |    |-- Name: string
|    |    |    |    |    |    |    |-- Attention: struct
|    |    |    |    |    |    |    |    |-- Name: string
|    |    |    |    |    |-- BillOptionCode: string
|    |    |    |    |    |-- LeadPortingNumber: string
|    |    |    |    |    |-- Package: array
|    |    |    |    |    |    |-- element: struct
|    |    |    |    |    |    |    |-- BillDetails: struct
|    |    |    |    |    |    |    |    |-- Bill: array
|    |    |    |    |    |    |    |    |    |-- element: struct
|    |    |    |    |    |    |    |    |    |    |-- BillInformation: array
|    |    |    |    |    |    |    |    |    |    |    |-- element: struct
|    |    |    |    |    |    |    |    |    |    |    |    |-- BasisCurrencyCode: string
|    |    |    |    |    |    |    |    |    |    |    |    |-- BasisValue: double
|    |    |    |    |    |    |    |    |    |    |    |    |-- BilldUnitQuantity: int
|    |    |    |    |    |    |    |    |    |    |    |    |-- CurrencyCode: string
|    |    |    |    |    |    |    |    |    |    |    |    |-- DescriptionCode: string
|    |    |    |    |    |    |    |    |    |    |    |    |-- DescriptionOfBills: string
|    |    |    |    |    |    |    |    |    |    |    |    |-- ExemptionAmount: double
|    |    |    |    |    |    |    |    |    |    |    |    |-- IncentiveAmount: double
|    |    |    |    |    |    |    |    |    |    |    |    |-- NetAmount: double
|    |    |    |    |    |    |    |    |    |    |    |    |-- TaxIndicator: double
|    |    |    |    |    |    |    |    |    |    |-- ClassificationCode: string
|    |    |    |    |    |    |    |-- ContainerType: string
|    |    |    |    |    |    |    |-- MiscellaneousDetails: struct
|    |    |    |    |    |    |    |    |-- MiscellaneousLineItems: struct
|    |    |    |    |    |    |    |    |    |-- LineItem: struct
|    |    |    |    |    |    |    |    |    |    |-- LineNumber: int
|    |    |    |    |    |    |    |    |    |    |-- LineText: string
|    |    |    |    |    |    |    |-- PackageBillableKeyedDimensions: struct
|    |    |    |    |    |    |    |    |-- Height: double
|    |    |    |    |    |    |    |    |-- Length: double
|    |    |    |    |    |    |    |    |-- Width: double
|    |    |    |    |    |    |    |-- PackageDimension: struct
|    |    |    |    |    |    |    |    |-- Height: double
|    |    |    |    |    |    |    |    |-- Length: double
|    |    |    |    |    |    |    |    |-- UnitOfMeasure: string
|    |    |    |    |    |    |    |    |-- Width: double
|    |    |    |    |    |    |    |-- PackageKeyedDimensions: struct
|    |    |    |    |    |    |    |    |-- Height: double
|    |    |    |    |    |    |    |    |-- Length: double
|    |    |    |    |    |    |    |    |-- UnitOfMeasure: string
|    |    |    |    |    |    |    |    |-- Width: double
|    |    |    |    |    |    |    |-- PackageQuantity: struct
|    |    |    |    |    |    |    |    |-- ActualQuantity: struct
|    |    |    |    |    |    |    |    |    |-- Quantity: int
|    |    |    |    |    |    |    |-- PackageWeight: struct
|    |    |    |    |    |    |    |    |-- ActualWeight: struct
|    |    |    |    |    |    |    |    |    |-- UnitOfMeasure: string
|    |    |    |    |    |    |    |    |    |-- Weight: double
|    |    |    |    |    |    |    |    |-- BilledWeight: struct
|    |    |    |    |    |    |    |    |    |-- UnitOfMeasure: string
|    |    |    |    |    |    |    |    |    |-- Weight: double
|    |    |    |    |    |    |    |    |-- BilledWeightType: double
|    |    |    |    |    |    |    |-- TrackingNumber: string
|    |    |    |    |    |    |    |-- Zone: int
|    |    |    |    |    |-- PayerRoleCd: int
|    |    |    |    |    |-- PickUpRecordNumber: long
|    |    |    |    |    |-- PortingReferences: struct
|    |    |    |    |    |    |-- Reference: array
|    |    |    |    |    |    |    |-- element: struct
|    |    |    |    |    |    |    |    |-- ReferenceNumber: string
|    |    |    |    |    |    |    |    |-- Sequence: int
|    |    |    |    |    |-- TransferDateCCYYMMDD: int
|    |-- TypeCode: string
|    |-- TypeDetailCode: double

Before writing the dynamic frame, I want to change the field 'ReceiptNumber' to a string type, like below:

....
....
|    |-- ReceiptCurrencyCode: string
|    |-- ReceiptDateCCYYMMDD: int
|    |-- ReceiptNumber: string
|    |-- TaxVarianceAmount: double
....
....

Is this possible via apply_mapping?

Is there any alternative solution?

In the end, I was able to solve it with a slightly different approach.

To recap, I have a Glue ETL job written as a Python script.

It is responsible for processing an XML file. After processing, the schema looked like the one shown in the question.

I wanted to change the type of one of its nodes, 'ReceiptNumber', from the numeric type that was inferred (double in the schema above) to string.

First, I created a dynamic frame from the S3 file as usual:

d0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [s3_path]},
    format="xml",
    format_options={"rowTag": "ReceiptDetails"},
    transformation_ctx="d0",
)

Then I converted the dynamic frame into a PySpark DataFrame:

df = d0.toDF()

Then I used the function from the following post, which shows how to modify a nested struct field and its type:

Pyspark: How to Modify a Nested Struct Field

Using that function, I created a new_schema, applied it as shown below, and converted the result into a new dynamic frame.

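from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import from_json, to_json
# Round-trip the Receipt struct through JSON so it is re-parsed with new_schema.
df = df.withColumn("Receipt_json", to_json("Receipt")).drop("Receipt")
df = df.withColumn("Receipt", from_json("Receipt_json", new_schema)).drop("Receipt_json")
d0 = DynamicFrame.fromDF(df, glueContext, "d0")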
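
The linked function is not reproduced in this answer. As a rough sketch of the idea only, the helper below (retype_field is a hypothetical name, not the linked post's code) walks the JSON form of a Spark schema and changes the type of one nested field. It assumes the usual layout that PySpark's StructType.jsonValue() produces ("type": "struct" with a "fields" list, and arrays carrying an "elementType"), and it works on plain dicts so it can be tried without a Spark session.

```python
import copy


def retype_field(schema: dict, path: list, new_type: str) -> dict:
    """Return a copy of a Spark schema JSON dict with the field at `path` retyped."""
    result = copy.deepcopy(schema)  # leave the caller's schema untouched
    node = result
    for depth, name in enumerate(path):
        # Step through array wrappers to reach the element struct.
        while isinstance(node, dict) and node.get("type") == "array":
            node = node["elementType"]
        if not isinstance(node, dict) or node.get("type") != "struct":
            raise KeyError(f"{'.'.join(path[:depth])!r} is not a struct")
        field = next((f for f in node["fields"] if f["name"] == name), None)
        if field is None:
            raise KeyError(f"no field named {name!r}")
        if depth == len(path) - 1:
            field["type"] = new_type  # the actual type change
        else:
            node = field["type"]  # descend into the nested type
    return result


# A cut-down stand-in for the Receipt struct from the question.
receipt_schema = {
    "type": "struct",
    "fields": [
        {"name": "ReceiptDateCCYYMMDD", "type": "integer", "nullable": True, "metadata": {}},
        {"name": "ReceiptNumber", "type": "double", "nullable": True, "metadata": {}},
    ],
}

new_schema_json = retype_field(receipt_schema, ["ReceiptNumber"], "string")
print(new_schema_json["fields"][1]["type"])
```

In a real job the input dict would presumably come from df.schema["Receipt"].dataType.jsonValue(), and the result could be turned back into a schema object with StructType.fromJson(new_schema_json) before passing it to from_json. That wiring is my assumption about how the pieces fit together, not something stated in the linked post.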

From the new dynamic frame, whose 'ReceiptNumber' field is now a string, I created a JSON schema like below.

import json

receiptSchema = d0.schema()
withReceiptSchema = json.dumps(receiptSchema.jsonValue())

Finally, I created the dynamic frame again, this time passing the new schema, and wrote it out to a JSON file:

d0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [s3_path]},
    format="xml",
    format_options={"withSchema": withReceiptSchema, "rowTag": "ReceiptDetails"},
    transformation_ctx="d0",
)

# Write the data, read with the schema above, to a JSON file
glueContext.write_dynamic_frame.from_options(
    frame=d0,
    connection_type="s3",
    connection_options={"path": s3_write_path},
    format="json",
)

I hope this answer helps anyone who runs into this sort of error or roadblock while working on AWS Glue jobs.

I had a similar problem where I had to add, delete, and change the types of many columns. In my case I ended up using the Map transformation, which applies a function to every record of a DynamicFrame.

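import datetime
import time
from typing import Any, Dict

from awsglue.transforms import Map

inputDyf = glueContext.create_dynamic_frame_from_options(
    ...
)


def mapping(record: Dict[str, Any]):
    # Stamp each record with today's date as a Unix timestamp (seconds).
    record["UpdatedAt"] = int(time.mktime(datetime.date.today().timetuple()))
    # Cast an existing field to int.
    record["SomeVal"] = int(record["SomeVal"])

    # ... put, del and other dict operations
    return record

mapped_dyF = Map.apply(frame=inputDyf, f=mapping)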
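
Applied to the schema in the question, the mapping function might look something like the sketch below. The function name receipt_number_to_string is made up for illustration. It is written against a plain dict, on the assumption that the record (and its nested Receipt struct) can be read and assigned like a dict, as in the code above; I have not checked how nested structs arrive inside a Glue job. Whole-number doubles are rendered without the trailing ".0", which is a formatting choice of mine rather than anything Glue requires.

```python
from typing import Any, Dict


def receipt_number_to_string(record: Dict[str, Any]) -> Dict[str, Any]:
    """Convert Receipt.ReceiptNumber to a string, leaving missing values alone."""
    receipt = record.get("Receipt")
    if receipt is None:
        return record
    number = receipt.get("ReceiptNumber")
    if number is None:
        return record
    # A double such as 1234567.0 would otherwise become "1234567.0".
    if isinstance(number, float) and number.is_integer():
        number = int(number)
    receipt["ReceiptNumber"] = str(number)
    return record


# Plain-dict stand-in for one record of the frame.
sample = {"Receipt": {"ReceiptNumber": 1234567.0, "ReceiptCurrencyCode": "USD"}}
result = receipt_number_to_string(sample)
print(result["Receipt"]["ReceiptNumber"])
```

It would then be passed to the transformation in the same way as mapping above, i.e. Map.apply(frame=inputDyf, f=receipt_number_to_string).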

You can also specify the schema in the create_dynamic_frame_from_options method when using the XML format:

import json

from awsglue.gluetypes import Field, StringType, StructType

schema = StructType([
    Field("name", StringType()),
])

datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://xml_bucket/someprefix"]},
    format="xml",
    format_options={"withSchema": json.dumps(schema.jsonValue())},
    transformation_ctx="",
)

# or directly as a string
datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://xml_bucket/someprefix"]},
    format="xml", 
    format_options={
        "withSchema": """
{
  "dataType": "struct",
  "properties": {},
  "fields": [
    {
      "name": "name",
      "container": {
        "dataType": "string",
        "properties": {}
      },
      "properties": {}
    }
  ]
}
    """},
    transformation_ctx = ""
)
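
If you would rather not write that JSON by hand, a few small helpers can build it. This is only a sketch that mirrors the shape of the string above ("dataType", "properties", "fields", "container"). The helper names glue_type, glue_field and glue_struct are made up here, and nesting a struct as another field's container is my extrapolation from that shape.

```python
import json


def glue_type(data_type: str) -> dict:
    """A leaf type such as "string" or "double"."""
    return {"dataType": data_type, "properties": {}}


def glue_field(name: str, container: dict) -> dict:
    """A named field whose type description is `container`."""
    return {"name": name, "container": container, "properties": {}}


def glue_struct(fields: list) -> dict:
    """A struct holding a list of fields."""
    return {"dataType": "struct", "properties": {}, "fields": fields}


# Same schema as the hand-written string: one string field called "name".
schema_dict = glue_struct([glue_field("name", glue_type("string"))])
with_schema = json.dumps(schema_dict)

# Assumed nesting: a Receipt struct whose ReceiptNumber is a string.
receipt_schema_dict = glue_struct([
    glue_field("Receipt", glue_struct([
        glue_field("ReceiptNumber", glue_type("string")),
    ])),
])
print(with_schema)
```

The resulting with_schema string would go into format_options={"withSchema": with_schema} in the same way as above.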
