
How to properly rename columns of dynamic dataframe in AWS Glue?

I load JSON data and use the relationalize method on a dynamic frame to flatten the otherwise nested JSON object, then save it in Parquet format. The problem is that once the data is saved as Parquet for faster Athena queries, the column names contain dots, which conflicts with Athena SQL query syntax, so I am unable to run column-specific queries.

To tackle this problem, I also rename the columns in the Glue job to replace the dots with underscores. My question is which of the two approaches below is better, and why? (Efficiency: memory? execution speed on the nodes? etc.)

Also, given the poor AWS Glue documentation, I could not come up with a DynamicFrame-only solution. I have trouble getting the column names dynamically, so I am using toDF().

1) The first approach gets the column names from the DataFrame extracted from the dynamic frame:

relationalize1 = Relationalize.apply(frame=datasource0, transformation_ctx="relationalize1").select("roottable")
df_relationalize1 = relationalize1.toDF()
for field in df_relationalize1.schema.fields:
    relationalize1 = RenameField.apply(frame = relationalize1, old_name = "`"+field.name+"`", new_name = field.name.replace(".","_"), transformation_ctx = "renamefield_" + field.name)

2) The second approach is to extract the DataFrame from the dynamic frame, rename the fields on the PySpark DataFrame (instead of the dynamic frame), then convert back to a dynamic frame and save it in Parquet format.
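A minimal sketch of the second approach's rename logic, kept as plain helper functions (`sanitize` and `rename_columns` are hypothetical names, not Glue API; the Glue-specific conversion back to a DynamicFrame is shown only as a comment since it runs inside a Glue job):

```python
from functools import reduce

def sanitize(name):
    # Lowercase and replace spaces and dots with underscores
    # so the name is valid in Athena SQL.
    return name.lower().replace(" ", "_").replace(".", "_")

def rename_columns(df):
    # Rename every column of a PySpark DataFrame using sanitize().
    # In a Glue job, this would be followed by
    #   DynamicFrame.fromDF(df, glue_context, "renamed")
    # before writing the frame out as Parquet.
    return reduce(
        lambda acc, old: acc.withColumnRenamed(old, sanitize(old)),
        df.schema.names,
        df,
    )
```

The reduce() call threads the DataFrame through one withColumnRenamed() per column, which is a lazy metadata-only operation in Spark, so the rename itself is cheap compared with the relationalize step.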

Is there a better approach? Can a crawler rename columns? How fast is the .fromDF() method? Is there better documentation on functions and methods than the PDF developer guide?

The question specifically asks about renaming:

(a) Convert to DataFrame.
(b) Create a new_columns array with the desired column names, in the same order as old_columns.
(c) Overwrite and persist new_columns using functools.reduce() and pyspark.withColumnRenamed().
(d) Convert back to DynamicFrame.

from awsglue.job import Job
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from functools import reduce

JOB_NAME = "csv_to_parquet"
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(JOB_NAME)

# Create DynamicFrame
datasource = glue_context.create_dynamic_frame_from_options(
    connection_type="s3",
    format="csv",
    connection_options={"paths": ["s3://path/to/source/file.csv"]},
    format_options={"withHeader": True, "separator": chr(44)},  # comma delimited
)

# (a) Convert to DataFrame
df = datasource.toDF()

# (b) Create array with desired columns
old_columns = df.schema.names
new_columns = [
    field.lower().replace(" ", "_").replace(".", "_") for field in old_columns
]

# (c) Overwrite and persist `new_columns`
df = reduce(
    lambda df, idx: df.withColumnRenamed(old_columns[idx], new_columns[idx]),
    range(len(old_columns)),
    df,
)

# (d) Convert back to DynamicFrame
datasource = datasource.fromDF(df, glue_context, "datasource")

# Write DynamicFrame as Parquet
datasink = glue_context.write_dynamic_frame_from_options(
    frame=datasource,
    connection_type="s3",
    connection_options={"path": "s3://path/to/target/prefix/"},
    format="parquet",
)


You can access the schema of the DynamicFrame with the schema attribute. From that you can define a mapping from any columns containing . to new columns that use _. You just need to know the types and names of the columns to do this with the ApplyMapping transformation.

Maybe:

from awsglue.transforms import ApplyMapping

# Construct a renaming mapping for ApplyMapping.
# Note: every field must be listed, because ApplyMapping drops
# any column that is not included in the mappings.
mappings = []
for field in df.schema.fields:
    dtype = field.dataType.typeName()
    mappings.append((field.name, dtype, field.name.replace('.', '_'), dtype))

# Apply the mapping to the DynamicFrame (not the DataFrame),
# e.g. the frame returned by create_dynamic_frame_from_options
renamed = ApplyMapping.apply(frame=datasource, mappings=mappings)
