Dynamically renaming dataframe columns using Pyspark

I'm reading a file where a column is a struct when it has a value, but a string when there is no data. In the example below, assigned_to and group are structs and have data.

root
 |-- number: string (nullable = true)
 |-- assigned_to: struct (nullable = true)
 |    |-- display_value: string (nullable = true)
 |    |-- link: string (nullable = true)
 |-- group: struct (nullable = true)
 |    |-- display_value: string (nullable = true)
 |    |-- link: string (nullable = true)

To flatten the JSON, I'm doing the following:

from pyspark.sql.functions import lit

df23 = spark.read.parquet("dbfs:***/test1.parquet")
val_cols4 = []

# Idea: when a column's data type is struct, extract its display_value dynamically;
# otherwise create a new column that defaults to None.
for name, cols in df23.dtypes:
  if 'struct' in cols:
    val_cols4.append(name+".display_value")
  else:
    df23 = df23.withColumn(name+"_value", lit(None))

Now, if I use val_cols4 to select from dataframe df23, all the struct columns end up with the same name, "display_value".

root
 |-- display_value: string (nullable = true)
 |-- display_value: string (nullable = true)

How do I rename the columns to appropriate values? I tried the following:

for name, cols in df23.dtypes:
  if 'struct' in cols:
    val_cols4.append("col('"+name+".display_value').alias('"+name+"_value')") 
  else:
    df23 = df23.withColumn(name+"_value", lit(None))

This doesn't work: it errors out when I do a select on the dataframe.

You can append an aliased column object, rather than a string, to val_cols4, e.g.

from pyspark.sql.functions import col, lit

val_cols4 = []

for name, cols in df23.dtypes:
  if 'struct' in cols:
    val_cols4.append(col(name+".display_value").alias(name+"_value")) 
  else:
    df23 = df23.withColumn(name+"_value", lit(None))

Then you can select the columns, e.g.

newdf = df23.select(val_cols4)
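
The attempt in the question failed because select treats a plain string as a column name; it does not evaluate it as Python, so "col('...').alias('...')" is never executed. If you would rather keep building strings, one alternative is to build Spark SQL expression strings and pass them to selectExpr. The sketch below is only an illustration: the helper name build_select_exprs and its field and suffix parameters are hypothetical, it assumes every struct column has a display_value field, and it assumes a missing value should become a null string column. It also checks for a type string that starts with "struct<", so that a type such as array<struct<...>> is not treated as a struct.

```python
def build_select_exprs(dtypes, field="display_value", suffix="_value"):
    """Build Spark SQL expression strings from DataFrame.dtypes-style pairs.

    Hypothetical helper: `dtypes` is a list of (column_name, type_string)
    tuples, in the form returned by DataFrame.dtypes.
    """
    def quote(identifier):
        # Backtick-quote an identifier, doubling any embedded backticks,
        # so that names such as `group` are not parsed as SQL keywords.
        return "`" + identifier.replace("`", "``") + "`"

    exprs = []
    for name, dtype in dtypes:
        alias = quote(name + suffix)
        if dtype.startswith("struct<"):
            # Struct column: pull out the nested field and rename it.
            exprs.append(f"{quote(name)}.{quote(field)} AS {alias}")
        else:
            # Non-struct column (assumed to carry no data): typed null placeholder.
            exprs.append(f"CAST(NULL AS STRING) AS {alias}")
    return exprs


# Type strings in the form DataFrame.dtypes reports for the schema in the question.
example_dtypes = [
    ("number", "string"),
    ("assigned_to", "struct<display_value:string,link:string>"),
    ("group", "struct<display_value:string,link:string>"),
]

exprs = build_select_exprs(example_dtypes)
for expr in exprs:
    print(expr)
```

With a live dataframe, this would be used as newdf = df23.selectExpr(*build_select_exprs(df23.dtypes)). Unlike the loop above, this puts the placeholder columns into the same select as the extracted ones, so both kinds of column appear in newdf.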
