Dynamically renaming dataframe columns using Pyspark

Question

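I am reading a file in which a column can be a struct when it has values, and a string otherwise when there is no data. In the example below, assigned_to and group are structs and contain data.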

root
 |-- number: string (nullable = true)
 |-- assigned_to: struct (nullable = true)
 |    |-- display_value: string (nullable = true)
 |    |-- link: string (nullable = true)
 |-- group: struct (nullable = true)
 |    |-- display_value: string (nullable = true)
 |    |-- link: string (nullable = true)

To flatten the JSON, I am doing the following:

from pyspark.sql.functions import lit

df23 = spark.read.parquet("dbfs:***/test1.parquet")
val_cols4 = []

# The idea: when a column's data type is struct, dynamically extract its display_value; otherwise create a new column that defaults to None.
for name, cols in df23.dtypes:
  if 'struct' in cols:
    val_cols4.append(name+".display_value")
  else:
    df23 = df23.withColumn(name+"_value", lit(None))

Now, if I select from dataframe df23 using val_cols4, all the struct columns end up with the same name, "display_value".

root
 |-- display_value: string (nullable = true)
 |-- display_value: string (nullable = true)

How can I rename the columns to appropriate names? I tried the following:

for name, cols in df23.dtypes:
  if 'struct' in cols:
    val_cols4.append("col('"+name+".display_value').alias('"+name+"_value')") 
  else:
    df23 = df23.withColumn(name+"_value", lit(None))

When I run select on the dataframe with this list, it does not work and raises an error.

Answer 1

You can append aliased Column objects to val_cols4 instead of strings, for example:

from pyspark.sql.functions import col, lit

val_cols4 = []

for name, cols in df23.dtypes:
  if 'struct' in cols:
    val_cols4.append(col(name+".display_value").alias(name+"_value")) 
  else:
    df23 = df23.withColumn(name+"_value", lit(None))

Then you can select the columns, for example:

newdf = df23.select(val_cols4)
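
The attempt in the question fails because select treats a plain string as a column name, so a string containing Python source such as "col('...').alias('...')" cannot be resolved. If strings are preferred over Column objects, one possible alternative is to build SQL expression strings and pass them to selectExpr, which parses each string as a SQL expression. The sketch below is only an illustration under stated assumptions: build_select_exprs, its field and suffix parameters, and sample_dtypes are hypothetical names, and casting the default NULL to STRING is an assumed choice so that the placeholder columns carry a concrete type.

```python
def build_select_exprs(dtypes, field="display_value", suffix="_value"):
    """Return one SQL expression string per column in `dtypes`.

    `dtypes` has the shape of DataFrame.dtypes: a list of (name, type) pairs.
    This helper is hypothetical and not part of PySpark.
    """
    exprs = []
    for name, dtype in dtypes:
        if dtype.startswith("struct"):
            # Struct column: pull out the nested field and rename it.
            # Backticks keep names such as `group` from being read as SQL keywords.
            exprs.append(f"`{name}`.`{field}` AS `{name}{suffix}`")
        else:
            # Non-struct column: emit a typed NULL under the same naming scheme.
            exprs.append(f"CAST(NULL AS STRING) AS `{name}{suffix}`")
    return exprs


# Type strings mirroring the schema printed in the question.
sample_dtypes = [
    ("number", "string"),
    ("assigned_to", "struct<display_value:string,link:string>"),
    ("group", "struct<display_value:string,link:string>"),
]

exprs = build_select_exprs(sample_dtypes)
for expr in exprs:
    print(expr)

# With a live DataFrame this would be applied as:
# newdf = df23.selectExpr(*build_select_exprs(df23.dtypes))
```

Unlike df23.select(val_cols4), which returns only the struct-derived columns, this variant returns one "_value" column for every input column. It also checks dtype.startswith("struct") rather than 'struct' in cols, so a type such as array<struct<...>> is not mistaken for a plain struct.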

Solution 1
2 Accepted 2021-04-26 19:52:13
