How do I rename specific columns in PySpark?

Question

I have a dataframe in PySpark that is the result of a groupBy and agg, like this:

df1 = df.groupBy(['data', 'id']).pivot('type').agg(F.sum('value').alias("Values"), F.count('value').alias("Quantity"))

But I need the aliases ("Values" and "Quantity") to be prefixes of the column names, not suffixes.

Here is an example of the dataframe.

Result of my script:

data	id	some_type_Values	some_type_Quantity
2022-01-01	1234	12.50	2

Desired output:

data	id	Values some_type	Quantity some_type
2022-01-01	1234	12.50	2

What I have tried so far:

selected = df1.select([s for s in df1.columns if 'Values' in s])
select_volume = [col(col_name).alias("Values " + col_name)  for col_name in selected.columns]
df2 = df1.select(*select_volume)

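This works, but it splits my dataframe: only the "Values" columns are kept. I also still need to remove _Values and _Quantity from the end of the column names.

How can I rename the selected columns for each aggregation and remove the alias from the end of each name?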

Answer 1

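Python's rfind can be useful here: it returns the index of the last underscore, which is where the alias suffix starts.

Example dataframe: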

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('2022-01-01', 1234, 'some_type_1', 2),
     ('2022-01-01', 1234, 'some_type_2', 3)],
    ['data', 'id', 'type', 'value'])
df1 = df.groupBy(['data', 'id']).pivot('type').agg(F.sum('value').alias("Values"), F.count('value').alias("Quantity"))
df1.show()
# +----------+----+------------------+--------------------+------------------+--------------------+
# |      data|  id|some_type_1_Values|some_type_1_Quantity|some_type_2_Values|some_type_2_Quantity|
# +----------+----+------------------+--------------------+------------------+--------------------+
# |2022-01-01|1234|                 2|                   1|                 3|                   1|
# +----------+----+------------------+--------------------+------------------+--------------------+

Renaming script:

df1 = df1.select(
    *['data', 'id'],
    *[F.col(c).alias(f"{c[c.rfind('_')+1:]} {c[:c.rfind('_')]}") for c in df1.columns if c not in ['data', 'id']]
)
df1.show()
# +----------+----+------------------+--------------------+------------------+--------------------+
# |      data|  id|Values some_type_1|Quantity some_type_1|Values some_type_2|Quantity some_type_2|
# +----------+----+------------------+--------------------+------------------+--------------------+
# |2022-01-01|1234|                 2|                   1|                 3|                   1|
# +----------+----+------------------+--------------------+------------------+--------------------+
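The slicing inside the alias above can be checked in plain Python, without a Spark session. This is only a sketch: the helper name suffix_to_prefix and the guard for names without an underscore are assumptions added for illustration, not part of the original answer.

```python
def suffix_to_prefix(name: str) -> str:
    """Move the text after the last underscore to the front, separated by a space."""
    cut = name.rfind('_')  # index of the last underscore, or -1 if there is none
    if cut == -1:
        return name        # no underscore: leave the name unchanged
    return f"{name[cut + 1:]} {name[:cut]}"


columns = ['data', 'id', 'some_type_1_Values', 'some_type_1_Quantity']
keys = ['data', 'id']  # grouping columns that keep their names

renamed = [c if c in keys else suffix_to_prefix(c) for c in columns]
print(renamed)
```

Without the guard, a name with no underscore would be mangled, because rfind returns -1 and the second slice then drops the last character. That is why the answer filters out 'data' and 'id' before renaming.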

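toDF is also possible. It is less verbose, but it assigns the new names purely by position, so it is more error-prone if the column order is not what you expect.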

df1 = df1.toDF(
    *['data', 'id'],
    *[f"{c[c.rfind('_')+1:]} {c[:c.rfind('_')]}" for c in df1.columns if c not in ['data', 'id']]
)
df1.show()
# +----------+----+------------------+--------------------+------------------+--------------------+
# |      data|  id|Values some_type_1|Quantity some_type_1|Values some_type_2|Quantity some_type_2|
# +----------+----+------------------+--------------------+------------------+--------------------+
# |2022-01-01|1234|                 2|                   1|                 3|                   1|
# +----------+----+------------------+--------------------+------------------+--------------------+
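One way to reduce the positional risk is to build an explicit old-name to new-name mapping first and inspect it before touching the dataframe. The sketch below is plain Python; the function name build_rename_map and its suffixes parameter are hypothetical, and it uses str.rpartition in place of rfind.

```python
def build_rename_map(columns, keys, suffixes=('Values', 'Quantity')):
    """Map each column name to its new name; keys and unknown suffixes stay unchanged."""
    mapping = {}
    for c in columns:
        head, sep, tail = c.rpartition('_')  # split on the last underscore
        if c in keys or not sep or tail not in suffixes:
            mapping[c] = c                   # not an aggregated column: keep as-is
        else:
            mapping[c] = f"{tail} {head}"    # e.g. 'x_Values' -> 'Values x'
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("renaming would produce duplicate column names")
    return mapping


columns = ['data', 'id', 'some_type_1_Values', 'some_type_1_Quantity',
           'some_type_2_Values', 'some_type_2_Quantity']
rename_map = build_rename_map(columns, keys=['data', 'id'])
for old, new in rename_map.items():
    print(f"{old} -> {new}")
```

With a dataframe at hand, the mapping could then be applied by name rather than by position, for example with df1.select([F.col(old).alias(new) for old, new in rename_map.items()]).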

Solution 1
0 2022-07-26 20:13:53