Add a column to a DataFrame after selecting rows based on a column value

Question

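I have a weather forecast dataset, and I am interested in the following columns:

period (values: 1, 2, 3)
temp2m: the temperature 2 meters from the weather station.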

p1 = new_df.where(new_df.period == 1).select([c for c in df.columns if c in ['period','temp2m']]).show()

This code for p1 gives the following (first 5 rows):

+------+------+
|period|temp2m|
+------+------+
|     0|    12|
|     0|    13|
|     0|    11|
|     0|    13|
|     0|    10|
+------+------+

I want to add these temp2m results to the main DataFrame new_df as a column named temp2m_p1. Similarly, I also want to add temp2m_p2 and temp2m_p3. However, I could not find a way to do this at https://sparkbyexamples.com/pyspark/pyspark-add-new-column-to-dataframe/.

Answer 1

Please always provide a toy example and the expected result. Here is mine:

import pandas as pd

new_df = pd.DataFrame({'period': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'temp2m': [12, 13, 12, 20, 21, 22, 18, 18, 16]})

   period  temp2m
0       1      12
1       1      13
2       1      12
3       2      20
4       2      21
5       2      22
6       3      18
7       3      18
8       3      16

I believe you want:

for p in new_df['period'].unique():
    new_df[f'temp2m_p{p}'] = new_df['temp2m'].where(new_df['period'] == p)

The result is:

   period  temp2m  temp2m_p1  temp2m_p2  temp2m_p3
0       1      12       12.0        NaN        NaN
1       1      13       13.0        NaN        NaN
2       1      12       12.0        NaN        NaN
3       2      20        NaN       20.0        NaN
4       2      21        NaN       21.0        NaN
5       2      22        NaN       22.0        NaN
6       3      18        NaN        NaN       18.0
7       3      18        NaN        NaN       18.0
8       3      16        NaN        NaN       16.0
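
The float values and NaN above come from `Series.where`, which keeps a value where the condition is true and fills the rest with NaN, so the integer column is upcast to float. A minimal sketch, assuming you would rather keep integers: casting to pandas' nullable `Int64` dtype first leaves the missing entries as `<NA>`. The names `out` and `temp` are only for illustration.

```python
import pandas as pd

new_df = pd.DataFrame({'period': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'temp2m': [12, 13, 12, 20, 21, 22, 18, 18, 16]})

out = new_df.copy()                   # illustrative name; leaves new_df untouched
temp = out['temp2m'].astype('Int64')  # nullable integer dtype
for p in out['period'].unique():
    # keep temp2m where period == p, missing (<NA>) elsewhere
    out[f'temp2m_p{p}'] = temp.where(out['period'] == p)
```

The new columns then have dtype `Int64` instead of `float64`.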

Edit after comments:

df_transformed = pd.concat((new_df[new_df['period'] == p]['temp2m'].rename(f'temp2m_{p}').reset_index(drop=True) for p in new_df['period'].unique()), axis=1)

This gives:

   temp2m_1  temp2m_2  temp2m_3
0      12      20      18
1      13      21      18
2      12      22      16
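
The same reshaping can also be written as a pivot. This is a sketch of an alternative, not the answerer's method: it numbers the rows within each period using `groupby(...).cumcount()` and pivots on that counter. The `row` column and the `wide` name are made up for illustration.

```python
import pandas as pd

new_df = pd.DataFrame({'period': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'temp2m': [12, 13, 12, 20, 21, 22, 18, 18, 16]})

wide = (
    new_df.assign(row=new_df.groupby('period').cumcount())  # 0, 1, 2 within each period
          .pivot(index='row', columns='period', values='temp2m')
          .add_prefix('temp2m_')  # column labels 1, 2, 3 become temp2m_1, temp2m_2, temp2m_3
)
```

If the periods have different numbers of rows, the shorter columns are padded with NaN, as with the `pd.concat` version.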

Solution 1
0 Accepted 2022-05-05 08:26:07
