
Add a column to a DataFrame after selecting rows based on column values

I have a weather forecasting dataset and I am interested in the following columns:

  • period (values: 1, 2, 3)
  • temp2m: the temperature 2 meters away from a weather station.

p1 = new_df.where(new_df.period == 1).select('period', 'temp2m')
p1.show(5)

This code for p1 gives the following (first 5 rows):

+------+------+
|period|temp2m|
+------+------+
|     0|    12|
|     0|    13|
|     0|    11|
|     0|    13|
|     0|    10|
+------+------+

I would like to store the results of temp2m as temp2m_p1 in the main DataFrame new_df. Similarly, I'd like to add temp2m_p2 and temp2m_p3 as well. However, I have trouble finding a solution to this problem on https://sparkbyexamples.com/pyspark/pyspark-add-new-column-to-dataframe/ .

Please always provide a toy example and the expected result. Here is mine:

import pandas as pd
new_df = pd.DataFrame({'period': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'temp2m': [12, 13, 12, 20, 21, 22, 18, 18, 16]})
   period  temp2m
0       1      12
1       1      13
2       1      12
3       2      20
4       2      21
5       2      22
6       3      18
7       3      18
8       3      16

I believe you want:

for p in new_df['period'].unique():
    new_df[f'temp2m_p{p}'] = new_df['temp2m'].where(new_df['period'] == p)

Which results in:

   period  temp2m  temp2m_p1  temp2m_p2  temp2m_p3
0       1      12       12.0        NaN        NaN
1       1      13       13.0        NaN        NaN
2       1      12       12.0        NaN        NaN
3       2      20        NaN       20.0        NaN
4       2      21        NaN       21.0        NaN
5       2      22        NaN       22.0        NaN
6       3      18        NaN        NaN       18.0
7       3      18        NaN        NaN       18.0
8       3      16        NaN        NaN       16.0
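
The values show up as floats (12.0) because NaN forces the new columns to a float dtype. If whole-number columns are preferred, one possible variation is to cast to the nullable Int64 dtype before masking. This is only a sketch, and it assumes a pandas version that has the Int64 extension dtype (1.0 or later):

```python
import pandas as pd

new_df = pd.DataFrame({'period': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'temp2m': [12, 13, 12, 20, 21, 22, 18, 18, 16]})

# Cast once to the nullable integer dtype (assumption: pandas >= 1.0).
temp_int = new_df['temp2m'].astype('Int64')

for p in new_df['period'].unique():
    # where() keeps the value where the condition is True and puts a
    # missing value (<NA> for Int64) everywhere else.
    new_df[f'temp2m_p{p}'] = temp_int.where(new_df['period'] == p)

print(new_df.dtypes)
```

The missing entries are then displayed as <NA> instead of NaN, and the kept values stay integers.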

EDIT after the comments:

df_transformed = pd.concat((new_df[new_df['period'] == p]['temp2m'].rename(f'temp2m_{p}').reset_index(drop=True) for p in new_df['period'].unique()), axis=1)

That gives:

   temp2m_1  temp2m_2  temp2m_3
0      12      20      18
1      13      21      18
2      12      22      16
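
For what it's worth, roughly the same wide table can probably also be built with groupby/cumcount and pivot. The names `row` and `wide` below are made up for this illustration and are not part of the original answer:

```python
import pandas as pd

new_df = pd.DataFrame({'period': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                       'temp2m': [12, 13, 12, 20, 21, 22, 18, 18, 16]})

wide = (
    new_df
    # Number the rows inside each period: 0, 1, 2, ... (hypothetical helper column).
    .assign(row=new_df.groupby('period').cumcount())
    # One column per period, one row per position within the period.
    .pivot(index='row', columns='period', values='temp2m')
    # Turn the column labels 1, 2, 3 into temp2m_1, temp2m_2, temp2m_3.
    .add_prefix('temp2m_')
)

print(wide)
```

If the periods do not all have the same number of rows, the shorter ones should simply be padded with NaN.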


 