Using a pandas DataFrame, how to group by multiple columns and add a new column when some rows are missing
I want to group a dataframe by all rows that share the same values in the first three columns (A, B, C), and then add a new column H holding the value of the last column (G) from the row where the fourth column (D) equals 0.
The original dataframe looks like this:
A B C D E F G
0 11018 20190102 0 0 1546387200 37 34
1 11018 20190102 0 1 1546390800 33 36
2 11018 20190102 0 2 1546394400 19 19
3 11018 20190102 0 3 1546398000 17 26
4 11018 20190102 0 4 1546401600 16 26
5 11018 20190102 0 5 1546405200 13 23
6 11018 20190102 0 6 1546408800 11 15
7 11018 20190102 1200 0 1546430400 25 24
8 11018 20190102 1200 1 1546434000 21 3
9 11018 20190102 1200 2 1546437600 13 4
10 11018 20190102 1200 3 1546441200 7 3
11 11018 20190102 1200 4 1546444800 2 1
12 11018 20190102 1200 5 1546448400 -3 6
13 11018 20190102 1200 6 1546452000 -7 2
14 11035 20190103 0 0 1546473600 -15 -14
15 11035 20190103 0 1 1546477200 -17 -11
16 11035 20190103 0 2 1546480800 -20 -12
17 11035 20190103 0 3 1546484400 -23 -16
18 11035 20190103 0 4 1546488000 -26 -11
19 11035 20190103 0 5 1546491600 -28 -11
20 11035 20190103 0 6 1546495200 -27 -12
21 11031 20190103 1100 0 1546516800 0 1
22 11031 20190103 1100 1 1546520400 4 -7
23 11031 20190103 1100 2 1546524000 5 -6
24 11031 20190103 1100 3 1546527600 2 -16
25 11031 20190103 1100 4 1546531200 -3 -14
26 11031 20190103 1100 5 1546534800 -8 -12
27 11031 20190103 1100 6 1546538400 -12 -14
.
.
.
.
And the new dataframe should be:
A B C D E F G H
0 11018 20190102 0 0 1546387200 37 34 34
1 11018 20190102 0 1 1546390800 33 36 34
2 11018 20190102 0 2 1546394400 19 19 34
3 11018 20190102 0 3 1546398000 17 26 34
4 11018 20190102 0 4 1546401600 16 26 34
5 11018 20190102 0 5 1546405200 13 23 34
6 11018 20190102 0 6 1546408800 11 15 34
7 11018 20190102 1200 0 1546430400 25 24 24
8 11018 20190102 1200 1 1546434000 21 3 24
9 11018 20190102 1200 2 1546437600 13 4 24
10 11018 20190102 1200 3 1546441200 7 3 24
11 11018 20190102 1200 4 1546444800 2 1 24
12 11018 20190102 1200 5 1546448400 -3 6 24
13 11018 20190102 1200 6 1546452000 -7 2 24
14 11035 20190103 0 0 1546473600 -15 -14 -14
15 11035 20190103 0 1 1546477200 -17 -11 -14
16 11035 20190103 0 2 1546480800 -20 -12 -14
17 11035 20190103 0 3 1546484400 -23 -16 -14
18 11035 20190103 0 4 1546488000 -26 -11 -14
19 11035 20190103 0 5 1546491600 -28 -11 -14
20 11035 20190103 0 6 1546495200 -27 -12 -14
21 11031 20190103 1100 0 1546516800 0 1 1
22 11031 20190103 1100 1 1546520400 4 -7 1
23 11031 20190103 1100 2 1546524000 5 -6 1
24 11031 20190103 1100 3 1546527600 2 -16 1
25 11031 20190103 1100 4 1546531200 -3 -14 1
26 11031 20190103 1100 5 1546534800 -8 -12 1
27 11031 20190103 1100 6 1546538400 -12 -14 1
.
.
.
.
I already have a solution for this case, in the following form:
def col_6(df):
    df['H'] = df[df['D'] == 0]['G'].values[0]
    return df

df.groupby(['A','B','C']).apply(col_6)
However, in some groups the row where the fourth column (D) equals 0 is missing. In such cases, H should be set to NaN for the remaining rows of the group (those with D = 1, 2, ...).
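To illustrate why the col_6 function above cannot handle such groups, here is a minimal sketch. The two-row frame `group` is made up for illustration and stands in for one group without a D == 0 row; the filter then returns an empty array, so indexing it with [0] raises an IndexError.

```python
import pandas as pd

# Hypothetical group in which the D == 0 row is missing.
group = pd.DataFrame({"D": [1, 2], "G": [3, 4]})

# Same expression as in col_6: the filter matches nothing, so the array is empty.
matches = group[group["D"] == 0]["G"].values

try:
    h = matches[0]        # raises IndexError on an empty array
except IndexError:
    h = float("nan")      # the value wanted for such groups
```
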
For example, given this original frame:
A B C D E F G
0 11018 20190102 0 0 1546387200 37 34
1 11018 20190102 0 1 1546390800 33 36
2 11018 20190102 0 2 1546394400 19 19
3 11018 20190102 0 3 1546398000 17 26
4 11018 20190102 0 4 1546401600 16 26
5 11018 20190102 0 5 1546405200 13 23
6 11018 20190102 0 6 1546408800 11 15
7 11018 20190102 1200 1 1546434000 21 3
8 11018 20190102 1200 2 1546437600 13 4
9 11018 20190102 1200 3 1546441200 7 3
10 11018 20190102 1200 4 1546444800 2 1
11 11018 20190102 1200 5 1546448400 -3 6
12 11018 20190102 1200 6 1546452000 -7 2
The final frame should then look like this:
A B C D E F G H
0 11018 20190102 0 0 1546387200 37 34 34
1 11018 20190102 0 1 1546390800 33 36 34
2 11018 20190102 0 2 1546394400 19 19 34
3 11018 20190102 0 3 1546398000 17 26 34
4 11018 20190102 0 4 1546401600 16 26 34
5 11018 20190102 0 5 1546405200 13 23 34
6 11018 20190102 0 6 1546408800 11 15 34
7 11018 20190102 1200 1 1546434000 21 3 nan
8 11018 20190102 1200 2 1546437600 13 4 nan
9 11018 20190102 1200 3 1546441200 7 3 nan
10 11018 20190102 1200 4 1546444800 2 1 nan
11 11018 20190102 1200 5 1546448400 -3 6 nan
12 11018 20190102 1200 6 1546452000 -7 2 nan
Is there an efficient way to handle the missing rows (ideally based on the general solution above)?
Thanks a lot for your help!
First filter only the rows where D is 0 and aggregate G with first per group, then add the new column with DataFrame.join. Groups that have no D == 0 row get no entry in the aggregated Series, so the join leaves NaN in H for them:
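# G from the D == 0 row of each (A, B, C) group, named H
s = df[df['D'] == 0].groupby(['A', 'B', 'C'])['G'].first().rename('H')
# Align on the group keys; groups absent from s get NaN
df = df.join(s, on=['A', 'B', 'C'])
print(df)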
A B C D E F G H
0 11018 20190102 0 0 1546387200 37 34 34.0
1 11018 20190102 0 1 1546390800 33 36 34.0
2 11018 20190102 0 2 1546394400 19 19 34.0
3 11018 20190102 0 3 1546398000 17 26 34.0
4 11018 20190102 0 4 1546401600 16 26 34.0
5 11018 20190102 0 5 1546405200 13 23 34.0
6 11018 20190102 0 6 1546408800 11 15 34.0
7 11018 20190102 1200 1 1546434000 21 3 NaN
8 11018 20190102 1200 2 1546437600 13 4 NaN
9 11018 20190102 1200 3 1546441200 7 3 NaN
10 11018 20190102 1200 4 1546444800 2 1 NaN
11 11018 20190102 1200 5 1546448400 -3 6 NaN
12 11018 20190102 1200 6 1546452000 -7 2 NaN
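As an alternative to the join, the same result can probably be reached without building a separate Series, by masking G and broadcasting within each group. The sketch below is not part of the original answer; it uses a shortened, made-up subset of the example data (columns E and F omitted) and relies on the fact that the "first" aggregation skips NaN values.

```python
import pandas as pd

# Shortened illustration data: the C == 1200 group has no D == 0 row.
df = pd.DataFrame({
    "A": [11018] * 5,
    "B": [20190102] * 5,
    "C": [0, 0, 0, 1200, 1200],
    "D": [0, 1, 2, 1, 2],
    "G": [34, 36, 19, 3, 4],
})

# Keep G only on rows where D == 0; all other rows become NaN.
g_at_zero = df["G"].where(df["D"] == 0)

# Broadcast the first non-NaN value of each (A, B, C) group to all its rows.
# A group with no D == 0 row contains only NaN, so it stays NaN.
df["H"] = g_at_zero.groupby([df["A"], df["B"], df["C"]]).transform("first")

print(df)
```

As in the join version, H ends up as a float column, because NaN cannot be stored in an integer column.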