Pandas DataFrame: how to group by multiple columns and add a new column when some data is missing

Question

I want to group the rows of a dataframe that share the same values in the first three columns (A, B, C), and then add a new column H holding the value of the last column (G) from the row where the 4th column (D) equals 0.

The original dataframe looks like this:

          A         B     C  D           E   F    G
 0    11018  20190102     0  0  1546387200  37   34
 1    11018  20190102     0  1  1546390800  33   36
 2    11018  20190102     0  2  1546394400  19   19
 3    11018  20190102     0  3  1546398000  17   26
 4    11018  20190102     0  4  1546401600  16   26
 5    11018  20190102     0  5  1546405200  13   23
 6    11018  20190102     0  6  1546408800  11   15
 7    11018  20190102  1200  0  1546430400  25   24
 8    11018  20190102  1200  1  1546434000  21    3
 9    11018  20190102  1200  2  1546437600  13    4
 10   11018  20190102  1200  3  1546441200   7    3
 11   11018  20190102  1200  4  1546444800   2    1
 12   11018  20190102  1200  5  1546448400  -3    6
 13   11018  20190102  1200  6  1546452000  -7    2
 14   11035  20190103     0  0  1546473600 -15 -14
 15   11035  20190103     0  1  1546477200 -17 -11
 16   11035  20190103     0  2  1546480800 -20 -12
 17   11035  20190103     0  3  1546484400 -23 -16
 18   11035  20190103     0  4  1546488000 -26 -11
 19   11035  20190103     0  5  1546491600 -28 -11
 20   11035  20190103     0  6  1546495200 -27 -12
 21   11031  20190103  1100  0  1546516800   0   1
 22   11031  20190103  1100  1  1546520400   4  -7
 23   11031  20190103  1100  2  1546524000   5  -6
 24   11031  20190103  1100  3  1546527600   2 -16
 25   11031  20190103  1100  4  1546531200  -3 -14
 26   11031  20190103  1100  5  1546534800  -8 -12
 27   11031  20190103  1100  6  1546538400 -12 -14
 .
 .
 .
 .

And the new dataframe should be:

          A         B     C  D           E   F    G    H
 0    11018  20190102     0  0  1546387200  37   34   34
 1    11018  20190102     0  1  1546390800  33   36   34
 2    11018  20190102     0  2  1546394400  19   19   34
 3    11018  20190102     0  3  1546398000  17   26   34
 4    11018  20190102     0  4  1546401600  16   26   34
 5    11018  20190102     0  5  1546405200  13   23   34
 6    11018  20190102     0  6  1546408800  11   15   34
 7    11018  20190102  1200  0  1546430400  25   24   24
 8    11018  20190102  1200  1  1546434000  21    3   24
 9    11018  20190102  1200  2  1546437600  13    4   24
 10   11018  20190102  1200  3  1546441200   7    3   24
 11   11018  20190102  1200  4  1546444800   2    1   24
 12   11018  20190102  1200  5  1546448400  -3    6   24
 13   11018  20190102  1200  6  1546452000  -7    2   24
 14   11035  20190103     0  0  1546473600 -15 -14   -14
 15   11035  20190103     0  1  1546477200 -17 -11   -14
 16   11035  20190103     0  2  1546480800 -20 -12   -14
 17   11035  20190103     0  3  1546484400 -23 -16   -14
 18   11035  20190103     0  4  1546488000 -26 -11   -14
 19   11035  20190103     0  5  1546491600 -28 -11   -14
 20   11035  20190103     0  6  1546495200 -27 -12   -14
 21   11031  20190103  1100  0  1546516800   0   1     1
 22   11031  20190103  1100  1  1546520400   4  -7     1
 23   11031  20190103  1100  2  1546524000   5  -6     1
 24   11031  20190103  1100  3  1546527600   2 -16     1
 25   11031  20190103  1100  4  1546531200  -3 -14     1
 26   11031  20190103  1100  5  1546534800  -8 -12     1
 27   11031  20190103  1100  6  1546538400 -12 -14     1
 .
 .
 .
 .

I already have a solution for this case, in the following form:

def col_6(df):
    # Raises IndexError when the group has no row with D == 0
    df['H'] = df[df['D'] == 0]['G'].values[0]
    return df

df.groupby(['A','B','C']).apply(col_6)

However, in some groups the row where the 4th column equals 0 is missing. In such cases, H should be set to NaN for the remaining rows of that group (those with 4th column = 1, 2, ...).

So, for example, given this original frame:

          A         B     C  D           E   F    G
 0    11018  20190102     0  0  1546387200  37   34
 1    11018  20190102     0  1  1546390800  33   36
 2    11018  20190102     0  2  1546394400  19   19
 3    11018  20190102     0  3  1546398000  17   26
 4    11018  20190102     0  4  1546401600  16   26
 5    11018  20190102     0  5  1546405200  13   23
 6    11018  20190102     0  6  1546408800  11   15
 7    11018  20190102  1200  1  1546434000  21    3
 8    11018  20190102  1200  2  1546437600  13    4
 9    11018  20190102  1200  3  1546441200   7    3
 10   11018  20190102  1200  4  1546444800   2    1
 11   11018  20190102  1200  5  1546448400  -3    6
 12   11018  20190102  1200  6  1546452000  -7    2

The final frame should then look like this:

          A         B     C  D           E   F    G    H
 0    11018  20190102     0  0  1546387200  37   34   34
 1    11018  20190102     0  1  1546390800  33   36   34
 2    11018  20190102     0  2  1546394400  19   19   34
 3    11018  20190102     0  3  1546398000  17   26   34
 4    11018  20190102     0  4  1546401600  16   26   34
 5    11018  20190102     0  5  1546405200  13   23   34
 6    11018  20190102     0  6  1546408800  11   15   34
 7    11018  20190102  1200  1  1546434000  21    3   nan
 8    11018  20190102  1200  2  1546437600  13    4   nan
 9    11018  20190102  1200  3  1546441200   7    3   nan
 10   11018  20190102  1200  4  1546444800   2    1   nan
 11   11018  20190102  1200  5  1546448400  -3    6   nan
 12   11018  20190102  1200  6  1546452000  -7    2   nan

Is there an efficient way to handle the missing rows (based on the general solution above)?

Thanks a lot for your help!

Answer 1

First keep only the rows where D == 0 and take the first G value per group with first, then add the new column with DataFrame.join:

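# G from the D == 0 row of each (A, B, C) group; groups without one are absent from s
s = df[df['D'] == 0].groupby(['A','B','C'])['G'].first().rename('H')
# Join on the group keys; rows whose group is missing from s get NaN
df = df.join(s, on=['A','B','C'])
print(df)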
        A         B     C  D           E   F   G     H
0   11018  20190102     0  0  1546387200  37  34  34.0
1   11018  20190102     0  1  1546390800  33  36  34.0
2   11018  20190102     0  2  1546394400  19  19  34.0
3   11018  20190102     0  3  1546398000  17  26  34.0
4   11018  20190102     0  4  1546401600  16  26  34.0
5   11018  20190102     0  5  1546405200  13  23  34.0
6   11018  20190102     0  6  1546408800  11  15  34.0
7   11018  20190102  1200  1  1546434000  21   3   NaN
8   11018  20190102  1200  2  1546437600  13   4   NaN
9   11018  20190102  1200  3  1546441200   7   3   NaN
10  11018  20190102  1200  4  1546444800   2   1   NaN
11  11018  20190102  1200  5  1546448400  -3   6   NaN
12  11018  20190102  1200  6  1546452000  -7   2   NaN
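
As an alternative to the join, the same result can probably be obtained in a single expression with Series.where and GroupBy.transform. The following is a minimal sketch, not part of the original answer; it assumes a shortened version of the sample frame (only columns A, B, C, D and G, with fewer rows), and the names H_join and check are introduced here purely for the comparison.

```python
import pandas as pd

# Shortened sample frame (an assumption for illustration): the C == 1200
# group has no row with D == 0.
df = pd.DataFrame({
    'A': [11018] * 6,
    'B': [20190102] * 6,
    'C': [0, 0, 0, 1200, 1200, 1200],
    'D': [0, 1, 2, 1, 2, 3],
    'G': [34, 36, 19, 3, 4, 3],
})

# Keep G only where D == 0 (NaN elsewhere), then broadcast the first
# non-null value of each (A, B, C) group to every row of that group.
# A group without a D == 0 row contains only NaN, so it stays NaN.
df['H'] = (
    df['G'].where(df['D'] == 0)
           .groupby([df['A'], df['B'], df['C']])
           .transform('first')
)

# Cross-check against the join-based approach from the answer.
s = df[df['D'] == 0].groupby(['A', 'B', 'C'])['G'].first().rename('H_join')
check = df.join(s, on=['A', 'B', 'C'])
print(check)
```

As with the join, H comes out as a float column here, because NaN cannot be stored in a default integer column. Which variant reads better is a matter of taste; the join keeps the per-group lookup table s around, which can be handy if it is needed again.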

Answer 1: accepted, 2019-01-25 09:53:59