
Using a pandas DataFrame, how to group by multiple columns and add a new column when data is missing

I want to group a dataframe by the rows that share the same values in the first three columns (A, B, C), and then add a new column (H) holding the value of the last column (G) from the row in each group where the 4th column (D) equals 0.

The original dataframe looks like this:

          A         B     C  D           E   F    G
 0    11018  20190102     0  0  1546387200  37   34
 1    11018  20190102     0  1  1546390800  33   36
 2    11018  20190102     0  2  1546394400  19   19
 3    11018  20190102     0  3  1546398000  17   26
 4    11018  20190102     0  4  1546401600  16   26
 5    11018  20190102     0  5  1546405200  13   23
 6    11018  20190102     0  6  1546408800  11   15
 7    11018  20190102  1200  0  1546430400  25   24
 8    11018  20190102  1200  1  1546434000  21    3
 9    11018  20190102  1200  2  1546437600  13    4
 10   11018  20190102  1200  3  1546441200   7    3
 11   11018  20190102  1200  4  1546444800   2    1
 12   11018  20190102  1200  5  1546448400  -3    6
 13   11018  20190102  1200  6  1546452000  -7    2
 14   11035  20190103     0  0  1546473600 -15 -14
 15   11035  20190103     0  1  1546477200 -17 -11
 16   11035  20190103     0  2  1546480800 -20 -12
 17   11035  20190103     0  3  1546484400 -23 -16
 18   11035  20190103     0  4  1546488000 -26 -11
 19   11035  20190103     0  5  1546491600 -28 -11
 20   11035  20190103     0  6  1546495200 -27 -12
 21   11031  20190103  1100  0  1546516800   0   1
 22   11031  20190103  1100  1  1546520400   4  -7
 23   11031  20190103  1100  2  1546524000   5  -6
 24   11031  20190103  1100  3  1546527600   2 -16
 25   11031  20190103  1100  4  1546531200  -3 -14
 26   11031  20190103  1100  5  1546534800  -8 -12
 27   11031  20190103  1100  6  1546538400 -12 -14
 .
 .
 .
 .

And the new dataframe should be:

          A         B     C  D           E   F    G    H
 0    11018  20190102     0  0  1546387200  37   34   34
 1    11018  20190102     0  1  1546390800  33   36   34
 2    11018  20190102     0  2  1546394400  19   19   34
 3    11018  20190102     0  3  1546398000  17   26   34
 4    11018  20190102     0  4  1546401600  16   26   34
 5    11018  20190102     0  5  1546405200  13   23   34
 6    11018  20190102     0  6  1546408800  11   15   34
 7    11018  20190102  1200  0  1546430400  25   24   24
 8    11018  20190102  1200  1  1546434000  21    3   24
 9    11018  20190102  1200  2  1546437600  13    4   24
 10   11018  20190102  1200  3  1546441200   7    3   24
 11   11018  20190102  1200  4  1546444800   2    1   24
 12   11018  20190102  1200  5  1546448400  -3    6   24
 13   11018  20190102  1200  6  1546452000  -7    2   24
 14   11035  20190103     0  0  1546473600 -15 -14   -14
 15   11035  20190103     0  1  1546477200 -17 -11   -14
 16   11035  20190103     0  2  1546480800 -20 -12   -14
 17   11035  20190103     0  3  1546484400 -23 -16   -14
 18   11035  20190103     0  4  1546488000 -26 -11   -14
 19   11035  20190103     0  5  1546491600 -28 -11   -14
 20   11035  20190103     0  6  1546495200 -27 -12   -14
 21   11031  20190103  1100  0  1546516800   0   1     1
 22   11031  20190103  1100  1  1546520400   4  -7     1
 23   11031  20190103  1100  2  1546524000   5  -6     1
 24   11031  20190103  1100  3  1546527600   2 -16     1
 25   11031  20190103  1100  4  1546531200  -3 -14     1
 26   11031  20190103  1100  5  1546534800  -8 -12     1
 27   11031  20190103  1100  6  1546538400 -12 -14     1
 .
 .
 .
 .

For this case I already have a solution of the form:

def col_6(df):
    df['H'] = df[df['D'] == 0]['G'].values[0]
    return df

df.groupby(['A','B','C']).apply(col_6)

However, in some cases the row where the 4th column (D) equals 0 is missing. In such cases, H should be set to NaN for the remaining rows of that group (those with D = 1, 2, ...).
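The failure mode described here can be reproduced on a small frame: when a group has no row with D == 0, the selection is empty and indexing it with `.values[0]` raises an IndexError. The following is only a sketch of a guarded variant of the per-group function, not the accepted approach. The sample frame and the name `col_6_safe` are made up for illustration, and the groups are iterated explicitly instead of going through `apply`, on the assumption that this avoids version-dependent `apply` behavior.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: the group with C == 1200 has no row where D == 0.
df = pd.DataFrame({
    'A': [11018] * 4,
    'B': [20190102] * 4,
    'C': [0, 0, 1200, 1200],
    'D': [0, 1, 1, 2],
    'G': [34, 36, 3, 4],
})

def col_6_safe(group):
    """Return a copy of the group with H = G at D == 0, or NaN if no such row."""
    zero_rows = group.loc[group['D'] == 0, 'G']
    group = group.copy()
    # Guard against the empty selection instead of indexing it blindly.
    group['H'] = zero_rows.iloc[0] if not zero_rows.empty else np.nan
    return group

# Process each group separately, then stitch the pieces back together.
parts = [col_6_safe(g) for _, g in df.groupby(['A', 'B', 'C'], sort=False)]
result = pd.concat(parts).sort_index()
print(result)
```

This keeps the shape of the original function, but it still runs Python code once per group, so it may be slow on large frames.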

So, for example, given this original frame:

          A         B     C  D           E   F    G
 0    11018  20190102     0  0  1546387200  37   34
 1    11018  20190102     0  1  1546390800  33   36
 2    11018  20190102     0  2  1546394400  19   19
 3    11018  20190102     0  3  1546398000  17   26
 4    11018  20190102     0  4  1546401600  16   26
 5    11018  20190102     0  5  1546405200  13   23
 6    11018  20190102     0  6  1546408800  11   15
 7    11018  20190102  1200  1  1546434000  21    3
 8    11018  20190102  1200  2  1546437600  13    4
 9    11018  20190102  1200  3  1546441200   7    3
 10   11018  20190102  1200  4  1546444800   2    1
 11   11018  20190102  1200  5  1546448400  -3    6
 12   11018  20190102  1200  6  1546452000  -7    2

The final frame should then look like this:

          A         B     C  D           E   F    G    H
 0    11018  20190102     0  0  1546387200  37   34   34
 1    11018  20190102     0  1  1546390800  33   36   34
 2    11018  20190102     0  2  1546394400  19   19   34
 3    11018  20190102     0  3  1546398000  17   26   34
 4    11018  20190102     0  4  1546401600  16   26   34
 5    11018  20190102     0  5  1546405200  13   23   34
 6    11018  20190102     0  6  1546408800  11   15   34
 7    11018  20190102  1200  1  1546434000  21    3   nan
 8    11018  20190102  1200  2  1546437600  13    4   nan
 9    11018  20190102  1200  3  1546441200   7    3   nan
 10   11018  20190102  1200  4  1546444800   2    1   nan
 11   11018  20190102  1200  5  1546448400  -3    6   nan
 12   11018  20190102  1200  6  1546452000  -7    2   nan

Is there an efficient way to handle the missing rows (building on the general solution above)?

Thanks a lot for your help!

First keep only the rows where D equals 0 and take the first G value per group with first, then add the new column with DataFrame.join:

s = df[df['D'] == 0].groupby(['A','B','C'])['G'].first().rename('H')
df = df.join(s, on=['A','B','C'])
print(df)
        A         B     C  D           E   F   G     H
0   11018  20190102     0  0  1546387200  37  34  34.0
1   11018  20190102     0  1  1546390800  33  36  34.0
2   11018  20190102     0  2  1546394400  19  19  34.0
3   11018  20190102     0  3  1546398000  17  26  34.0
4   11018  20190102     0  4  1546401600  16  26  34.0
5   11018  20190102     0  5  1546405200  13  23  34.0
6   11018  20190102     0  6  1546408800  11  15  34.0
7   11018  20190102  1200  1  1546434000  21   3   NaN
8   11018  20190102  1200  2  1546437600  13   4   NaN
9   11018  20190102  1200  3  1546441200   7   3   NaN
10  11018  20190102  1200  4  1546444800   2   1   NaN
11  11018  20190102  1200  5  1546448400  -3   6   NaN
12  11018  20190102  1200  6  1546452000  -7   2   NaN
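As a possible alternative to the filter-and-join approach, the same column can be built without an intermediate Series by masking G wherever D is not 0 and broadcasting the first non-missing value of each group with `transform('first')`. This is a sketch on a made-up four-row frame, not part of the original answer. It relies on the groupby `first` aggregation skipping NaN, so a group with no D == 0 row ends up entirely NaN.

```python
import pandas as pd

# Hypothetical sample: the group with C == 1200 has no row where D == 0.
df = pd.DataFrame({
    'A': [11018] * 4,
    'B': [20190102] * 4,
    'C': [0, 0, 1200, 1200],
    'D': [0, 1, 1, 2],
    'G': [34, 36, 3, 4],
})

# Keep G only on rows where D == 0; every other row becomes NaN.
masked = df['G'].where(df['D'].eq(0))

# Broadcast the first non-NaN value of each (A, B, C) group to all its rows.
keys = [df['A'], df['B'], df['C']]
df['H'] = masked.groupby(keys).transform('first')
print(df)
```

Like the join version, this produces a float column, because NaN cannot be stored in an integer column.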
