简体   繁体   English

在 pandas 数据框中如何删除一些汇总的重复行

[英]In pandas data frame how to remove some summarized duplicates rows

I have the below pandas data frame.我有以下 pandas 数据框。

d = {'id1': ['85643', '85644','85643','8564312','8564314','85645','8564316','85646','8564318','85647','85648','85649','85655'],'ID': ['G-00001', 'G-00001','G-00002','G-00002','G-00002','G-00001','G-00002','G-00001','G-00002','G-00001','G-00001','G-00001','G-00001'],'col1': [671, 2,5,3,4,5,60,0,0,6,3,2,4],'Goal': [np.nan, 56,78,np.nan,89,73,np.nan ,np.nan ,np.nan, np.nan, np.nan, 34,np.nan ], 'col2': [793, 4,8,32,43,55,610,0,0,16,23,72,48],'col3': [500, 22,89,33,44,55,60,1,5,6,3,2,4],'Name': ['aasd', 'aasd','aabsd','aabsd','aabsd','aasd','aabsd','aasd','aabsd','aasd','aasd','aasd','aasd'],'Date': ['2021-06-13', '2021-06-13','2021-06-14','2021-06-13','2021-06-14','2021-06-15','2021-06-15','2021-06-13','2021-06-16','2021-06-13','2021-06-13','2021-06-13','2021-06-16']}

dff = pd.DataFrame(data=d)
dff
    id1         ID      col1    Goal    col2   col3  Name   Date
0   85643   G-00001     671     NaN     793    500  aasd    2021-06-13
1   85644   G-00001     2       56.0000 4      22   aasd    2021-06-13
2   85643   G-00002     5       78.0000 8      89   aabsd   2021-06-14
3   8564312 G-00002     3       NaN     32     33   aabsd   2021-06-13
4   8564314 G-00002     4       89.0000 43     44   aabsd   2021-06-14
5   85645   G-00001     5       73.0000 55     55   aasd    2021-06-15
6   8564316 G-00002     60      NaN     610    60   aabsd   2021-06-15
7   85646   G-00001     0       NaN     0      1    aasd    2021-06-13
8   8564318 G-00002     0       NaN     0      5    aabsd   2021-06-16
9   85647   G-00001     6       NaN     16     6    aasd    2021-06-13
10  85648   G-00001     3       NaN     23     3    aasd    2021-06-13
11  85649   G-00001     2       34.0000 72     2    aasd    2021-06-13
12  85655   G-00001     4       NaN     48     4    aasd    2021-06-16

I want to summarize some of the columns and add them back to the same data frame based on some ids in the "id1" column.我想总结一些列,并根据“id1”列中的一些 id 将它们添加回同一个数据框。 Then I want to give a new name to the "ID" column when we add that row.然后,当我们添加该行时,我想给“ID”列一个新名称。

Based on below "id1" column ids I want to summarize only "col1","col2","col3",and "Name" columns.基于下面的“id1”列ID,我只想总结“col1”、“col2”、“col3”和“Name”列。 Then I want to add that row back to the same data frame and give a new id for the "ID" column.然后我想将该行添加回同一个数据框,并为“ID”列提供一个新的 id。 However, I want to do that by the "ID" column for my calculations.但是,我想通过“ID”列进行计算。 (Like groupby) (如群比)

In the below function I'm aggregating the sum for col1,col2, and col3 with average.在下面的 function 中,我将 col1、col2 和 col3 的总和与平均值相加。

ID_list = ['G-00001','G-00002']
def sumarizeValues(Filter,Orginal):
    b65 = ['85643','85645', '85655','85646'] # for G-00001
    b66 = ['85643','85645','85647','85648','85649','85644'] # for G-00001
    b67 = ['85643','8564312','8564314','8564316','8564318'] # for G-00002

    # create a dictionary
    d_map = {'b65': b65, 'b66': b66, 'b67': b67}
    # dictionary comprehension
    df = pd.DataFrame({k: dff[dff['id1'].isin(v)].agg({'col1': sum, 'col2': sum,
                                                   'col3': 'mean', 'Name': min})
                       for k,v in d_map.items()}).T.reset_index()
    # rename the columns
    df = df.rename(columns={'index': 'ID'})
    # concat the two frames
    #pd.concat([dff, df]).reset_index(drop=True)
    Orginal = pd.concat([Orginal, df]).reset_index(drop=True)
    return Orginal

I only want to create a summarized row if the ID has values in slices.如果 ID 在切片中有值,我只想创建一个汇总行。 for example in the ID_list first I'm taking 'G-00001' and creating summarized rows based on id1 slicers(b65,b66,b67).例如,首先在 ID_list 中,我使用“G-00001”并基于 id1 切片器(b65,b66,b67)创建汇总行。 However, the function I created giving me some additional rows like below.但是,我创建的 function 给了我一些额外的行,如下所示。 How can I eliminate those rows?我怎样才能消除这些行?

So final data frame look like below所以最终的数据框如下所示

ID_list = ['G-00001','G-00002']
def abcFunction(dff):
    for ID in ID_list:
        print(ID)
        IDlist =[ID]
        print(IDlist)
        Filter =  dff[dff['ID'].isin(IDlist)]
        dff = sumarizeValues(Filter,dff)
        print(dff)
        ## calculation
        ## calculation
        ## calculation

abcFunction(dff)
# So for the first ID (G-00001), I actually don't need the last row(15th index containing b67).
# I only need that row for the G-00002 calcualtion.

G-00001
['G-00001']
        id1       ID col1    Goal  col2     col3   Name        Date
0     85643  G-00001  671     NaN   793      500   aasd  2021-06-13
1     85644  G-00001    2 56.0000     4       22   aasd  2021-06-13
2     85643  G-00002    5 78.0000     8       89  aabsd  2021-06-14
3   8564312  G-00002    3     NaN    32       33  aabsd  2021-06-13
4   8564314  G-00002    4 89.0000    43       44  aabsd  2021-06-14
5     85645  G-00001    5 73.0000    55       55   aasd  2021-06-15
6   8564316  G-00002   60     NaN   610       60  aabsd  2021-06-15
7     85646  G-00001    0     NaN     0        1   aasd  2021-06-13
8   8564318  G-00002    0     NaN     0        5  aabsd  2021-06-16
9     85647  G-00001    6     NaN    16        6   aasd  2021-06-13
10    85648  G-00001    3     NaN    23        3   aasd  2021-06-13
11    85649  G-00001    2 34.0000    72        2   aasd  2021-06-13
12    85655  G-00001    4     NaN    48        4   aasd  2021-06-16
13      NaN      b65  685     NaN   904 129.8000  aabsd         NaN
14      NaN      b66  694     NaN   971  96.7143  aabsd         NaN
15      NaN      b67  743     NaN  1486 121.8333  aabsd         NaN


# When I ran it for G-00002, it actually contains all the other rows created. 
#So for the second ID (G-00002), I actually don't need the row(13th index to 17th index 
#containing b65,b66, and b67 in index 15). Because G-00002 doesn't 
#contain any values in b65,b66.  

G-00002
['G-00002']
        id1       ID col1    Goal  col2     col3   Name        Date
0     85643  G-00001  671     NaN   793      500   aasd  2021-06-13
1     85644  G-00001    2 56.0000     4       22   aasd  2021-06-13
2     85643  G-00002    5 78.0000     8       89  aabsd  2021-06-14
3   8564312  G-00002    3     NaN    32       33  aabsd  2021-06-13
4   8564314  G-00002    4 89.0000    43       44  aabsd  2021-06-14
5     85645  G-00001    5 73.0000    55       55   aasd  2021-06-15
6   8564316  G-00002   60     NaN   610       60  aabsd  2021-06-15
7     85646  G-00001    0     NaN     0        1   aasd  2021-06-13
8   8564318  G-00002    0     NaN     0        5  aabsd  2021-06-16
9     85647  G-00001    6     NaN    16        6   aasd  2021-06-13
10    85648  G-00001    3     NaN    23        3   aasd  2021-06-13
11    85649  G-00001    2 34.0000    72        2   aasd  2021-06-13
12    85655  G-00001    4     NaN    48        4   aasd  2021-06-16
13      NaN      b65  685     NaN   904 129.8000  aabsd         NaN
14      NaN      b66  694     NaN   971  96.7143  aabsd         NaN
15      NaN      b67  743     NaN  1486 121.8333  aabsd         NaN
16      NaN      b65  685     NaN   904 129.8000  aabsd         NaN
17      NaN      b66  694     NaN   971  96.7143  aabsd         NaN
18      NaN      b67  743     NaN  1486 121.8333  aabsd         NaN

Is it possible to do that?有可能这样做吗? Any help is appreciated!任何帮助表示赞赏! Thanks in advance!提前致谢!

Change your mapping dictionary so each ID is mapped to the required new IDs.更改映射字典,以便将每个 ID 映射到所需的新 ID。 Try:尝试:

def summarizeValues(df, ID):
    mapper = {"G-00001": {"b65": ['85643', '85645', '85655','85646'],
                         "b66": ['85643', '85645', '85647', '85648', '85649', '85644']},
              "G-00002": {"b67": ['85643', '8564312', '8564314', '8564316', '8564318']}
              }
    
    # dictionary comprehension
    dff = pd.DataFrame({k: df[df['id1'].isin(v)].agg({'col1': sum, 
                                                      'col2': sum, 
                                                      'col3': 'mean', 
                                                      'Name': min})
                        for k,v in mapper[ID].items()})
    
    dff = dff.T.reset_index().rename(columns={'index': 'ID'})
    
    return pd.concat([df, dff]).reset_index(drop=True)

output = dict()
for ID in ['G-00001','G-00002']:
    output[ID] = summarizeValues(df, ID)
Output: Output:
>>> output["G-00001"]
        id1       ID col1  Goal col2       col3   Name        Date
0     85643  G-00001  671   NaN  793        500   aasd  2021-06-13
1     85644  G-00001    2  56.0    4         22   aasd  2021-06-13
2     85643  G-00002    5  78.0    8         89  aabsd  2021-06-14
3   8564312  G-00002    3   NaN   32         33  aabsd  2021-06-13
4   8564314  G-00002    4  89.0   43         44  aabsd  2021-06-14
5     85645  G-00001    5  73.0   55         55   aasd  2021-06-15
6   8564316  G-00002   60   NaN  610         60  aabsd  2021-06-15
7     85646  G-00001    0   NaN    0          1   aasd  2021-06-13
8   8564318  G-00002    0   NaN    0          5  aabsd  2021-06-16
9     85647  G-00001    6   NaN   16          6   aasd  2021-06-13
10    85648  G-00001    3   NaN   23          3   aasd  2021-06-13
11    85649  G-00001    2  34.0   72          2   aasd  2021-06-13
12    85655  G-00001    4   NaN   48          4   aasd  2021-06-16
13      NaN      b65  685   NaN  904      129.8  aabsd         NaN
14      NaN      b66  694   NaN  971  96.714286  aabsd         NaN

>>> output["G-00002"]
        id1       ID col1  Goal  col2        col3   Name        Date
0     85643  G-00001  671   NaN   793         500   aasd  2021-06-13
1     85644  G-00001    2  56.0     4          22   aasd  2021-06-13
2     85643  G-00002    5  78.0     8          89  aabsd  2021-06-14
3   8564312  G-00002    3   NaN    32          33  aabsd  2021-06-13
4   8564314  G-00002    4  89.0    43          44  aabsd  2021-06-14
5     85645  G-00001    5  73.0    55          55   aasd  2021-06-15
6   8564316  G-00002   60   NaN   610          60  aabsd  2021-06-15
7     85646  G-00001    0   NaN     0           1   aasd  2021-06-13
8   8564318  G-00002    0   NaN     0           5  aabsd  2021-06-16
9     85647  G-00001    6   NaN    16           6   aasd  2021-06-13
10    85648  G-00001    3   NaN    23           3   aasd  2021-06-13
11    85649  G-00001    2  34.0    72           2   aasd  2021-06-13
12    85655  G-00001    4   NaN    48           4   aasd  2021-06-16
13      NaN      b67  743   NaN  1486  121.833333  aabsd         NaN

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM