In a pandas data frame, how to remove some duplicated summary rows
I have the pandas data frame below.
import numpy as np
import pandas as pd

d = {
    'id1': ['85643', '85644', '85643', '8564312', '8564314', '85645', '8564316',
            '85646', '8564318', '85647', '85648', '85649', '85655'],
    'ID': ['G-00001', 'G-00001', 'G-00002', 'G-00002', 'G-00002', 'G-00001', 'G-00002',
           'G-00001', 'G-00002', 'G-00001', 'G-00001', 'G-00001', 'G-00001'],
    'col1': [671, 2, 5, 3, 4, 5, 60, 0, 0, 6, 3, 2, 4],
    'Goal': [np.nan, 56, 78, np.nan, 89, 73, np.nan, np.nan, np.nan, np.nan, np.nan, 34, np.nan],
    'col2': [793, 4, 8, 32, 43, 55, 610, 0, 0, 16, 23, 72, 48],
    'col3': [500, 22, 89, 33, 44, 55, 60, 1, 5, 6, 3, 2, 4],
    'Name': ['aasd', 'aasd', 'aabsd', 'aabsd', 'aabsd', 'aasd', 'aabsd',
             'aasd', 'aabsd', 'aasd', 'aasd', 'aasd', 'aasd'],
    'Date': ['2021-06-13', '2021-06-13', '2021-06-14', '2021-06-13', '2021-06-14', '2021-06-15', '2021-06-15',
             '2021-06-13', '2021-06-16', '2021-06-13', '2021-06-13', '2021-06-13', '2021-06-16'],
}
dff = pd.DataFrame(data=d)
dff
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 671 NaN 793 500 aasd 2021-06-13
1 85644 G-00001 2 56.0000 4 22 aasd 2021-06-13
2 85643 G-00002 5 78.0000 8 89 aabsd 2021-06-14
3 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
4 8564314 G-00002 4 89.0000 43 44 aabsd 2021-06-14
5 85645 G-00001 5 73.0000 55 55 aasd 2021-06-15
6 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
7 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
8 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
9 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
10 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
11 85649 G-00001 2 34.0000 72 2 aasd 2021-06-13
12 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
I want to summarize some of the columns and add the result back to the same data frame, based on lists of ids in the "id1" column. The added row should get a new name in the "ID" column.
For the "id1" lists below, I want to summarize only the "col1", "col2", "col3", and "Name" columns, append that row to the same data frame, and give it a new id in the "ID" column. However, I want to do this per "ID" value for my calculations (like a groupby).
In the function below I aggregate col1 and col2 with sum, col3 with mean, and Name with min.
ID_list = ['G-00001','G-00002']
def sumarizeValues(Filter,Orginal):
    # note: Filter is never used below; the aggregation runs on the global dff
    b65 = ['85643','85645', '85655','85646'] # for G-00001
    b66 = ['85643','85645','85647','85648','85649','85644'] # for G-00001
    b67 = ['85643','8564312','8564314','8564316','8564318'] # for G-00002
    # create a dictionary
    d_map = {'b65': b65, 'b66': b66, 'b67': b67}
    # dictionary comprehension: one aggregated Series per slice
    df = pd.DataFrame({k: dff[dff['id1'].isin(v)].agg({'col1': sum, 'col2': sum,
                                                       'col3': 'mean', 'Name': min})
                       for k,v in d_map.items()}).T.reset_index()
    # rename the columns
    df = df.rename(columns={'index': 'ID'})
    # concat the two frames
    #pd.concat([dff, df]).reset_index(drop=True)
    Orginal = pd.concat([Orginal, df]).reset_index(drop=True)
    return Orginal
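As a side note on how the function above builds its summary rows: calling `agg` with a dictionary of single functions returns one Series per slice, indexed by column name, so the slice keys end up as columns and the `.T` is what turns them into rows. A minimal sketch on a toy frame (the values and the names `toy`, `slices`, `s1`, `s2` are made up for illustration and are not the data from the question):

```python
import pandas as pd

# Toy frame with hypothetical values (not the question's data).
toy = pd.DataFrame({
    "id1": ["a", "b", "c"],
    "col1": [1, 2, 3],
    "col3": [10.0, 20.0, 60.0],
    "Name": ["x", "y", "x"],
})

# Hypothetical slices: each key maps to a list of id1 values.
slices = {"s1": ["a", "b"], "s2": ["c"]}

# agg with a dict of single functions returns a Series indexed by column name.
per_slice = {
    key: toy[toy["id1"].isin(ids)].agg({"col1": "sum", "col3": "mean", "Name": "min"})
    for key, ids in slices.items()
}

# The slice keys are columns at this point, so transpose to get one row per slice,
# then move the slice key out of the index into an "ID" column.
summary = pd.DataFrame(per_slice).T.reset_index().rename(columns={"index": "ID"})
print(summary)
```

Because every slice in the dictionary is aggregated regardless of which ID is being processed, each call produces one row per slice.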
I only want to create a summarized row if the ID has values in the slice. For example, for the first entry in ID_list, 'G-00001', I create summarized rows based on the id1 slices (b65, b66, b67). However, the function I wrote gives me some additional rows, as shown below. How can I eliminate those rows?
So the final data frame currently looks like the output below.
ID_list = ['G-00001','G-00002']
def abcFunction(dff):
    for ID in ID_list:
        print(ID)
        IDlist =[ID]
        print(IDlist)
        Filter = dff[dff['ID'].isin(IDlist)]
        dff = sumarizeValues(Filter,dff)
        print(dff)
        ## calculation
        ## calculation
        ## calculation
abcFunction(dff)
# So for the first ID (G-00001), I actually don't need the last row (index 15, containing b67).
# I only need that row for the G-00002 calculation.
G-00001
['G-00001']
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 671 NaN 793 500 aasd 2021-06-13
1 85644 G-00001 2 56.0000 4 22 aasd 2021-06-13
2 85643 G-00002 5 78.0000 8 89 aabsd 2021-06-14
3 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
4 8564314 G-00002 4 89.0000 43 44 aabsd 2021-06-14
5 85645 G-00001 5 73.0000 55 55 aasd 2021-06-15
6 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
7 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
8 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
9 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
10 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
11 85649 G-00001 2 34.0000 72 2 aasd 2021-06-13
12 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
13 NaN b65 685 NaN 904 129.8000 aabsd NaN
14 NaN b66 694 NaN 971 96.7143 aabsd NaN
15 NaN b67 743 NaN 1486 121.8333 aabsd NaN
# When I run it for G-00002, the output still contains all the rows created earlier.
# So for the second ID (G-00002), I actually don't need the rows at index 13 to 17
# (b65 and b66, plus the earlier b67 at index 15), because G-00002 doesn't
# contain any values in b65 or b66.
G-00002
['G-00002']
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 671 NaN 793 500 aasd 2021-06-13
1 85644 G-00001 2 56.0000 4 22 aasd 2021-06-13
2 85643 G-00002 5 78.0000 8 89 aabsd 2021-06-14
3 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
4 8564314 G-00002 4 89.0000 43 44 aabsd 2021-06-14
5 85645 G-00001 5 73.0000 55 55 aasd 2021-06-15
6 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
7 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
8 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
9 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
10 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
11 85649 G-00001 2 34.0000 72 2 aasd 2021-06-13
12 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
13 NaN b65 685 NaN 904 129.8000 aabsd NaN
14 NaN b66 694 NaN 971 96.7143 aabsd NaN
15 NaN b67 743 NaN 1486 121.8333 aabsd NaN
16 NaN b65 685 NaN 904 129.8000 aabsd NaN
17 NaN b66 694 NaN 971 96.7143 aabsd NaN
18 NaN b67 743 NaN 1486 121.8333 aabsd NaN
Is it possible to do that? Any help is appreciated! Thanks in advance!
Change your mapping dictionary so that each ID is mapped to its required new IDs. Try:
def summarizeValues(df, ID):
    # each ID maps only to the slices that belong to it
    mapper = {"G-00001": {"b65": ['85643', '85645', '85655','85646'],
                          "b66": ['85643', '85645', '85647', '85648', '85649', '85644']},
              "G-00002": {"b67": ['85643', '8564312', '8564314', '8564316', '8564318']}
              }
    # dictionary comprehension: one aggregated Series per slice of this ID
    dff = pd.DataFrame({k: df[df['id1'].isin(v)].agg({'col1': 'sum',
                                                      'col2': 'sum',
                                                      'col3': 'mean',
                                                      'Name': 'min'})
                        for k,v in mapper[ID].items()})
    dff = dff.T.reset_index().rename(columns={'index': 'ID'})
    return pd.concat([df, dff]).reset_index(drop=True)

output = dict()
for ID in ['G-00001','G-00002']:
    # dff is the original data frame built in the question
    output[ID] = summarizeValues(dff, ID)
>>> output["G-00001"]
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 671 NaN 793 500 aasd 2021-06-13
1 85644 G-00001 2 56.0 4 22 aasd 2021-06-13
2 85643 G-00002 5 78.0 8 89 aabsd 2021-06-14
3 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
4 8564314 G-00002 4 89.0 43 44 aabsd 2021-06-14
5 85645 G-00001 5 73.0 55 55 aasd 2021-06-15
6 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
7 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
8 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
9 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
10 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
11 85649 G-00001 2 34.0 72 2 aasd 2021-06-13
12 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
13 NaN b65 685 NaN 904 129.8 aabsd NaN
14 NaN b66 694 NaN 971 96.714286 aabsd NaN
>>> output["G-00002"]
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 671 NaN 793 500 aasd 2021-06-13
1 85644 G-00001 2 56.0 4 22 aasd 2021-06-13
2 85643 G-00002 5 78.0 8 89 aabsd 2021-06-14
3 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
4 8564314 G-00002 4 89.0 43 44 aabsd 2021-06-14
5 85645 G-00001 5 73.0 55 55 aasd 2021-06-15
6 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
7 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
8 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
9 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
10 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
11 85649 G-00001 2 34.0 72 2 aasd 2021-06-13
12 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
13 NaN b67 743 NaN 1486 121.833333 aabsd NaN
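If hard-coding the mapping is inconvenient, it could probably be derived from the data instead. The following is only a sketch under one assumption that the question does not state: a slice belongs to an ID when every id1 in the slice occurs among that ID's rows. A looser "any overlap" rule would not work here, because '85643' appears under both IDs. `build_mapper` is a hypothetical helper name, and the frame below keeps only the "id1" and "ID" columns from the question.

```python
import pandas as pd

# Only the two columns needed to decide which slice belongs to which ID.
df = pd.DataFrame({
    "id1": ['85643', '85644', '85643', '8564312', '8564314', '85645', '8564316',
            '85646', '8564318', '85647', '85648', '85649', '85655'],
    "ID": ['G-00001', 'G-00001', 'G-00002', 'G-00002', 'G-00002', 'G-00001', 'G-00002',
           'G-00001', 'G-00002', 'G-00001', 'G-00001', 'G-00001', 'G-00001'],
})

slices = {
    "b65": ['85643', '85645', '85655', '85646'],
    "b66": ['85643', '85645', '85647', '85648', '85649', '85644'],
    "b67": ['85643', '8564312', '8564314', '8564316', '8564318'],
}

def build_mapper(df, slices):
    """Hypothetical helper: map each ID to the slices fully contained in its rows.

    Assumption: a slice belongs to an ID only if all of its id1 values
    appear among the rows of that ID.
    """
    mapper = {}
    for group_id, group in df.groupby("ID"):
        present = set(group["id1"])
        mapper[group_id] = {name: ids for name, ids in slices.items()
                            if set(ids) <= present}
    return mapper

mapper = build_mapper(df, slices)
for group_id, kept in mapper.items():
    print(group_id, list(kept))
```

With the question's data this reproduces the hand-written dictionary, so `mapper[ID]` could stand in for it inside `summarizeValues`. Whether the subset rule matches the intended meaning of "has values in slices" is something to check against the real data.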