[英]In pandas data frame how to remove some summarized duplicates rows
我有以下 pandas 数据框。
d = {'id1': ['85643', '85644','85643','8564312','8564314','85645','8564316','85646','8564318','85647','85648','85649','85655'],'ID': ['G-00001', 'G-00001','G-00002','G-00002','G-00002','G-00001','G-00002','G-00001','G-00002','G-00001','G-00001','G-00001','G-00001'],'col1': [671, 2,5,3,4,5,60,0,0,6,3,2,4],'Goal': [np.nan, 56,78,np.nan,89,73,np.nan ,np.nan ,np.nan, np.nan, np.nan, 34,np.nan ], 'col2': [793, 4,8,32,43,55,610,0,0,16,23,72,48],'col3': [500, 22,89,33,44,55,60,1,5,6,3,2,4],'Name': ['aasd', 'aasd','aabsd','aabsd','aabsd','aasd','aabsd','aasd','aabsd','aasd','aasd','aasd','aasd'],'Date': ['2021-06-13', '2021-06-13','2021-06-14','2021-06-13','2021-06-14','2021-06-15','2021-06-15','2021-06-13','2021-06-16','2021-06-13','2021-06-13','2021-06-13','2021-06-16']}
dff = pd.DataFrame(data=d)
dff
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 671 NaN 793 500 aasd 2021-06-13
1 85644 G-00001 2 56.0000 4 22 aasd 2021-06-13
2 85643 G-00002 5 78.0000 8 89 aabsd 2021-06-14
3 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
4 8564314 G-00002 4 89.0000 43 44 aabsd 2021-06-14
5 85645 G-00001 5 73.0000 55 55 aasd 2021-06-15
6 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
7 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
8 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
9 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
10 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
11 85649 G-00001 2 34.0000 72 2 aasd 2021-06-13
12 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
我想总结一些列,并根据“id1”列中的一些 id 将它们添加回同一个数据框。 然后,当我们添加该行时,我想给“ID”列一个新名称。
基于下面的“id1”列ID,我只想总结“col1”、“col2”、“col3”和“Name”列。 然后我想将该行添加回同一个数据框,并为“ID”列提供一个新的 id。 但是,我想通过“ID”列进行计算。 (如群比)
在下面的 function 中,我将 col1、col2 和 col3 的总和与平均值相加。
ID_list = ['G-00001','G-00002']
def sumarizeValues(Filter,Orginal):
b65 = ['85643','85645', '85655','85646'] # for G-00001
b66 = ['85643','85645','85647','85648','85649','85644'] # for G-00001
b67 = ['85643','8564312','8564314','8564316','8564318'] # for G-00002
# create a dictionary
d_map = {'b65': b65, 'b66': b66, 'b67': b67}
# dictionary comprehension
df = pd.DataFrame({k: dff[dff['id1'].isin(v)].agg({'col1': sum, 'col2': sum,
'col3': 'mean', 'Name': min})
for k,v in d_map.items()}).T.reset_index()
# rename the columns
df = df.rename(columns={'index': 'ID'})
# concat the two frames
#pd.concat([dff, df]).reset_index(drop=True)
Orginal = pd.concat([Orginal, df]).reset_index(drop=True)
return Orginal
如果 ID 在切片中有值,我只想创建一个汇总行。 例如,首先在 ID_list 中,我使用“G-00001”并基于 id1 切片器(b65,b66,b67)创建汇总行。 但是,我创建的 function 给了我一些额外的行,如下所示。 我怎样才能消除这些行?
所以最终的数据框如下所示
ID_list = ['G-00001','G-00002']
def abcFunction(dff):
for ID in ID_list:
print(ID)
IDlist =[ID]
print(IDlist)
Filter = dff[dff['ID'].isin(IDlist)]
dff = sumarizeValues(Filter,dff)
print(dff)
## calculation
## calculation
## calculation
abcFunction(dff)
# So for the first ID (G-00001), I actually don't need the last row(15th index containing b67).
# I only need that row for the G-00002 calcualtion.
G-00001
['G-00001']
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 671 NaN 793 500 aasd 2021-06-13
1 85644 G-00001 2 56.0000 4 22 aasd 2021-06-13
2 85643 G-00002 5 78.0000 8 89 aabsd 2021-06-14
3 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
4 8564314 G-00002 4 89.0000 43 44 aabsd 2021-06-14
5 85645 G-00001 5 73.0000 55 55 aasd 2021-06-15
6 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
7 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
8 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
9 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
10 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
11 85649 G-00001 2 34.0000 72 2 aasd 2021-06-13
12 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
13 NaN b65 685 NaN 904 129.8000 aabsd NaN
14 NaN b66 694 NaN 971 96.7143 aabsd NaN
15 NaN b67 743 NaN 1486 121.8333 aabsd NaN
# When I ran it for G-00002, it actually contains all the other rows created.
#So for the second ID (G-00002), I actually don't need the row(13th index to 17th index
#containing b65,b66, and b67 in index 15). Because G-00002 doesn't
#contain any values in b65,b66.
G-00002
['G-00002']
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 671 NaN 793 500 aasd 2021-06-13
1 85644 G-00001 2 56.0000 4 22 aasd 2021-06-13
2 85643 G-00002 5 78.0000 8 89 aabsd 2021-06-14
3 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
4 8564314 G-00002 4 89.0000 43 44 aabsd 2021-06-14
5 85645 G-00001 5 73.0000 55 55 aasd 2021-06-15
6 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
7 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
8 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
9 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
10 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
11 85649 G-00001 2 34.0000 72 2 aasd 2021-06-13
12 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
13 NaN b65 685 NaN 904 129.8000 aabsd NaN
14 NaN b66 694 NaN 971 96.7143 aabsd NaN
15 NaN b67 743 NaN 1486 121.8333 aabsd NaN
16 NaN b65 685 NaN 904 129.8000 aabsd NaN
17 NaN b66 694 NaN 971 96.7143 aabsd NaN
18 NaN b67 743 NaN 1486 121.8333 aabsd NaN
有可能这样做吗? 任何帮助表示赞赏! 提前致谢!
更改映射字典,以便将每个 ID 映射到所需的新 ID。 尝试:
def summarizeValues(df, ID):
mapper = {"G-00001": {"b65": ['85643', '85645', '85655','85646'],
"b66": ['85643', '85645', '85647', '85648', '85649', '85644']},
"G-00002": {"b67": ['85643', '8564312', '8564314', '8564316', '8564318']}
}
# dictionary comprehension
dff = pd.DataFrame({k: df[df['id1'].isin(v)].agg({'col1': sum,
'col2': sum,
'col3': 'mean',
'Name': min})
for k,v in mapper[ID].items()})
dff = dff.T.reset_index().rename(columns={'index': 'ID'})
return pd.concat([df, dff]).reset_index(drop=True)
output = dict()
for ID in ['G-00001','G-00002']:
output[ID] = summarizeValues(df, ID)
>>> output["G-00001"]
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 671 NaN 793 500 aasd 2021-06-13
1 85644 G-00001 2 56.0 4 22 aasd 2021-06-13
2 85643 G-00002 5 78.0 8 89 aabsd 2021-06-14
3 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
4 8564314 G-00002 4 89.0 43 44 aabsd 2021-06-14
5 85645 G-00001 5 73.0 55 55 aasd 2021-06-15
6 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
7 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
8 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
9 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
10 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
11 85649 G-00001 2 34.0 72 2 aasd 2021-06-13
12 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
13 NaN b65 685 NaN 904 129.8 aabsd NaN
14 NaN b66 694 NaN 971 96.714286 aabsd NaN
>>> output["G-00002"]
id1 ID col1 Goal col2 col3 Name Date
0 85643 G-00001 671 NaN 793 500 aasd 2021-06-13
1 85644 G-00001 2 56.0 4 22 aasd 2021-06-13
2 85643 G-00002 5 78.0 8 89 aabsd 2021-06-14
3 8564312 G-00002 3 NaN 32 33 aabsd 2021-06-13
4 8564314 G-00002 4 89.0 43 44 aabsd 2021-06-14
5 85645 G-00001 5 73.0 55 55 aasd 2021-06-15
6 8564316 G-00002 60 NaN 610 60 aabsd 2021-06-15
7 85646 G-00001 0 NaN 0 1 aasd 2021-06-13
8 8564318 G-00002 0 NaN 0 5 aabsd 2021-06-16
9 85647 G-00001 6 NaN 16 6 aasd 2021-06-13
10 85648 G-00001 3 NaN 23 3 aasd 2021-06-13
11 85649 G-00001 2 34.0 72 2 aasd 2021-06-13
12 85655 G-00001 4 NaN 48 4 aasd 2021-06-16
13 NaN b67 743 NaN 1486 121.833333 aabsd NaN
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.