Filter rows from a data frame based on the highest index and values from a column
I have the following example. I would like to keep all the rows where ID=5, and where there are multiple rows with ID=3, I would like to keep only the ones with the highest index.
import pandas as pd

data = {'Profession': ['Teacher', 'Banker', 'Teacher', 'Judge', 'lawyer', 'Teacher'], 'Gender': ['Male', 'Male', 'Female', 'Male', 'Male', 'Female'], 'Size': ['M', 'M', 'L', 'S', 'S', 'M'], 'ID': ['5', '3', '3', '3', '5', '3']}
data2 = {'Profession': ['Doctor', 'Scientist', 'Scientist', 'Banker', 'Judge', 'Scientist'], 'Gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Male'], 'Size': ['L', 'M', 'L', 'M', 'L', 'L'], 'ID': ['5', '3', '5', '3', '3', '3']}
data3 = {'Profession': ['Banker', 'Banker', 'Doctor', 'Doctor', 'lawyer', 'Teacher'], 'Gender': ['Male', 'Male', 'Female', 'Female', 'Female', 'Male'], 'Size': ['S', 'M', 'S', 'M', 'L', 'S'], 'ID': ['5', '3', '3', '3', '5', '3']}
data4 = {'Profession': ['Judge', 'Judge', 'Scientist', 'Banker', 'Judge', 'Scientist'], 'Gender': ['Female', 'Female', 'Female', 'Female', 'Female', 'Female'], 'Size': ['M', 'S', 'L', 'S', 'M', 'S'], 'ID': ['3', '5', '3', '3', '5', '3']}
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
df4 = pd.DataFrame(data4)
# Stack the four frames and renumber the rows 0..23
DATA = pd.concat([df, df2, df3, df4])
DATA.reset_index(drop=True, inplace=True)
DATA
I want this: [expected output not shown]
This is just an example. In my real data I have a very large number of rows, so I would like a piece of code that also works for larger data frames.
You can construct a boolean mask that flags the rows following a 3 that are also 3, but leaves the first one of each run.
The bool tests two conditions: (1) the row is equal to 3, and (2) the row above it is also equal to 3.
If we look at the first few rows with a conditional column holding this boolean:
    Profession  Gender  Size  ID  bool_
0   Teacher     Male    M     5   False
1   Banker      Male    M     3   False  <-- fulfills 1st condition but not 2nd, so False
2   Teacher     Female  L     3   True   <-- fulfills conditions 1 & 2
3   Judge       Male    S     3   True   <-- fulfills conditions 1 & 2
4   lawyer      Male    S     5   False
5   Teacher     Female  M     3   False
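df = DATA.copy()  # work on a copy so DATA keeps its string IDs
df['ID'] = df['ID'].astype(int)  # IDs are strings in the question; cast so that .eq(3) can match
# True for every 3 whose row above holds the same value, i.e. the repeats inside a run of 3s
m = df['ID'].eq(3) & df['ID'].eq(df['ID'].shift())
df_new = df[~m]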
    Profession  Gender  Size  ID
0   Teacher     Male    M     5.0
1   Banker      Male    M     3.0
4   lawyer      Male    S     5.0
5   Teacher     Female  M     3.0
6   Doctor      Male    L     5.0
7   Scientist   Male    M     3.0
8   Scientist   Female  L     5.0
9   Banker      Female  M     3.0
12  Banker      Male    S     5.0
13  Banker      Male    M     3.0
16  lawyer      Female  L     5.0
17  Teacher     Male    S     3.0
19  Judge       Female  S     5.0
20  Scientist   Female  L     3.0
22  Judge       Female  M     5.0
23  Scientist   Female  S     3.0
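To make the two conditions concrete, here is a minimal sketch on just the first six IDs from the question, already cast to int. The names `ids`, `is_three` and `same_as_prev` are only for illustration and are not part of the answer above. Note that it keeps the first row of each run of 3s, which is what the output above shows.

```python
import pandas as pd

# First six IDs from the question, as integers (illustrative subset)
ids = pd.Series([5, 3, 3, 3, 5, 3], name="ID")

is_three = ids.eq(3)                # condition 1: the row is 3
same_as_prev = ids.eq(ids.shift())  # condition 2: the row above holds the same value
m = is_three & same_as_prev         # True only for the repeats inside a run of 3s

kept = ids[~m]
print(m.tolist())
print(kept.index.tolist())
```

If the IDs are left as strings, `eq(3)` compares against an integer and matches nothing, so the cast (or comparing with `'3'` instead) matters.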
Use:
data_filtered = DATA.loc[~(DATA['ID'].ne(DATA['ID'].shift()).cumsum().duplicated()
                           & DATA['ID'].eq('3')), :]
print(data_filtered)
    Profession  Gender  Size  ID
0   Teacher     Male    M     5
1   Banker      Male    M     3
4   lawyer      Male    S     5
5   Teacher     Female  M     3
6   Doctor      Male    L     5
7   Scientist   Male    M     3
8   Scientist   Female  L     5
9   Banker      Female  M     3
12  Banker      Male    S     5
13  Banker      Male    M     3
16  lawyer      Female  L     5
17  Teacher     Male    S     3
19  Judge       Female  S     5
20  Scientist   Female  L     3
22  Judge       Female  M     5
23  Scientist   Female  S     3
You could also use ~m from @Manakin's answer:
DATA.loc[~m, :]
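For anyone wondering what the `ne / shift / cumsum / duplicated` chain does, here is a rough step-by-step sketch on the first six IDs from the question. The intermediate names (`starts_run`, `run_id`, `drop_first_kept`, `drop_last_kept`) are made up for illustration. The last part is an assumption about intent: if "highest index" is meant literally, so that the last row of each run should survive, `duplicated(keep='last')` flags the other rows instead.

```python
import pandas as pd

# First six IDs from the question, kept as strings like in DATA
ids = pd.Series(['5', '3', '3', '3', '5', '3'], name="ID")

starts_run = ids.ne(ids.shift())  # True where the value differs from the row above
run_id = starts_run.cumsum()      # same label for every row of one consecutive run

# Rows that repeat a run label are the 2nd, 3rd, ... rows of that run
drop_first_kept = run_id.duplicated() & ids.eq('3')
print(ids[~drop_first_kept].index.tolist())

# Assumption: keep the LAST row of each run of '3' instead of the first
drop_last_kept = run_id.duplicated(keep='last') & ids.eq('3')
print(ids[~drop_last_kept].index.tolist())
```

Everything here is vectorised, so it should carry over to much larger frames without a Python-level loop.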
# Double boolean filter: keep every 5, plus each 3 that directly follows a 5
DATA[(DATA.ID.eq('3') & DATA.ID.shift().eq('5')) | DATA.ID.eq('5')]
    Profession  Gender  Size  ID
0   Teacher     Male    M     5
1   Banker      Male    M     3
4   lawyer      Male    S     5
5   Teacher     Female  M     3
6   Doctor      Male    L     5
7   Scientist   Male    M     3
8   Scientist   Female  L     5
9   Banker      Female  M     3
12  Banker      Male    S     5
13  Banker      Male    M     3
16  lawyer      Female  L     5
17  Teacher     Male    S     3
19  Judge       Female  S     5
20  Scientist   Female  L     3
22  Judge       Female  M     5
23  Scientist   Female  S     3
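One thing to watch with the "3 that follows a 5" filter: it relies on the frame starting with a 5, as the example data does. The sketch below uses a made-up series that starts with a run of 3s (not the question's data) to show the difference, together with a possible variant, `keep_general`, which is my own assumption and not taken from any of the answers.

```python
import pandas as pd

# Hypothetical IDs that START with a run of '3' (not the question's data)
ids = pd.Series(['3', '3', '5', '3', '3'], name="ID")

# Filter from above: every '5', plus each '3' directly after a '5'
keep = (ids.eq('3') & ids.shift().eq('5')) | ids.eq('5')
print(ids[keep].index.tolist())  # the leading run of '3' is dropped entirely

# Assumed variant: keep any non-'3' row, plus every row that starts a new run
keep_general = ids.ne('3') | ids.ne(ids.shift())
print(ids[keep_general].index.tolist())
```

On the question's data both masks should select the same rows, since row 0 there is a 5.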