I have seen similar questions, but none of them answers mine. For example, I have a pandas DataFrame with the columns 'A', 'B', 'C', 'D' and 'E'. First, I want to keep a row if its combination of values in 'A', 'B', 'C' and 'D' differs from that of every other row. Also, if several rows are identical in all columns except 'E', I would like to keep the row where 'E' is largest and drop the others. For instance, suppose two (or more) rows have the same values in 'A', 'B', 'C' and 'D', but 'E' is 10 in one row and 12 in another. Then I want to keep the row with 12 and drop the other one.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1,3,size=(10, 5)), columns=list('ABCDE'))
df
Out[3]:
   A  B  C  D  E
0  2  2  1  2  2
1  1  2  1  2  2
2  2  1  2  1  2
3  1  2  1  1  1
4  1  2  1  2  2
5  1  2  2  1  1
6  2  2  2  2  2
7  1  1  1  2  2
8  2  1  1  2  2
9  1  1  1  2  1
# sort by column 'E', largest to smallest
df.sort_values(by=['E'], ascending=False)
Out[4]:
   A  B  C  D  E
0  2  2  1  2  2
1  1  2  1  2  2
2  2  1  2  1  2
4  1  2  1  2  2
6  2  2  2  2  2
7  1  1  1  2  2
8  2  1  1  2  2
3  1  2  1  1  1
5  1  2  2  1  1
9  1  1  1  2  1
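# drop duplicate rows on columns 'A', 'B', 'C', and 'D', keeping the first (largest 'E') of each group
# sort_values returns a new frame rather than sorting df in place, so chain the two calls
df.sort_values(by=['E'], ascending=False).drop_duplicates(subset=['A', 'B', 'C', 'D'], keep='first')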
Out[5]:
   A  B  C  D  E
0  2  2  1  2  2
1  1  2  1  2  2
2  2  1  2  1  2
6  2  2  2  2  2
7  1  1  1  2  2
8  2  1  1  2  2
3  1  2  1  1  1
5  1  2  2  1  1
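For anyone who wants to reproduce this without random data, here is a minimal sketch of the same sort-then-drop approach as one script. It hard-codes the ten rows shown above; the helper name `keep_largest` and the `groupby` cross-check are my own additions for illustration, not part of the steps above. I also pass `kind='mergesort'` so that ties in 'E' keep their original order, which is an assumption about what you want for exact ties.

```python
import pandas as pd

# The ten rows from the example above, hard-coded so the result is reproducible.
df = pd.DataFrame(
    [
        [2, 2, 1, 2, 2],
        [1, 2, 1, 2, 2],
        [2, 1, 2, 1, 2],
        [1, 2, 1, 1, 1],
        [1, 2, 1, 2, 2],
        [1, 2, 2, 1, 1],
        [2, 2, 2, 2, 2],
        [1, 1, 1, 2, 2],
        [2, 1, 1, 2, 2],
        [1, 1, 1, 2, 1],
    ],
    columns=list("ABCDE"),
)


def keep_largest(frame, keys, value):
    """Hypothetical helper: keep one row per unique `keys` combination,
    choosing the row whose `value` column is largest."""
    return (
        # Largest `value` first; mergesort is stable, so tied rows keep their order.
        frame.sort_values(by=value, ascending=False, kind="mergesort")
        # The first row seen for each key combination is the one with the largest `value`.
        .drop_duplicates(subset=keys, keep="first")
    )


result = keep_largest(df, ["A", "B", "C", "D"], "E")

# Independent cross-check: the maximum of 'E' within each (A, B, C, D) group.
expected = df.groupby(["A", "B", "C", "D"])["E"].max()
```

In this data, rows 7 and 9 share the same 'A' to 'D' values, so row 9 (E = 1) is dropped in favour of row 7 (E = 2). Rows 1 and 4 are identical in every column, so it makes no difference which of the two survives. If you want the original row order back afterwards, calling `result.sort_index()` should do it.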