
Is there a way to remove rows with non-unique values from a data frame without using apply?

I have a large data frame with over a million rows where I would like to drop any row that does not contain all unique values within the row itself.

    0   1   2   4   3
0   13  3   2   0   3 # Want to drop 
1   13  72  2   13  1 # Want to drop
2   13  3   2   8   5

Is there a faster way of achieving the same result as the code below?

df[df.apply(lambda x: x.is_unique, axis=1)]
#     0  1  2  4  3
# 2  13  3  2  8  5

NumPy is known to operate significantly faster than Pandas.

So try the following code:

nCol = df.shape[1]
df[np.apply_along_axis(lambda row: np.unique(row).size == nCol, 1, df.values)]
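A fully vectorized variant can avoid the per-row Python lambda entirely: sort each row, then a row contains all-unique values exactly when no two adjacent sorted elements are equal. This is a sketch, not the answerer's code; the sample DataFrame below just reproduces the example from the question.

```python
import numpy as np
import pandas as pd

# Sample data from the question (column labels 0, 1, 2, 4, 3).
df = pd.DataFrame([[13, 3, 2, 0, 3],
                   [13, 72, 2, 13, 1],
                   [13, 3, 2, 8, 5]],
                  columns=[0, 1, 2, 4, 3])

# Sort each row; a row has all-unique values iff no adjacent
# sorted elements are equal.
a = np.sort(df.values, axis=1)
mask = (a[:, 1:] != a[:, :-1]).all(axis=1)

result = df[mask]
print(result)
```

Because both the sort and the comparison run in compiled NumPy code over the whole array at once, this should scale better on a million-row frame than any approach that calls a Python function per row.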

My comparison of execution times, using %timeit, indicates that my code is about 3 times faster than yours.

For a bigger source DataFrame this difference can be even greater. Check on your own and then post the result in a comment.
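A comparison along these lines can be reproduced with the standard-library `timeit` module instead of the IPython `%timeit` magic. The DataFrame shape and random data below are illustrative assumptions, not the asker's actual data:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 10,000 rows of small random integers,
# so that duplicate values within a row are common.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 10, size=(10_000, 5)))

nCol = df.shape[1]

def with_apply():
    # The original row-wise pandas approach.
    return df[df.apply(lambda x: x.is_unique, axis=1)]

def with_numpy():
    # The np.apply_along_axis approach from the answer.
    return df[np.apply_along_axis(lambda row: np.unique(row).size == nCol,
                                  1, df.values)]

# Sanity check: both approaches must select the same rows
# before the timings mean anything.
assert with_apply().equals(with_numpy())

t_apply = timeit.timeit(with_apply, number=3)
t_numpy = timeit.timeit(with_numpy, number=3)
print(f"apply: {t_apply:.3f}s  numpy: {t_numpy:.3f}s")
```

Absolute timings will vary by machine and pandas/NumPy version, which is presumably why the answer suggests checking on your own data.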

By the way, I also checked the solution proposed by enke, but it seems to be slower than your code.

