I have a large DataFrame with over a million rows, and I would like to drop every row whose values are not all unique within that row.
    0   1  2   4  3
0  13   3  2   0  3  # want to drop (duplicate 3)
1  13  72  2  13  1  # want to drop (duplicate 13)
2  13   3  2   8  5
Is there a faster way of achieving the same result as the code below?
df[df.apply(lambda x: x.is_unique, axis=1)]
# 0 1 2 4 3
# 2 13 3 2 8 5
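For reference, here is a minimal runnable version of the setup above, with the example frame reconstructed from the question (including its out-of-order column labels):

```python
import pandas as pd

# Reconstruct the small example frame from the question
df = pd.DataFrame(
    [[13, 3, 2, 0, 3],
     [13, 72, 2, 13, 1],
     [13, 3, 2, 8, 5]],
    columns=[0, 1, 2, 4, 3],
)

# Keep only rows whose values are all distinct within the row
result = df[df.apply(lambda x: x.is_unique, axis=1)]
print(result)  # only row 2 (13, 3, 2, 8, 5) survives
```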
NumPy operations are generally significantly faster than their Pandas equivalents, so try the following code:
import numpy as np

# A row is kept only if the number of its unique values equals the column count
nCol = df.shape[1]
df[np.apply_along_axis(lambda row: np.unique(row).size == nCol, 1, df.values)]
My comparison of execution times using %timeit indicates that my code is about 3 times faster than yours. For a bigger source DataFrame the difference can be even greater. Check on your own and then post the result in a comment.
By the way, I also checked the solution proposed by enke, but it seems to be slower than your code.
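As a further sketch (not benchmarked here), the per-row Python call that np.apply_along_axis still makes can be avoided entirely by sorting each row and checking adjacent elements for equality; a duplicate in a row always appears as two equal neighbours after sorting:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[13, 3, 2, 0, 3],
     [13, 72, 2, 13, 1],
     [13, 3, 2, 8, 5]],
    columns=[0, 1, 2, 4, 3],
)

# Sort each row independently; duplicates become equal neighbours
a = np.sort(df.to_numpy(), axis=1)

# A row passes only if no two adjacent sorted values are equal
mask = (a[:, 1:] != a[:, :-1]).all(axis=1)
result = df[mask]
print(result)  # only the row with all-distinct values remains
```

This stays entirely inside vectorized NumPy, so for a million-row frame it should scale better than either apply-based variant.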