
Python: how to find outliers in a specific column in a dataframe

I am trying to remove outliers from a specific column of my dataframe in Python. I found a solution in a post from a few years ago that should work, but it checks every column of the dataframe:

df_final[(np.abs(stats.zscore(df_final)) < 3).all(axis=1)]

Since my dataframe contains other data types, such as dates, I get the following error when I run it:

TypeError: unsupported operand type(s) for +: 'Timestamp' and 'Timestamp'

I feel like the solution to just get the outliers for a single column should be easy, but when I try

df_final[(np.abs(stats.zscore(df_final['rating'])) < 3).all(axis=1)]

to get the outliers of only the rating column, I get an error

AxisError: axis 1 is out of bounds for array of dimension 1

I know (think?) that this problem has to do with the array that is created, but I don't understand it well enough to find a solution. Can anyone better explain it to me?

EDIT: It seems that df_final[(np.abs(stats.zscore(df_final['rating'])) < 3)] works. Honestly, I'm not sure of the reasoning behind it, so I'm still interested if anyone can explain it or has a better solution.

np.abs(stats.zscore(df_final['rating'])) < 3

This expression returns a one-dimensional boolean array with one True or False value per row. Passing that array to df_final[...] keeps only the rows where the value is True (boolean indexing).
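To make the mask concrete, here is a minimal sketch. The dataframe contents are made up for illustration (only the names df_final and rating come from the question), and the mask is converted with np.asarray so the example does not depend on the exact type stats.zscore returns.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: 19 ordinary ratings plus one extreme value,
# and a date column like the one that broke the whole-dataframe version.
df_final = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=20, freq="D"),
    "rating": [3.0, 4.0] * 9 + [3.5, 100.0],
})

# One z-score per row, so the comparison gives a 1-D boolean mask.
mask = np.asarray(np.abs(stats.zscore(df_final["rating"])) < 3)

# Boolean indexing keeps the rows where the mask is True.
filtered = df_final[mask]
```

Because the mask already has exactly one entry per row, there is nothing left to reduce, and the date column never enters the calculation.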

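As for numpy.all, please refer to the doc. It reduces a boolean array along an axis: in the whole-dataframe version, .all(axis=1) collapses the per-column results into a single True/False per row. With a single column the result is already one-dimensional, so there is no axis 1 to reduce over (hence the AxisError), and the call is not needed for the slicing.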
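If you do want the multi-column version, one possible approach (not from the original post) is to restrict the z-score to the numeric columns first, so the Timestamp column is never touched. This is a sketch under assumptions: the data and the price column are hypothetical, and select_dtypes is just one way to pick the columns.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical dataframe with one non-numeric and two numeric columns.
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=20, freq="D"),
    "rating": [3.0, 4.0] * 9 + [3.5, 100.0],
    "price": [10.0, 12.0] * 10,
})

# Keep only numeric columns so zscore never sees the Timestamp column.
numeric = df.select_dtypes(include="number")

# 2-D array of absolute z-scores: one row per record, one column per feature.
z = np.abs(stats.zscore(numeric.to_numpy(), axis=0))

# Here .all(axis=1) is meaningful: a row is kept only if every column is within 3.
keep = (z < 3).all(axis=1)
result = df[keep]
```

Here axis 1 exists because z is two-dimensional, which is exactly what the single-column case lacks.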
