Python: how to find outliers in a specific column in a dataframe

Question

I am trying to remove outliers from a specific column in my dataframe in Python. I found a solution in a post from a few years ago that should work, but it searches through the entire dataframe:

df_final[(np.abs(stats.zscore(df_final)) < 3).all(axis=1)]

Since my dataframe has different data types, such as dates, I am getting the following error when I run it

TypeError: unsupported operand type(s) for +: 'Timestamp' and 'Timestamp'

I feel like the solution to just get the outliers for a single column should be easy, but when I try

df_final[(np.abs(stats.zscore(df_final['rating'])) < 3).all(axis=1)]

to get the outliers of only the rating column, I get an error

AxisError: axis 1 is out of bounds for array of dimension 1

I know (think?) that this problem has to do with the array that is created, but I don't understand it well enough to find a solution. Can anyone better explain it to me?

EDIT: It seems that df_final[(np.abs(stats.zscore(df_final['rating'])) < 3)] works. Honestly not sure the reasoning behind it, so I'm still interested if anyone can explain or has a better solution.

Answer 1

np.abs(stats.zscore(df_final['rating'])) < 3

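This expression returns a one-dimensional boolean array with one True or False value per row of df_final. Passing that array to df_final[...] keeps only the rows where the value is True, which is why the version in your edit works.

You do not need numpy.all here (see the NumPy documentation for details). In the original snippet, .all(axis=1) collapses a two-dimensional result, with one column of z-scores per dataframe column, into a single True/False per row. With a single column the result is already one-dimensional, so there is no axis 1 to reduce over, which is what the AxisError is telling you.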
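To make the single-column case concrete, here is a minimal sketch. The data is made up for illustration (19 ordinary ratings plus one extreme value, and a date column like the one that broke the whole-dataframe version), and the z-score is computed by hand with NumPy using the population standard deviation, which is the same default (ddof=0) that scipy.stats.zscore uses.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 19 ordinary ratings, one extreme value, and a date column.
df_final = pd.DataFrame({
    "rating": [4.0] * 10 + [5.0] * 9 + [100.0],
    "date": pd.date_range("2019-01-01", periods=20),
})

# Z-score of the rating column only; the date column is never touched.
ratings = df_final["rating"].to_numpy()
z = (ratings - ratings.mean()) / ratings.std()  # ddof=0, like stats.zscore

# 1-D boolean mask: one True/False per row, so no .all(axis=1) is needed.
mask = np.abs(z) < 3

df_clean = df_final[mask]    # rows whose rating is within 3 standard deviations
outliers = df_final[~mask]   # rows flagged as outliers
```

Because the mask already has one entry per row, it can index the dataframe directly. Calling mask.all(axis=1) on it would fail for the same reason as in the question: a one-dimensional array has no axis 1.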
