How do I remove outliers from a pandas DataFrame that has both numerical and non-numerical data?
I have a DataFrame (cgf) that looks as follows, and I want to remove outliers from the numerical columns only:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Product        180 non-null    object
 1   Age            180 non-null    int64
 2   Gender         180 non-null    object
 3   Education      180 non-null    category
 4   MaritalStatus  180 non-null    object
 5   Usage          180 non-null    int64
 6   Fitness        180 non-null    category
 7   Income         180 non-null    int64
 8   Miles          180 non-null    int64
dtypes: category(2), int64(4), object(3)
I tried several scripts using the z-score and IQR methods, but none of them worked. For example, here is a z-score script that failed:
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(cgf)) # get the z-score of every value with respect to their columns
print(z)
I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-102-2759aa3fbd60> in <module>
----> 1 z = np.abs(stats.zscore(cgf)) # get the z-score of every value with respect to their columns
2 print(z)
~\anaconda3\lib\site-packages\scipy\stats\stats.py in zscore(a, axis, ddof, nan_policy)
2495 sstd = np.nanstd(a=a, axis=axis, ddof=ddof, keepdims=True)
2496 else:
-> 2497 mns = a.mean(axis=axis, keepdims=True)
2498 sstd = a.std(axis=axis, ddof=ddof, keepdims=True)
2499
~\anaconda3\lib\site-packages\numpy\core\_methods.py in _mean(a, axis, dtype, out, keepdims)
160 ret = umr_sum(arr, axis, dtype, out, keepdims)
161 if isinstance(ret, mu.ndarray):
--> 162 ret = um.true_divide(
163 ret, rcount, out=ret, casting='unsafe', subok=False)
164 if is_float16_result and out is None:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
Here is the IQR method I tried, which also failed:
np.where((cgf < (Q1 - 1.5 * IQR)) | (cgf > (Q3 + 1.5 * IQR)))
Error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-96-bb3dfd2ce6c5> in <module>
----> 1 np.where((cgf < (Q1 - 1.5 * IQR)) | (cgf > (Q3 + 1.5 * IQR)))
~\anaconda3\lib\site-packages\pandas\core\ops\__init__.py in f(self, other)
702
703 # See GH#4537 for discussion of scalar op behavior
--> 704 new_data = dispatch_to_series(self, other, op, axis=axis)
705 return self._construct_result(new_data)
706
~\anaconda3\lib\site-packages\pandas\core\ops\__init__.py in dispatch_to_series(left, right, func, axis)
273 # _frame_arith_method_with_reindex
274
--> 275 bm = left._mgr.operate_blockwise(right._mgr, array_op)
276 return type(left)(bm)
277
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in operate_blockwise(self, other, array_op)
362 Apply array_op blockwise with another (aligned) BlockManager.
363 """
--> 364 return operate_blockwise(self, other, array_op)
365
366 def apply(self: T, f, align_keys=None, **kwargs) -> T:
~\anaconda3\lib\site-packages\pandas\core\internals\ops.py in operate_blockwise(left, right, array_op)
36 lvals, rvals = _get_same_shape_values(blk, rblk, left_ea, right_ea)
37
---> 38 res_values = array_op(lvals, rvals)
39 if left_ea and not right_ea and hasattr(res_values, "reshape"):
40 res_values = res_values.reshape(1, -1)
~\anaconda3\lib\site-packages\pandas\core\ops\array_ops.py in comparison_op(left, right, op)
228 if should_extension_dispatch(lvalues, rvalues):
229 # Call the method on lvalues
--> 230 res_values = op(lvalues, rvalues)
231
232 elif is_scalar(rvalues) and isna(rvalues):
~\anaconda3\lib\site-packages\pandas\core\ops\common.py in new_method(self, other)
63 other = item_from_zerodim(other)
64
---> 65 return method(self, other)
66
67 return new_method
~\anaconda3\lib\site-packages\pandas\core\arrays\categorical.py in func(self, other)
74 if not self.ordered:
75 if opname in ["__lt__", "__gt__", "__le__", "__ge__"]:
---> 76 raise TypeError(
77 "Unordered Categoricals can only compare equality or not"
78 )
TypeError: Unordered Categoricals can only compare equality or not
How do I resolve these errors? It appears that the combination of categorical and numerical data in my DataFrame is causing the problem, but I am new to this and don't know how to fix it so that I can remove the outliers.
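Both tracebacks come from applying a numeric operation to every column, including the object and category ones. One possible way around this is to compute the z-scores on the numeric columns only and then use the result to filter rows of the full DataFrame. The following is a minimal sketch, not a definitive fix: the toy cgf (its values and product labels) is invented for illustration, and the cutoff of 3 standard deviations is a common convention assumed here, not something stated in the question.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy stand-in for cgf; all values are invented for illustration.
cgf = pd.DataFrame({
    "Product": ["P1", "P2", "P3"] * 7,                       # object column
    "Age": [22, 23, 24, 25, 26] * 4 + [90],                  # 90 is the planted outlier
    "Fitness": pd.Categorical([1, 2, 3] * 7),                # category column
    "Income": [40000, 45000, 50000, 55000, 60000] * 4 + [52000],
})

# Keep only the numeric columns; object and category columns are skipped.
num_cols = cgf.select_dtypes(include="number").columns

# Absolute z-score of every numeric value with respect to its own column.
z = np.abs(stats.zscore(cgf[num_cols].to_numpy(), axis=0))

# Assumed cutoff: a row is kept only if all of its numeric values
# lie within 3 standard deviations of their column mean.
keep = (z < 3).all(axis=1)

# Filter the full DataFrame, so non-numeric columns are preserved.
cgf_clean = cgf[keep]
print(cgf_clean.shape)
```

The boolean mask is built from the numeric columns but applied to the whole DataFrame, so the non-numeric columns stay in the result.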
For example, if you drop outliers based on the 'Age' column, the change is reflected in the whole DataFrame; that is, each entire row containing an outlier is dropped.
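The same row-dropping idea can be sketched with the IQR rule, again restricted to the numeric columns so that the unordered categoricals are never compared. This is a sketch under assumptions: the toy cgf is invented for illustration, and the 1.5 * IQR fences are taken from the expression in the question.

```python
import pandas as pd

# Toy stand-in for cgf; all values are invented for illustration.
cgf = pd.DataFrame({
    "Product": ["P1", "P2", "P3"] * 7,                       # object column
    "Age": [22, 23, 24, 25, 26] * 4 + [90],                  # 90 is the planted outlier
    "Fitness": pd.Categorical([1, 2, 3] * 7),                # category column
    "Income": [40000, 45000, 50000, 55000, 60000] * 4 + [52000],
})

# Work on the numeric columns only.
num = cgf.select_dtypes(include="number")

# Per-column quartiles and interquartile range.
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1

# True where a value falls outside the 1.5 * IQR fences of its column.
is_outlier = (num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))

# Drop every row that has an outlier in any numeric column;
# the non-numeric columns of the remaining rows are kept.
cgf_clean = cgf[~is_outlier.any(axis=1)]
print(cgf_clean.shape)
```

To drop outliers for a single column such as 'Age', the mask can be limited to that column, for example `cgf[~is_outlier["Age"]]`; the whole row is still removed.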
Reference: towardsdatascience
Reference: how-to-remove-outliers