
How do I remove outliers from a pandas DataFrame that has both numerical and non-numerical data

I have a DataFrame (cgf) that looks as follows, and I want to remove the outliers from the numerical columns only:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   Product        180 non-null    object  
 1   Age            180 non-null    int64   
 2   Gender         180 non-null    object  
 3   Education      180 non-null    category
 4   MaritalStatus  180 non-null    object  
 5   Usage          180 non-null    int64   
 6   Fitness        180 non-null    category
 7   Income         180 non-null    int64   
 8   Miles          180 non-null    int64   
dtypes: category(2), int64(4), object(3)

I tried several scripts using the z-score and IQR methods, but none of them worked. For example, here is a z-score script that failed:

from scipy import stats
import numpy as np
z = np.abs(stats.zscore(cgf))   # get the z-score of every value with respect to their columns
print(z)

I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-102-2759aa3fbd60> in <module>
----> 1 z = np.abs(stats.zscore(cgf))   # get the z-score of every value with respect to their columns
      2 print(z)

~\anaconda3\lib\site-packages\scipy\stats\stats.py in zscore(a, axis, ddof, nan_policy)
   2495         sstd = np.nanstd(a=a, axis=axis, ddof=ddof, keepdims=True)
   2496     else:
-> 2497         mns = a.mean(axis=axis, keepdims=True)
   2498         sstd = a.std(axis=axis, ddof=ddof, keepdims=True)
   2499 

~\anaconda3\lib\site-packages\numpy\core\_methods.py in _mean(a, axis, dtype, out, keepdims)
    160     ret = umr_sum(arr, axis, dtype, out, keepdims)
    161     if isinstance(ret, mu.ndarray):
--> 162         ret = um.true_divide(
    163                 ret, rcount, out=ret, casting='unsafe', subok=False)
    164         if is_float16_result and out is None:

TypeError: unsupported operand type(s) for /: 'str' and 'int'

Here is the IQR method I tried, which also failed:

np.where((cgf < (Q1 - 1.5 * IQR)) | (cgf > (Q3 + 1.5 * IQR)))

Error message:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-96-bb3dfd2ce6c5> in <module>
----> 1 np.where((cgf < (Q1 - 1.5 * IQR)) | (cgf > (Q3 + 1.5 * IQR)))

~\anaconda3\lib\site-packages\pandas\core\ops\__init__.py in f(self, other)
    702 
    703         # See GH#4537 for discussion of scalar op behavior
--> 704         new_data = dispatch_to_series(self, other, op, axis=axis)
    705         return self._construct_result(new_data)
    706 

~\anaconda3\lib\site-packages\pandas\core\ops\__init__.py in dispatch_to_series(left, right, func, axis)
    273         #  _frame_arith_method_with_reindex
    274 
--> 275         bm = left._mgr.operate_blockwise(right._mgr, array_op)
    276         return type(left)(bm)
    277 

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in operate_blockwise(self, other, array_op)
    362         Apply array_op blockwise with another (aligned) BlockManager.
    363         """
--> 364         return operate_blockwise(self, other, array_op)
    365 
    366     def apply(self: T, f, align_keys=None, **kwargs) -> T:

~\anaconda3\lib\site-packages\pandas\core\internals\ops.py in operate_blockwise(left, right, array_op)
     36             lvals, rvals = _get_same_shape_values(blk, rblk, left_ea, right_ea)
     37 
---> 38             res_values = array_op(lvals, rvals)
     39             if left_ea and not right_ea and hasattr(res_values, "reshape"):
     40                 res_values = res_values.reshape(1, -1)

~\anaconda3\lib\site-packages\pandas\core\ops\array_ops.py in comparison_op(left, right, op)
    228     if should_extension_dispatch(lvalues, rvalues):
    229         # Call the method on lvalues
--> 230         res_values = op(lvalues, rvalues)
    231 
    232     elif is_scalar(rvalues) and isna(rvalues):

~\anaconda3\lib\site-packages\pandas\core\ops\common.py in new_method(self, other)
     63         other = item_from_zerodim(other)
     64 
---> 65         return method(self, other)
     66 
     67     return new_method

~\anaconda3\lib\site-packages\pandas\core\arrays\categorical.py in func(self, other)
     74         if not self.ordered:
     75             if opname in ["__lt__", "__gt__", "__le__", "__ge__"]:
---> 76                 raise TypeError(
     77                     "Unordered Categoricals can only compare equality or not"
     78                 )

TypeError: Unordered Categoricals can only compare equality or not

How do I resolve these errors? It appears that the combination of categorical and numerical data in my DataFrame is causing the problem, but I am a newbie and don't know how to fix it so that I can remove the outliers.
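Both tracebacks above come from applying a numeric operation to the object and category columns. One possible way around this, sketched here under assumptions, is to compute the z-scores on the numeric columns only and then use the result to filter the full DataFrame. The small `cgf` built below is a hypothetical stand-in for the real data, and the cutoff of 3 is a common convention, not something given in the question.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for cgf: two numeric columns, one object column,
# and one category column (like Education in the question).
ages = [22, 25, 24, 23, 26] * 4
ages[-1] = 95  # one obvious outlier in the last row
cgf = pd.DataFrame({
    "Product": ["A", "B", "C", "A", "B"] * 4,
    "Age": ages,
    "Education": pd.Categorical([14, 16, 16, 18, 14] * 4),
    "Miles": [60, 75, 85, 90, 100] * 4,
})

# Select only the numeric columns; object and category columns are skipped.
num_cols = cgf.select_dtypes(include="number").columns

# Column-wise z-scores, computed on a plain NumPy array.
z = np.abs(stats.zscore(cgf[num_cols].to_numpy(), axis=0))

# Keep a row only if every numeric value is within 3 standard deviations.
keep = (z < 3).all(axis=1)
cgf_clean = cgf[keep]

print(cgf.shape, "->", cgf_clean.shape)
```

The non-numeric columns are never passed to `stats.zscore`, but they are still present in `cgf_clean` because the boolean mask is applied to the original DataFrame.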

For example, if you drop outliers based on the 'Age' column, the change made for this column is reflected in the whole data frame, i.e., the entire row containing the outlier is dropped.
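The row-dropping behaviour can be illustrated with an IQR filter restricted to the numeric columns. This is a minimal sketch, not a definitive implementation: the sample `cgf` is hypothetical, and the 1.5 × IQR fences are the usual convention taken from the question's own attempt.

```python
import pandas as pd

# Hypothetical stand-in for cgf with one outlier in Age (last row)
# and one outlier in Income (first row).
ages = [22, 25, 24, 23, 26] * 4
ages[-1] = 95
income = [30000, 45000, 50000, 55000, 60000] * 4
income[0] = 400000
cgf = pd.DataFrame({
    "Product": ["A", "B", "C", "A", "B"] * 4,
    "Gender": ["Male", "Female"] * 10,
    "Age": ages,
    "Income": income,
})

# Work on the numeric columns only to avoid the TypeErrors above.
num = cgf.select_dtypes(include="number")

Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1

# True where a value lies outside the 1.5 * IQR fences of its column.
outlier = (num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))

# Drop every row that has an outlier in any numeric column.
cgf_clean = cgf[~outlier.any(axis=1)]

print(cgf.shape, "->", cgf_clean.shape)
```

Here the rows with index 0 and 19 are removed in full, including their 'Product' and 'Gender' values, even though each row is an outlier in only one numeric column.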


Reference: towardsdatascience

Reference: how-to-remove-outliers
