How can I clean my dataset of outliers when it includes both numerical and categorical variables in Python?

I would like to remove outliers from my dataset, but only in three specific columns, as the other 10 contain categorical variables. So how can I clean my data by referring only to these specific columns?

I'd like to use the IQR (interquartile range) method. This is the code I have run so far:

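import numpy as np

def outliers(x):
    # Flags values that lie more than 1.5 * IQR away from the median
    return np.abs(x - x.median()) > 1.5 * (x.quantile(0.75) - x.quantile(0.25))

# Show the flagged values in each column of interest
ath2.Age[outliers(ath2.Age)]
ath2.Height[outliers(ath2.Height)]
ath2.Weight[outliers(ath2.Weight)]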

After checking the number of outliers in the columns I'm interested in, I don't know how to proceed.

If you want the code to be dynamic, you can first find the columns that are not categorical with the code below:

cols = df.columns
# Note: _get_numeric_data is a private pandas method
num_cols = df._get_numeric_data().columns
# num_cols now holds the names of the numeric columns
# (in your case Age, Height, etc.)

Alternatively, you can use df.select_dtypes with its include or exclude parameters, depending on the dtypes in your dataframe.
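A minimal sketch of the select_dtypes route, assuming a small made-up dataframe (the column names and values here are hypothetical and only mirror the question):

```python
import pandas as pd

# Hypothetical sample frame; not the asker's real data.
df = pd.DataFrame({
    "Age": [23, 31, 27],
    "Height": [170.0, 182.5, 165.0],
    "Sex": ["M", "F", "M"],
})

# Names of the numeric columns (integer and float dtypes).
num_cols = df.select_dtypes(include="number").columns.tolist()

# Names of everything else, here the categorical column.
cat_cols = df.select_dtypes(exclude="number").columns.tolist()
```

This avoids the private _get_numeric_data method and should give the same column names for a frame like this one.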

Then run the code below for each of the columns found above. It keeps only the rows whose value lies within three standard deviations of the column mean:

df[np.abs(df.Data - df.Data.mean()) <= (3 * df.Data.std())]
# df is the dataframe and Data is the name of the column.
# In your case, it will be Age, Height, etc.
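As a rough illustration of that filter on a single column, here is a sketch with made-up data (the values and the "Sex" column are assumptions). It uses bracket indexing so the column name can be held in a variable, and it shows that the categorical column is carried along untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 20 ordinary ages plus one extreme value.
df = pd.DataFrame({
    "Age": [25] * 10 + [30] * 10 + [300],
    "Sex": ["M", "F"] * 10 + ["M"],
})

col = "Age"
# True for rows within three standard deviations of the mean.
keep = np.abs(df[col] - df[col].mean()) <= 3 * df[col].std()
cleaned = df[keep]
```

Because the mask is computed from one column but applied to the whole dataframe, every other column of a kept row stays as it was.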

OR

If you want to work on only the numerical columns and filter out the outliers in one shot, use the code below:

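# Restrict the check to the numeric columns; calling x.mean() on a categorical column would fail
df[df[num_cols].apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]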
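The snippets above use a mean and standard deviation rule, whereas the question asks for the IQR method. The same pattern can be sketched with the usual 1.5 * IQR fences around the quartiles. The helper name iqr_mask and the sample data are hypothetical, and ath2 here is only a stand-in for the asker's dataframe:

```python
import pandas as pd

def iqr_mask(s, k=1.5):
    """Return True where s lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    # between() is False for NaN, so rows with missing values are dropped too.
    return s.between(q1 - k * iqr, q3 + k * iqr)

# Hypothetical stand-in for the asker's dataframe.
ath2 = pd.DataFrame({
    "Age": [22, 24, 25, 26, 27, 28, 90],
    "Height": [170, 172, 175, 178, 180, 182, 181],
    "Weight": [60, 65, 70, 72, 75, 80, 78],
    "Sport": ["Judo", "Rowing", "Judo", "Golf", "Golf", "Rowing", "Judo"],
})

cols = ["Age", "Height", "Weight"]
# A row is kept only if it is within the fences in all three columns.
keep = ath2[cols].apply(iqr_mask).all(axis=1)
ath2_clean = ath2[keep]
```

Only the three listed columns are tested, so the categorical columns never enter the calculation but are still present in ath2_clean.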
