How do I remove outliers in a dataset that has both categorical and numerical data?
I'm trying to remove outliers from the 'price' column in a dataset. I have been able to create a data frame of the outliers with their corresponding values in the other columns, but I'm struggling to exclude these entries from the parent dataset. How do I go about this?
This is the code I used to create the data frame described above:
lower_limit = pq1 - 1.5 * iqr
upper_limit = pq3 + 1.5 * iqr
newdf = df[(df['price'] < lower_limit) | (df['price'] > upper_limit)]
newdf
I tried putting the tilde (~) operator in front of the boolean expression, but that didn't give the desired results.
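The opposite condition would be (using >= and <= so that values lying exactly on a limit are kept, since the outlier test uses strict < and >):
newdf = df[(df['price'] >= lower_limit) & (df['price'] <= upper_limit)]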
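For reference, here is a minimal self-contained sketch of the whole filter. The sample data and the way pq1, pq3 and iqr are computed (via Series.quantile) are assumptions, since the question does not show them. It also illustrates that ~ does work as long as it negates the complete mask rather than only the first comparison.

```python
import pandas as pd

# Hypothetical sample data: one categorical and one numerical column.
df = pd.DataFrame({
    "category": ["a", "b", "a", "c", "b", "c", "a"],
    "price": [10, 12, 11, 13, 12, 500, 11],
})

# Assumed definitions; the question does not show how these were computed.
pq1 = df["price"].quantile(0.25)
pq3 = df["price"].quantile(0.75)
iqr = pq3 - pq1

lower_limit = pq1 - 1.5 * iqr
upper_limit = pq3 + 1.5 * iqr

# Boolean mask that is True for the outlier rows.
is_outlier = (df["price"] < lower_limit) | (df["price"] > upper_limit)

outliers = df[is_outlier]

# ~ negates the whole mask. Writing ~df["price"] < lower_limit instead would
# negate the column values first, which is a different expression.
df_not_outliers = df[~is_outlier]

print(outliers)
print(df_not_outliers)
```

Storing the mask in a variable first avoids operator-precedence surprises, and all other columns (categorical or not) are carried along because the mask selects whole rows.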
You could use the .loc attribute to get a subset of your original dataframe that excludes the rows of the newdf dataframe, selecting through the indices:
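lower_limit = pq1 - 1.5 * iqr
upper_limit = pq3 + 1.5 * iqr
newdf = df[(df['price'] < lower_limit) | (df['price'] > upper_limit)]
# Keep the rows whose index labels are not in newdf. Recent pandas versions
# reject a set as a .loc indexer, so a boolean mask on the index is used here.
df_not_outliers = df.loc[~df.index.isin(newdf.index)]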
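A variation on the same index-based idea is DataFrame.drop, which removes rows by index label and leaves the remaining rows in their original order. The sketch below is only an illustration: the sample data and the quantile-based definitions of pq1, pq3 and iqr are assumptions, and it presumes the index labels are unique (with duplicate labels, drop would remove every row sharing a dropped label).

```python
import pandas as pd

# Hypothetical sample data with a categorical and a numerical column.
df = pd.DataFrame({
    "category": ["x", "y", "x", "z", "y", "z"],
    "price": [20.0, 22.0, 21.0, 1.0, 23.0, 22.0],
})

# Assumed definitions of the quartiles and the interquartile range.
pq1 = df["price"].quantile(0.25)
pq3 = df["price"].quantile(0.75)
iqr = pq3 - pq1

lower_limit = pq1 - 1.5 * iqr
upper_limit = pq3 + 1.5 * iqr

# Data frame holding only the outlier rows, as in the question.
newdf = df[(df["price"] < lower_limit) | (df["price"] > upper_limit)]

# Drop those rows from the parent data frame by their index labels.
df_not_outliers = df.drop(newdf.index)

print(df_not_outliers)
```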