
Should I treat these data points as outliers?

I am building my analytics portfolio as part of the Google Data Analytics course. I chose to analyze the Divvy Bike Sharing data for 2021, but I'm stuck at the point where I need to identify outliers in the dataset. I'm focusing on the 'ride_length' column, which shows the duration of each ride, and I'm using two methods:

  1. IQR (data points that fall below the 25th or above the 75th percentile are outliers)
  2. 1% and 99% rule (data points that fall below the 1st percentile or above the 99th percentile are outliers)

Note: the ride_length column is measured in minutes.

A) IQR METHOD

The first method I use to detect outliers is the IQR proximity rule (data points that fall below the 25th percentile or above the 75th percentile are outliers). Here's the code:

lower_bound_iqr <- quantile(df_2021_test$ride_length, 0.25)
upper_bound_iqr <- quantile(df_2021_test$ride_length, 0.75)

lower_bound_iqr
25% 
6.98 

upper_bound_iqr
75% 
20.98 

Key takeaways:

  1. A ride_length below 6.98 minutes is considered an outlier
  2. A ride_length above 20.98 minutes is considered an outlier

Then I compute the percentage of outliers in the data:

outliers_iqr <- which(df_2021_test$ride_length < lower_bound_iqr | df_2021_test$ride_length > upper_bound_iqr)
(count(df_2021_test[outliers_iqr, ]) / count(df_2021_test)) * 100 

         n
1 49.89316

The result is that 49.89% of the data are considered outliers. I think this is too much data to exclude before the analysis begins, as it would reduce the accuracy of the analysis. Or am I wrong? I therefore moved on to the second method.

B) 1% and 99% Percentile Rule

This method states that data points above the 99th percentile or below the 1st percentile are considered outliers. Here's the code:

lower_bound <- quantile(df_2021_test$ride_length, 0.01)
upper_bound <- quantile(df_2021_test$ride_length, 0.99)

lower_bound
1% 
1.82 
upper_bound
99% 
115.63 

Key takeaways:

  1. A ride_length below 1.82 minutes is considered an outlier
  2. A ride_length above 115.63 minutes (approx. 2 hours) is considered an outlier

Again, I compute the percentage of outliers in the data:

outliers <- which(df_2021_test$ride_length < lower_bound | df_2021_test$ride_length > upper_bound)
(count(df_2021_test[outliers, ]) / count(df_2021_test)) * 100

         n
1 1.982182

The result is that 1.98% of the data are considered outliers. I think it is fine to exclude this much before the analysis begins, as it will not reduce the accuracy of the analysis by much. Or am I wrong?

Here are my questions:

  1. When identifying outliers in the data, which of the two methods above should I choose? Or is there a better method?
  2. Is my way of identifying outliers in the dataset correct? Or am I missing something?

I have detailed all of my steps for identifying outliers above. Again, this is not an error in the code; I'm just unsure whether my method of identifying them is correct, or whether there is a better way or something I'm missing.

It should be obvious that 50% of your data will fall above the 75th or below the 25th centile, and that 2% will fall either above the 99th or below the 1st. That follows from the definition of a percentile, whatever the data look like.
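A quick simulation makes the point concrete. This is only a sketch: the right-skewed distribution and its parameters are arbitrary assumptions, not fitted to the Divvy data, and `share_outside` is a hypothetical helper name. The flagged share is fixed by the percentiles you pick, not by anything in the data.

```r
# Simulated ride lengths in minutes; the distribution and its parameters
# are arbitrary assumptions chosen only for illustration.
set.seed(42)
ride_length <- rlnorm(100000, meanlog = 2.5, sdlog = 0.8)

# Share of observations (in percent) outside the [lower, upper] quantiles.
# `share_outside` is a hypothetical helper written for this sketch.
share_outside <- function(x, lower, upper) {
  bounds <- quantile(x, c(lower, upper))
  mean(x < bounds[1] | x > bounds[2]) * 100
}

share_iqr   <- share_outside(ride_length, 0.25, 0.75)  # about 50 by construction
share_tails <- share_outside(ride_length, 0.01, 0.99)  # about 2 by construction
print(c(quartiles = share_iqr, tails = share_tails))
```

Swapping in a different distribution for `ride_length` should leave both shares essentially unchanged.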

There is no data-driven answer to 'what is an outlier'. Data points might be far away from the rest of a distribution for many different reasons, and how you identify them and what you do with them should depend on why you think they occurred and what downstream analysis you are planning (ultimately, what question you are asking of the data).

If you can specify these things you'll have a clearer idea of how to identify and handle outliers. Do not rely on simple data-driven rules.
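As a rough illustration of a reason-driven approach, the sketch below labels rides using cut-offs that come from an argument about how the values arose rather than from percentiles. Everything here is an assumption for illustration: the toy values, the two thresholds, and the reasons attached to them would all need to be justified for the real data.

```r
# Toy ride lengths in minutes (made-up values, not Divvy data).
rides <- data.frame(ride_length = c(0.4, 3.2, 12.5, 18.0, 45.7, 130.0, 2900.0))

# Hypothetical cut-offs: replace them with whatever your own reasoning supports.
min_plausible <- 1        # assumption: shorter rides are likely false starts
max_plausible <- 24 * 60  # assumption: longer rides are likely unreturned bikes

# Label each ride with a reason instead of silently dropping it.
rides$flag <- ifelse(rides$ride_length < min_plausible, "too_short",
              ifelse(rides$ride_length > max_plausible, "too_long", "ok"))

print(table(rides$flag))
```

Keeping the flag as a column, rather than deleting rows, lets you rerun the downstream analysis with and without the flagged rides and see whether the conclusions change.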

George is correct. Data-driven "rules" for outlier detection and removal rarely work on their own. As an example, using the starwars dataset in R, I have plotted the heights and masses of the Star Wars movie characters below:

#### Load Library ####
library(tidyverse)
theme_set(theme_bw())

#### Plot Obvious Outlier ####
starwars %>% 
  ggplot(aes(x=mass,
             y=height))+
  geom_point()+
  geom_smooth(method = "lm")+
  labs(x="Mass",
       y="Height",
       title = "Star Wars: Mass x Height")

You can see a very obvious outlier. This is Jabba the Hutt, whose mass is several leagues above the others in this data. It is reasonable to assume that this point strongly affects the plotted regression, so the fitted line does not model the relationship very accurately.

[Image: the "Star Wars: Mass x Height" plot with the outlier included]

We can remove it simply by filtering it out with this code:

#### Remove Outlier and Plot ####
starwars %>% 
  filter(mass < 1000) %>% 
  ggplot(aes(x=mass,
             y=height))+
  geom_point()+
  geom_smooth(method = "lm")+
  labs(x="Mass",
       y="Height",
       title = "Star Wars: Mass x Height")

[Image: the "Star Wars: Mass x Height" plot with the outlier removed]
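To put a number on how much that single point matters, one option is to compare the fitted slopes with and without it. A minimal sketch, assuming the `starwars` data that ships with dplyr and the same `mass < 1000` cut-off as in the filter above:

```r
library(dplyr)  # provides the starwars dataset

# Fit height ~ mass with and without the extreme mass value.
# lm() drops rows with missing mass or height by default.
fit_all     <- lm(height ~ mass, data = starwars)
fit_trimmed <- lm(height ~ mass, data = dplyr::filter(starwars, mass < 1000))

# Compare the estimated slopes (cm of height per kg of mass).
slopes <- c(all     = unname(coef(fit_all)["mass"]),
            trimmed = unname(coef(fit_trimmed)["mass"]))
print(slopes)
```

The slope from the full data should come out much flatter than the slope from the trimmed data, which matches what the two plots show.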

Now let's say we didn't have this very obvious outlier and had started with this particular subset of the data. Suppose we then tried to flag outliers using the often-used 1.5 × IQR method to decide what to remove:

#### Highlight Outliers in Plot ####
starwars %>% 
  filter(mass < 1000) %>% 
  mutate(outlier.height = rstatix::is_outlier(height),
         outlier.mass = rstatix::is_outlier(mass)) %>% 
  ggplot(aes(x=mass,
             y=height))+
  geom_point(aes(color=outlier.height))+
  geom_smooth(method = "lm")+
  labs(x="Mass",
       y="Height",
       title = "Star Wars: Mass x Height",
       color = "Height Outlier?")

You will notice that not only is a substantial number of points now flagged as outliers, but removing them would fundamentally alter the real associations in the data:

[Image: the same plot with height outliers highlighted by color]

Therefore, it is best to decide what counts as a meaningful inclusion or exclusion of an outlier. In the case of Jabba the Hutt, even separating him from the pack is questionable (there may be others of his species who are just as large, and we may want to predict that in some way). This being the case, outliers require more thought than a quantile rule.
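For completeness, this is roughly what a 1.5 × IQR rule computes. It flags values lying more than 1.5 interquartile ranges beyond the quartiles, which is not the same as flagging everything outside the 25th and 75th percentiles. The sketch below uses base R only; `tukey_outlier` is a hypothetical helper name and the ride lengths are made-up values. It is a starting point for inspecting points, not a removal rule.

```r
# Flag values beyond Q1 - coef * IQR or Q3 + coef * IQR (Tukey's fences).
# `tukey_outlier` is a hypothetical helper written for this sketch.
tukey_outlier <- function(x, coef = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < q[1] - coef * iqr | x > q[2] + coef * iqr
}

# Toy ride lengths in minutes (made-up values, not Divvy data).
ride_length <- c(2, 5, 7, 9, 11, 12, 14, 16, 21, 25, 180)
flags <- tukey_outlier(ride_length)
print(ride_length[flags])
```

On this toy vector only the 180-minute ride is flagged, whereas a plain quartile cut-off would flag roughly half of the values.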
