
R: remove outliers from a list of data.frames and make a new list of data.frames?

I have a list of 6 data.frames.

Each one has 3 columns:

id, T_C, Sales

T_C is either TEST or CONTROL.

Someone helped me here, and I learned how to find the mean() and sd() by looping instead of writing individual statements.

Now my goal is to remove the outliers from each of the 6 data.frames and produce a new list of 6 (after removing outliers).

str(dfList) # this is the list of 6 data.frames

I am able to get the mean() and sd() of each data.frame like this (the code uses the column names TC_INDICATOR and NET_SPEND for T_C and Sales):

library(dplyr)

list_mean_sd <- lapply(dfList,
                       function(df) 
                        {
                         df %>%
                           group_by(TC_INDICATOR) %>%
                           summarise(mean = mean(NET_SPEND),
                                     sd = sd(NET_SPEND))
                        })

> str(list_mean_sd)
List of 6  (1 obs. of  2 variables:)

I can select them individually for mean or sd:

sapply(list_mean_sd, "[", "mean")
sapply(list_mean_sd, "[", "sd")

Basically, my goal is to identify the outliers and remove them, producing an alternative set, or after-set.

Outliers are values below mean - 3*sd or above mean + 3*sd.

I have this done, but with more manual steps. I'm looking to learn how to loop through these sets and the like. Thanks in advance for helping me!

Give this a shot. First I create data, which I split into six data frames housed in a list.

set.seed(0)
test_data <- data.frame(id = 1:10000, 
                        T_C = sample(c(TRUE, FALSE), size = 10000, replace = TRUE),
                        Sales = rnorm(n = 10000),
                        grp = sample(c("a", "b", "c", "d", "e", "f"), 
                                     size = 10000, replace = TRUE))

test_split <- split(test_data, test_data$grp)

Then I use lapply on this list to compute what I'm calling z_score: the difference between each individual Sales value and the mean of Sales, divided by the sd of Sales, computed within each T_C group. Finally, filter pulls out the rows whose z_score has an absolute value of 3 or more.

library(dplyr)
outlier_list <- lapply(test_split, 
       function(m) group_by(m, T_C) %>% mutate(z_score = (Sales - mean(Sales)) / sd(Sales)) %>%
         ungroup() %>% filter(abs(z_score) >= 3)
)

> outlier_list
$a
# A tibble: 5 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1   468  TRUE -2.995332      a -3.073314
2  3026  TRUE  3.028495      a  3.075258
3  5188  TRUE -3.097847      a -3.177952
4  7993 FALSE -3.571076      a -3.823983
5  9105  TRUE -3.216710      a -3.299276

$b
# A tibble: 6 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1   264  TRUE  3.003494      b  3.003329
2  2172  TRUE  3.001475      b  3.001326
3  2980 FALSE -3.176356      b -3.222782
4  3366 FALSE  3.009292      b  3.048559
5  7477 FALSE  3.348301      b  3.392265
6  7583  TRUE -3.089758      b -3.040911

$c
# A tibble: 2 × 5
     id   T_C    Sales    grp  z_score
  <int> <lgl>    <dbl> <fctr>    <dbl>
1  8078  TRUE 3.015343      c 3.129923
2  8991 FALSE 3.113526      c 3.058302

$d
# A tibble: 5 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1   544  TRUE  3.289070      d  3.168235
2  3791 FALSE  3.791938      d  3.769810
3  6771 FALSE -3.157741      d -3.166861
4  7864  TRUE  3.164128      d  3.045728
5  9371  TRUE -3.026884      d -3.024655

$e
# A tibble: 6 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1   186 FALSE  3.021541      e  3.046079
2  1211  TRUE  3.414337      e  3.343521
3  1665  TRUE  3.546282      e  3.473614
4  3765 FALSE  3.363641      e  3.391142
5  4172  TRUE  3.348820      e  3.278923
6  7973 FALSE -2.987790      e -3.015284

$f
# A tibble: 6 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1  1089  TRUE -3.195090      f -3.189979
2  2452 FALSE  3.287591      f  3.212317
3  3486 FALSE -3.334942      f -3.367962
4  4198 FALSE -3.102578      f -3.137082
5  8183  TRUE  3.081077      f  3.075324
6  8656  TRUE  3.253873      f  3.247822

Obviously, this will give you only the outliers. If you want to keep only the inliers, change the >= 3 to < 3.

Updated to run a Wilcoxon test on the inliers

inlier_list <- lapply(test_split, 
                       function(m) group_by(m, T_C) %>% 
                        mutate(z_score = (Sales - mean(Sales)) / sd(Sales)) %>%
                         ungroup() %>% filter(abs(z_score) < 3)
)
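
As a quick sanity check on the filter, here is a minimal base-R sketch of the same inlier rule, with a count of how many rows it drops from each data frame. The helper name keep_inliers and its cutoff argument are made up for illustration; ave() is used here in place of the dplyr group_by/mutate pipeline, so this is an alternative way of writing the step rather than the code above.

```r
# Rebuild the example data so this block runs on its own.
set.seed(0)
test_data <- data.frame(id = 1:10000,
                        T_C = sample(c(TRUE, FALSE), size = 10000, replace = TRUE),
                        Sales = rnorm(n = 10000),
                        grp = sample(c("a", "b", "c", "d", "e", "f"),
                                     size = 10000, replace = TRUE))
test_split <- split(test_data, test_data$grp)

# Hypothetical helper: keep rows whose within-T_C z-score is below the cutoff.
keep_inliers <- function(m, cutoff = 3) {
  # ave() applies FUN within each T_C group and returns a vector aligned with m
  z <- ave(m$Sales, m$T_C, FUN = function(x) (x - mean(x)) / sd(x))
  m[abs(z) < cutoff, , drop = FALSE]
}

inlier_list_base <- lapply(test_split, keep_inliers)

# Rows before, rows kept, and rows dropped for each of the six data frames.
row_counts <- data.frame(before = sapply(test_split, nrow),
                         kept   = sapply(inlier_list_base, nrow))
row_counts$dropped <- row_counts$before - row_counts$kept
row_counts
```

The result is still a list of six data frames, so it can be passed to lapply in the same way as the dplyr version.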

We just run lapply on the list of inliers, using the parameters noted in the OP's comment.

wilcox_test_res <- lapply(inlier_list, 
                          function(m) wilcox.test(m$Sales ~ m$T_C, 
                                                  mu = mean(m$Sales[m$T_C == TRUE]), 
                                                  conf.level = 0.95))  # any further arguments are cut off in the source
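
Each element of the result is an htest object, so the individual statistics can be pulled out with sapply. The sketch below is self-contained: it rebuilds the example data, filters inliers in base R, and passes only the arguments visible above to wilcox.test. The names inlier_demo, wilcox_demo, mu_test and p_values are illustrative and not part of the original answer.

```r
# Rebuild the example data and the inlier list so this block runs on its own.
set.seed(0)
test_data <- data.frame(id = 1:10000,
                        T_C = sample(c(TRUE, FALSE), size = 10000, replace = TRUE),
                        Sales = rnorm(n = 10000),
                        grp = sample(c("a", "b", "c", "d", "e", "f"),
                                     size = 10000, replace = TRUE))

inlier_demo <- lapply(split(test_data, test_data$grp), function(m) {
  z <- ave(m$Sales, m$T_C, FUN = function(x) (x - mean(x)) / sd(x))
  m[abs(z) < 3, , drop = FALSE]
})

# One two-sample Wilcoxon test per data frame, mirroring the call above.
wilcox_demo <- lapply(inlier_demo, function(m) {
  mu_test <- mean(m$Sales[m$T_C])  # same mu as in the answer's call
  wilcox.test(Sales ~ T_C, data = m, mu = mu_test, conf.level = 0.95)
})

# Named numeric vector with one p-value per data frame.
p_values <- sapply(wilcox_demo, function(res) res$p.value)
p_values
```

The same pattern works for other components of an htest object, for example res$statistic.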
