R: remove outliers from a list of data.frames and make a new list of data.frames?
I have a list of 6 data.frames.
Each has 3 columns:
id, T_C, Sales
T_C is TEST or CONTROL.
Someone helped me here and I learned how to find the mean() and sd() by looping, instead of writing individual statements.
Now my goal is to remove the outliers from the 6 data.frames and produce a new list of 6 (after removing outliers).
str(dfList) # this is the list of 6 data.frames
I am able to get the mean() and sd() of each list element like this:
list_mean_sd <- lapply(dfList, function(df) {
  df %>%
    group_by(TC_INDICATOR) %>%
    summarise(mean = mean(NET_SPEND),
              sd = sd(NET_SPEND))
})
> str(list_mean_sd)
List of 6 (1 obs. of 2 variables:)
I can select the mean or the sd from each one individually:
sapply(list_mean_sd, "[", "mean")
sapply(list_mean_sd, "[", "sd")
Basically, my goal is to identify the outliers and remove them, producing an alternative set, or "after" set.
Outliers are: values below mean - 3*sd() or above mean + 3*sd()
I have this done, but with more manual steps. I'm looking to learn how to loop through these sets, so thanks in advance for helping me!
Give this a shot. First I create data, which I split into six data frames that are housed in a list.
set.seed(0)
test_data <- data.frame(id = 1:10000,
                        T_C = sample(c(TRUE, FALSE), size = 10000, replace = TRUE),
                        Sales = rnorm(n = 10000),
                        grp = sample(c("a", "b", "c", "d", "e", "f"),
                                     size = 10000, replace = TRUE))
test_split <- split(test_data, test_data$grp)
Then, I use lapply on this list to compute what I'm calling the z_scores: within each T_C group, the difference between each individual Sales value and the mean of Sales, divided by the sd of Sales. Finally, we use filter on these to pull out the rows whose z_score has an absolute value of at least 3.
library(dplyr)
outlier_list <- lapply(test_split, function(m) {
  group_by(m, T_C) %>%
    mutate(z_score = (Sales - mean(Sales)) / sd(Sales)) %>%
    ungroup() %>%
    filter(abs(z_score) >= 3)
})
> outlier_list
$a
# A tibble: 5 × 5
id T_C Sales grp z_score
<int> <lgl> <dbl> <fctr> <dbl>
1 468 TRUE -2.995332 a -3.073314
2 3026 TRUE 3.028495 a 3.075258
3 5188 TRUE -3.097847 a -3.177952
4 7993 FALSE -3.571076 a -3.823983
5 9105 TRUE -3.216710 a -3.299276
$b
# A tibble: 6 × 5
id T_C Sales grp z_score
<int> <lgl> <dbl> <fctr> <dbl>
1 264 TRUE 3.003494 b 3.003329
2 2172 TRUE 3.001475 b 3.001326
3 2980 FALSE -3.176356 b -3.222782
4 3366 FALSE 3.009292 b 3.048559
5 7477 FALSE 3.348301 b 3.392265
6 7583 TRUE -3.089758 b -3.040911
$c
# A tibble: 2 × 5
id T_C Sales grp z_score
<int> <lgl> <dbl> <fctr> <dbl>
1 8078 TRUE 3.015343 c 3.129923
2 8991 FALSE 3.113526 c 3.058302
$d
# A tibble: 5 × 5
id T_C Sales grp z_score
<int> <lgl> <dbl> <fctr> <dbl>
1 544 TRUE 3.289070 d 3.168235
2 3791 FALSE 3.791938 d 3.769810
3 6771 FALSE -3.157741 d -3.166861
4 7864 TRUE 3.164128 d 3.045728
5 9371 TRUE -3.026884 d -3.024655
$e
# A tibble: 6 × 5
id T_C Sales grp z_score
<int> <lgl> <dbl> <fctr> <dbl>
1 186 FALSE 3.021541 e 3.046079
2 1211 TRUE 3.414337 e 3.343521
3 1665 TRUE 3.546282 e 3.473614
4 3765 FALSE 3.363641 e 3.391142
5 4172 TRUE 3.348820 e 3.278923
6 7973 FALSE -2.987790 e -3.015284
$f
# A tibble: 6 × 5
id T_C Sales grp z_score
<int> <lgl> <dbl> <fctr> <dbl>
1 1089 TRUE -3.195090 f -3.189979
2 2452 FALSE 3.287591 f 3.212317
3 3486 FALSE -3.334942 f -3.367962
4 4198 FALSE -3.102578 f -3.137082
5 8183 TRUE 3.081077 f 3.075324
6 8656 TRUE 3.253873 f 3.247822
Obviously, this will give you only the outliers. If you want to keep only the inliers, change the >= 3 to < 3.
inlier_list <- lapply(test_split,
function(m) group_by(m, T_C) %>%
mutate(z_score = (Sales - mean(Sales)) / sd(Sales)) %>%
ungroup() %>% filter(abs(z_score) < 3)
)
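If you'd rather not depend on dplyr, the same filtering can be sketched in base R. This is only an illustration under a few assumptions: remove_outliers, demo_data, demo_split and demo_inliers are hypothetical names I made up for the example, the columns are assumed to be called T_C and Sales as in the test data, and the cutoff k = 3 comes from the question's mean +/- 3*sd rule.

```r
# Hypothetical helper (not part of the code above): keep the rows whose Sales
# lies strictly within mean +/- k * sd, computed separately per T_C group.
remove_outliers <- function(df, k = 3) {
  grp_mean <- ave(df$Sales, df$T_C, FUN = mean)  # group mean, repeated per row
  grp_sd   <- ave(df$Sales, df$T_C, FUN = sd)    # group sd, repeated per row
  keep <- abs(df$Sales - grp_mean) < k * grp_sd
  # which() drops rows where the comparison is NA (e.g. a group with a single
  # row, where sd is NA), which mirrors what dplyr::filter() does.
  df[which(keep), , drop = FALSE]
}

# Small illustrative data set, built the same way as test_data.
set.seed(0)
demo_data <- data.frame(id = 1:1000,
                        T_C = sample(c(TRUE, FALSE), size = 1000, replace = TRUE),
                        Sales = rnorm(n = 1000),
                        grp = sample(c("a", "b", "c"), size = 1000, replace = TRUE))
demo_split <- split(demo_data, demo_data$grp)

# One cleaned data frame per list element; lapply keeps the list names.
demo_inliers <- lapply(demo_split, remove_outliers)

# Number of rows dropped from each element of the list.
sapply(demo_split, nrow) - sapply(demo_inliers, nrow)
```

Comparing the row counts before and after, as in the last line, is a quick sanity check that only a handful of rows were removed from each data frame.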
We just run lapply on the list of inliers using the parameters noted in OP's comment.
wilcox_test_res <- lapply(inlier_list,
function(m) wilcox.test(m$Sales ~ m$T_C,
mu= mean(m$Sales[m$T_C == TRUE]),
conf.level=0.95,