![](/img/trans.png)
[英]How to extract a vector in a tibble column to multiple columns in the same tibble?
[英]How to extract list elements into multiple tibble columns in r?
我有一個非常大的 tibble 形式的數據集。 我想使用一些返回列表的函數來總結數據。 我對列表的幾個組件感興趣,我想將我需要的每個組件返回到新的 tibble 列中。
這是一個例子
library(tibble)
library(dplyr)
# Create a data set of 1,000 random values in 100 subgroups with sample size 10
contrived_data <- tibble(subgroup = rep(1:100, each = 10),
value = rnorm(1000, mean = 5, sd = 1))
# Run the KS test vs. normal distribution on each sample of size 10. Return the KS statistic and p-value
# into new tibble columns
contrived_data %>% group_by(subgroup) %>%
summarize(avg = mean(value),
std_dev = sd(value),
ks_stat = ks.test(value, "pnorm", mean = 5, sd = 1)$statistic,
ks_pval = ks.test(value, "pnorm", mean = 5, sd = 1)$p.value)
以這種方式運行它會得到我想要的結果,但效率不高。 兩次調用ks.test
函數意味着執行時間(幾乎)加倍。 似乎必須有一種更有效的方法來通過單個函數調用來提取這兩個列表組件,但我不知道該怎么做。
您可以定義函數並使用來自 purrr 的映射:
library(tibble)
library(dplyr)
library(purrr)
func = function(DA){
kstest = ks.test(DA$value, "pnorm", mean = 5, sd = 1)
data.frame(
subgroup = unique(DA$subgroup),
avg=mean(DA$value),
std_dev = sd(DA$value),
ks_stat = kstest$statistic,
ks_pval = kstest$p.value)
}
contrived_data %>%
split(.$subgroup) %>%
map_dfr(func)
測試可以運行一次並包裝在一個list
,然后使用map
(來自purrr
)來提取值
library(purrr)
library(dplyr)
library(tidyr)
contrived_data %>%
group_by(subgroup) %>%
summarize(avg = mean(value),
std_dev = sd(value),
test = list(ks.test(value, "pnorm", mean = 5, sd = 1))) %>%
mutate(out = map(test, ~ tibble(ks_stat = .x$statistic,
ks_pval = .x$p.value))) %>%
unnest_wider(c(out)) %>%
select(-test)
# A tibble: 100 x 5
# subgroup avg std_dev ks_stat ks_pval
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 4.52 0.675 0.375 0.0907
# 2 2 5.17 1.02 0.342 0.152
# 3 3 5.02 0.909 0.141 0.972
# 4 4 5.08 0.846 0.313 0.227
# 5 5 4.82 0.819 0.225 0.614
# 6 6 5.07 0.866 0.159 0.928
# 7 7 4.94 0.914 0.145 0.966
# 8 8 5.52 1.01 0.290 0.306
# 9 9 5.17 0.787 0.258 0.443
#10 10 4.61 1.15 0.476 0.0132
# … with 90 more rows
另一種選擇是tidy
輸出(使用broom
)並一次提取所有組件
library(broom)
contrived_data %>%
group_by(subgroup) %>%
summarize(avg = mean(value),
std_dev = sd(value),
out = list(tidy(ks.test(value, "pnorm", mean = 5, sd = 1)))) %>%
unnest_wider(c(out))
使用rowwise
命令的dplyr
解決方案,它執行與map
相同的任務。
contrived_data %>%
group_by(subgroup) %>%
summarise(
avg = mean(value),
std_dev = sd(value),
ks_test = list(ks.test(value,"pnorm",mean=5,sd=1))
) %>%
ungroup() %>%
rowwise() %>%
mutate(
ks_stat = ks_test$statistic,
ks_pval = ks_test$p.value
) %>%
ungroup() %>%
select(-ks_test)
# A tibble: 100 x 5
# subgroup avg std_dev ks_stat ks_pval
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 1 5.10 1.24 0.186 0.819
# 2 2 4.86 0.805 0.231 0.584
# 3 3 5.24 0.729 0.258 0.445
# 4 4 5.16 0.642 0.307 0.247
# 5 5 4.63 0.752 0.393 0.0664
# Benchmark using rbenchmark:
# test replications elapsed relative user.self sys.self user.child sys.child
#2 nested 1000 10.58 1.000 10.58 0 NA NA
#1 original 1000 16.75 1.583 16.73 0 NA NA
您可以使用 group_modify
library(tidyverse)
contrived_data %>%
group_by(subgroup) %>%
group_modify(~{
ks <- ks.test(.$value, "pnorm", mean = 5, sd = 1)
tibble(
avg = mean(.$value),
std_dev = sd(.$value),
ks_stat = ks$statistic,
ks_pval = ks$p.value)
})
或者使用 data.table
library(data.table)
setDT(contrived_data)
contrived_data[, {
ks <- ks.test(value, "pnorm", mean = 5, sd = 1)
.(avg = mean(value),
std_dev = sd(value),
ks_stat = ks$statistic,
ks_pval = ks$p.value)
}, by = subgroup]
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.