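Transforming an R data frame by applying a function rowwise and creating (possibly) larger columns
I am trying to transform a data frame (tibble) by using each row as the arguments to a function and creating a new column from the result, which may be longer than the number of arguments. Consider the following example, where I have some sample observations: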
library(dplyr)
library(stringi)
observations <- c("110", "11011", "1100010")
df <- tibble(obs = observations) %>%
  transmute(
    Failure = stri_count(obs, fixed = "0"),
    Success = stri_count(obs, fixed = "1")
  )
df
which gives:
# A tibble: 3 x 2
Failure Success
<int> <int>
1 1 2
2 1 4
3 4 3
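I would like to take each row, use it to compute a vector of values, and store each resulting vector in a new column. For example, I would like to do:
p_values <- seq(from = 0, to = 1, length.out = 11)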
df %>%
  rowwise() %>%
  transmute(
    p = p_values,
    likelihood = dbinom(Success,
      size = Failure + Success,
      prob = p_values
    )
  )
Error: Column `p` must be length 1 (the group size), not 11
and get something like:
# A tibble: 11 x 4
p_values likelihood_1 likelihood_2 likelihood_3
<float> <float> <float> <float>
1 0 ... ... ...
2 0.1 ... ... ...
... ... ... ... ...
10 0.9 ... ... ...
11 1 ... ... ...
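This kind of workflow can be a little awkward with the tidyverse approach, because the data are not in a "tidy" format. I would come at it from the other direction and start from the p_values vector: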
likelihoods <-
  tibble(p = p_values) %>%
  mutate(
    likelihood_1 = dbinom(df[1, ]$Success, size = df[1, ]$Failure + df[1, ]$Success, prob = p),
    likelihood_2 = dbinom(df[2, ]$Success, size = df[2, ]$Failure + df[2, ]$Success, prob = p),
    likelihood_3 = dbinom(df[3, ]$Success, size = df[3, ]$Failure + df[3, ]$Success, prob = p)
  )
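The three hand-written columns above follow the same pattern, so the idea can be generalized to any number of rows. The following is only a base R sketch of that idea, not part of the original answer; it rebuilds df by hand from the counts shown earlier, and the use of mapply() is my own choice for illustration.

```r
# Counts taken from the example output shown earlier (rebuilt by hand here)
df <- data.frame(Failure = c(1, 1, 4), Success = c(2, 4, 3))
p_values <- seq(from = 0, to = 1, length.out = 11)

# One likelihood vector per row of df; mapply() binds them into an 11 x 3 matrix
lik <- mapply(
  function(f, s) dbinom(s, size = f + s, prob = p_values),
  df$Failure, df$Success
)
colnames(lik) <- paste0("likelihood_", seq_len(ncol(lik)))

# Put the grid of probabilities in front of the likelihood columns
likelihoods <- data.frame(p = p_values, lik)
likelihoods
```

Each column of the result corresponds to one row of df, so adding observations only adds columns.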
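The problem is that transmute or mutate expects the number of elements to match the number of rows (or, if the data are grouped, the number of rows in that group). Here we are grouping with rowwise, which basically makes each row its own group, so the expected n() is 1 while the output has the length of p_values. One option is to wrap the results in a list, unnest, and then reshape to "wide" format with pivot_wider (if needed):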
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  mutate(grp = str_c('likelihood_', row_number())) %>%
  rowwise() %>%
  transmute(grp, p = list(p_values),
    likelihood = list(dbinom(Success,
      size = Failure + Success,
      prob = p_values
    ))
  ) %>%
  unnest(c(p, likelihood)) %>%
  pivot_wider(names_from = grp, values_from = likelihood)
# A tibble: 11 x 4
# p likelihood_1 likelihood_2 likelihood_3
# <dbl> <dbl> <dbl> <dbl>
# 1 0 0 0 0
# 2 0.1 0.027 0.00045 0.0230
# 3 0.2 0.096 0.0064 0.115
# 4 0.3 0.189 0.0284 0.227
# 5 0.4 0.288 0.0768 0.290
# 6 0.5 0.375 0.156 0.273
# 7 0.6 0.432 0.259 0.194
# 8 0.7 0.441 0.360 0.0972
# 9 0.8 0.384 0.410 0.0287
#10 0.9 0.243 0.328 0.00255
#11 1 0 0 0
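As a side note that is not part of the original answer: assuming dplyr 1.1.0 or later is installed, reframe() accepts results of any length per group, so the list and unnest steps can be skipped. The sketch below rebuilds df by hand from the counts shown earlier and groups by a hypothetical grp label through the .by argument.

```r
library(dplyr)
library(tidyr)

# Counts taken from the example output shown earlier (rebuilt by hand here)
df <- tibble(Failure = c(1L, 1L, 4L), Success = c(2L, 4L, 3L))
p_values <- seq(from = 0, to = 1, length.out = 11)

result <- df %>%
  # One label per row; each label becomes its own group below
  mutate(grp = paste0("likelihood_", row_number())) %>%
  # reframe() (dplyr >= 1.1.0) allows each group to return 11 rows
  reframe(
    p = p_values,
    likelihood = dbinom(Success, size = Failure + Success, prob = p_values),
    .by = grp
  ) %>%
  # Spread the groups back out into one column per original row
  pivot_wider(names_from = grp, values_from = likelihood)

result
```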
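I would actually switch to purrr for this. The function pmap() iterates rowwise. We use ..1 and ..2 to refer to the first and second inputs, respectively. Using pmap_dfc() binds the results by column (dfc = data frame columns).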
library(purrr)
library(tibble)
df %>%
  pmap_dfc(~ dbinom(..2, size = ..1 + ..2, prob = p_values)) %>%
  set_names(paste0("likelihood_", seq_along(.))) %>%
  add_column(p_values = p_values, .before = 1)
# A tibble: 11 x 4
p_values likelihood_1 likelihood_2 likelihood_3
<dbl> <dbl> <dbl> <dbl>
1 0 0 0 0
2 0.1 0.027 0.00045 0.0230
3 0.2 0.096 0.0064 0.115
4 0.3 0.189 0.0284 0.227
5 0.4 0.288 0.0768 0.290
6 0.5 0.375 0.156 0.273
7 0.6 0.432 0.259 0.194
8 0.7 0.441 0.360 0.0972
9 0.8 0.384 0.410 0.0287
10 0.9 0.243 0.328 0.00255
11 1 0 0 0
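Since purrr 1.0.0 the *_dfc() variants are marked as superseded, so a version that avoids pmap_dfc() may be useful. This is only a sketch under that assumption, not the original answer's code: it rebuilds df by hand from the counts shown earlier, uses plain pmap() with a function whose argument names match the column names, and names the list before converting it to a tibble.

```r
library(purrr)
library(tibble)

# Counts taken from the example output shown earlier (rebuilt by hand here)
df <- tibble(Failure = c(1L, 1L, 4L), Success = c(2L, 4L, 3L))
p_values <- seq(from = 0, to = 1, length.out = 11)

# pmap() passes the columns of df by name, one row at a time,
# and returns a list with one likelihood vector per row
lik <- pmap(df, function(Failure, Success) {
  dbinom(Success, size = Failure + Success, prob = p_values)
})
names(lik) <- paste0("likelihood_", seq_along(lik))

# A named list of equal-length vectors converts directly to a tibble
result <- add_column(as_tibble(lik), p_values = p_values, .before = 1)
result
```

Matching on column names rather than on ..1 and ..2 keeps the call correct even if the column order of df changes.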