[英]R - iteratively apply a function of a list of variables
My goal is to create a function that, when looped over multiple variables of a data frame, will return a new data frame containing the percents and 95% confidence intervals for each level of each variable. 我的目标是创建一个函数,当该函数在数据框的多个变量上循环时,将返回一个新的数据框,其中包含每个变量每个级别的百分比和95%置信区间。
As an example, if I applied this function to "cyl" and "am" from the mtcars data frame, I would want this as the final result: 例如,如果我将此功能应用于mtcars数据帧中的“ cyl”和“ am”,我希望将其作为最终结果:
variable level ci.95
1 cyl 4 34.38 (19.50, 53.11)
2 cyl 6 21.88 (10.35, 40.45)
3 cyl 8 43.75 (27.10, 61.94)
4 am 0 59.38 (40.94, 75.5)
5 am 1 40.62 (24.50, 59.06)
So, far I have function that seems to work for a single variable; 因此,到目前为止,我的功能似乎对单个变量有效; however, I have two issues that I'm hoping the community can help me with: 但是,我有两个问题希望社区能够为我提供帮助:
General R-ifying my code. 一般R修改我的代码。 I'm still an R novice. 我仍然是R新手。 I've read enough posts to know that R enthusiasts generally discourage using for
loops, but I still really struggle with using the apply functions (which seems to be the alternative to for
loops in most cases). 我已经阅读了足够多的文章,知道R爱好者通常不鼓励使用for
循环,但是我仍然在使用apply函数方面仍然感到困难(在大多数情况下,这似乎是for
循环的替代方法)。
Applying this function to a list of variables - resulting in a single data frame containing the returned values from the function for each level of each variable. 将此函数应用于变量列表-生成一个单个数据帧,其中包含该函数针对每个变量的每个级别返回的值。
Here's where I'm at with my code so far: 到目前为止,这是我处理代码的地方:
t1.props <- function(x, data = NULL) {
# Grab dataframe and/or variable name
if(!missing(data)){
var <- data[,deparse(substitute(x))]
} else {
var <- x
}
# Grab variable name for use in ouput
var.name <- substitute(x)
# Omit observations with missing data
var.clean <- na.omit(var)
# Number of nonmissing observations
n <- length(var.clean)
# Grab levels of variable
levels <- sort(unique(var.clean))
# Create an empty data frame to store values
out <- data.frame(variable = NA,
level = NA,
ci.95 = NA)
# Estimate prop, se, and ci for each level of the variable
for(i in seq_along(levels)) {
prop <- paste0("prop", i)
se <- paste0("se", i)
log.prop <- paste0("log.trans", i)
log.se <- paste0("log.se", i)
log.l <- paste0("log.l", i)
log.u <- paste0("log.u", i)
lcl <- paste0("lcl", i)
ucl <- paste0("ucl", i)
# Find the proportion for each level of the variable
assign(prop, sum(var.clean == levels[i]) / n)
# Find the standard error for each level of the variable
assign(se, sd(var.clean == levels[i]) /
sqrt(length(var.clean == levels[i])))
# Perform a logit transformation of the original percentage estimate
assign(log.prop, log(get(prop)) - log(1 - get(prop)))
# Transform the standard error of the percentage to a standard error of its
# logit transformation
assign(log.se, get(se) / (get(prop) * (1 - get(prop))))
# Calculate the lower and upper confidence bounds of the logit
# transformation
assign(log.l,
get(log.prop) -
qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
assign(log.u,
get(log.prop) +
qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
# Finally, perform inverse logit transformations to get the confidence bounds
assign(lcl, exp(get(log.l)) / (1 + exp(get(log.l))))
assign(ucl, exp(get(log.u)) / (1 + exp(get(log.u))))
# Create a combined 95% CI variable for easy copy/paste into Word tables
ci.95 <- paste0(round(get(prop) * 100, 2), " ",
"(", sprintf("%.2f", round(get(lcl) * 100, 2)), ",", " ",
round(get(ucl) * 100, 2), ")")
# Populate the "out" data frame with values
out <- rbind(out, c(as.character(var.name), levels[i], ci.95))
}
# Remove first (empty) row from out
# But only in the first iteration
if (is.na(out[1,1])) {
out <- out[-1, ]
rownames(out) <- 1:nrow(out)
}
out
}
data(mtcars)
t1.props(cyl, mtcars)
I appreciate any help or advice you have to offer. 感谢您提供的任何帮助或建议。
The nice thing about all the functions you're using is that they are already vectorized (except sd
and qt
, but you can easily vectorize them for specific arguments with Vectorize
). 关于您正在使用的所有函数的好处是它们已经被矢量化了( sd
和qt
除外,但是您可以使用Vectorize
轻松地对特定参数进行Vectorize
)。 This means you can pass vectors to them without needing to write a single loop. 这意味着您可以将向量传递给它们,而无需编写单个循环。 I left out the parts of your function that deal with preparing the input and prettying up the output. 我忽略了函数的准备输入和调整输出的部分。
t1.props <- function(var, data=mtcars) {
N <- nrow(data)
levels <- names(table(data[,var]))
count <- unclass(table(data[,var])) # counts
prop <- count / N # proportions
se <- sqrt(prop * (1-prop)/(N-1)) # standard errors of props.
lprop <- log(prop) - log(1-prop) # logged prop
lse <- se / (prop*(1-prop)) # logged se
stat <- Vectorize(qt, "df")(0.975, N-1) # tstats
llower <- lprop - stat*lse # log lower
lupper <- lprop + stat*lse # log upper
lower <- exp(llower) / (1 + exp(llower)) # lower ci
upper <- exp(lupper) / (1 + exp(lupper)) # upper ci
data.frame(variable=var,
level=levels,
perc=100*prop,
lower=100*lower,
upper=100*upper)
}
So, the only explicit applying/looping comes when you apply the function to multiple variables as follows 因此,将函数应用于多个变量时,唯一的显式应用/循环会如下
## Apply your function to two variables
do.call(rbind, lapply(c("cyl", "am"), t1.props))
# variable level perc lower upper
# 4 cyl 4 34.375 19.49961 53.11130
# 6 cyl 6 21.875 10.34883 40.44691
# 8 cyl 8 43.750 27.09672 61.94211
# 0 am 0 59.375 40.94225 75.49765
# 1 am 1 40.625 24.50235 59.05775
As far as the loop in your code, it's not like that is particularly important in terms of efficiency, but you can see how much easier code can be to read when its concise - and apply functions offer a lot of simple one-line solutions. 就代码中的循环而言,就效率而言,并不是特别重要,但您可以看到简明扼要的代码阅读起来容易得多,并且应用函数提供了许多简单的单行解决方案。
I think the most important thing to change in your code is the use of assign
and get
. 我认为更改代码中最重要的事情是使用assign
和get
。 Instead, you can store variables in lists or another data structure, and use setNames
, names<-
, or names(...) <-
to name the components when needed. 相反,您可以将变量存储在列表或其他数据结构中,并在需要时使用setNames
, names<-
或names(...) <-
来命名组件。
You can also keep the function mainly intact and use lapply
over it: 您还可以使函数主要保持完整,并对其应用lapply
:
vars <- c("cyl", "am")
lapply(vars, t1.props, data=mtcars)
[[1]]
variable level ci.95
1 cyl 4 34.38 (19.50, 53.11)
2 cyl 6 21.88 (10.35, 40.45)
3 cyl 8 43.75 (27.10, 61.94)
[[2]]
variable level ci.95
1 am 0 59.38 (40.94, 75.5)
2 am 1 40.62 (24.50, 59.06)
And combine them all into one data frame with: 并将它们全部合并为一个数据框,其中包括:
lst <- lapply(vars, t1.props, data=mtcars)
do.call(rbind,lst)
Data 数据
You must simplify the var
and var.name
assignments to: 您必须将var
和var.name
分配简化为:
t1.props <- function(x, data = NULL) {
# Grab dataframe and/or variable name
if(!missing(data)){
var <- data[,x]
} else {
var <- x
}
# Grab variable name for use in ouput
var.name <- x
# Omit observations with missing data
var.clean <- na.omit(var)
# Number of nonmissing observations
n <- length(var.clean)
# Grab levels of variable
levels <- sort(unique(var.clean))
# Create an empty data frame to store values
out <- data.frame(variable = NA,
level = NA,
ci.95 = NA)
# Estimate prop, se, and ci for each level of the variable
for(i in seq_along(levels)) {
prop <- paste0("prop", i)
se <- paste0("se", i)
log.prop <- paste0("log.trans", i)
log.se <- paste0("log.se", i)
log.l <- paste0("log.l", i)
log.u <- paste0("log.u", i)
lcl <- paste0("lcl", i)
ucl <- paste0("ucl", i)
# Find the proportion for each level of the variable
assign(prop, sum(var.clean == levels[i]) / n)
# Find the standard error for each level of the variable
assign(se, sd(var.clean == levels[i]) /
sqrt(length(var.clean == levels[i])))
# Perform a logit transformation of the original percentage estimate
assign(log.prop, log(get(prop)) - log(1 - get(prop)))
# Transform the standard error of the percentage to a standard error of its
# logit transformation
assign(log.se, get(se) / (get(prop) * (1 - get(prop))))
# Calculate the lower and upper confidence bounds of the logit
# transformation
assign(log.l,
get(log.prop) -
qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
assign(log.u,
get(log.prop) +
qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
# Finally, perform inverse logit transformations to get the confidence bounds
assign(lcl, exp(get(log.l)) / (1 + exp(get(log.l))))
assign(ucl, exp(get(log.u)) / (1 + exp(get(log.u))))
# Create a combined 95% CI variable for easy copy/paste into Word tables
ci.95 <- paste0(round(get(prop) * 100, 2), " ",
"(", sprintf("%.2f", round(get(lcl) * 100, 2)), ",", " ",
round(get(ucl) * 100, 2), ")")
# Populate the "out" data frame with values
out <- rbind(out, c(as.character(var.name), levels[i], ci.95))
}
# Remove first (empty) row from out
# But only in the first iteration
if (is.na(out[1,1])) {
out <- out[-1, ]
rownames(out) <- 1:nrow(out)
}
out
}
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.