简体   繁体   English

R-迭代地应用变量列表的函数

[英]R - iteratively apply a function of a list of variables

My goal is to create a function that, when looped over multiple variables of a data frame, will return a new data frame containing the percents and 95% confidence intervals for each level of each variable. 我的目标是创建一个函数,当该函数在数据框的多个变量上循环时,将返回一个新的数据框,其中包含每个变量每个级别的百分比和95%置信区间。

As an example, if I applied this function to "cyl" and "am" from the mtcars data frame, I would want this as the final result: 例如,如果我将此功能应用于mtcars数据帧中的“ cyl”和“ am”,我希望将其作为最终结果:

  variable level                ci.95
1      cyl     4 34.38 (19.50, 53.11)
2      cyl     6 21.88 (10.35, 40.45)
3      cyl     8 43.75 (27.10, 61.94)
4       am     0  59.38 (40.94, 75.5)
5       am     1 40.62 (24.50, 59.06) 

So, far I have function that seems to work for a single variable; 因此,到目前为止,我的功能似乎对单个变量有效; however, I have two issues that I'm hoping the community can help me with: 但是,我有两个问题希望社区能够为我提供帮助:

  1. General R-ifying my code. 一般R修改我的代码。 I'm still an R novice. 我仍然是R新手。 I've read enough posts to know that R enthusiasts generally discourage using for loops, but I still really struggle with using the apply functions (which seems to be the alternative to for loops in most cases). 我已经阅读了足够多的文章,知道R爱好者通常不鼓励使用for循环,但是我仍然在使用apply函数方面仍然感到困难(在大多数情况下,这似乎是for循环的替代方法)。

  2. Applying this function to a list of variables - resulting in a single data frame containing the returned values from the function for each level of each variable. 将此函数应用于变量列表-生成一个单个数据帧,其中包含该函数针对每个变量的每个级别返回的值。

Here's where I'm at with my code so far: 到目前为止,这是我处理代码的地方:

t1.props <- function(x, data = NULL) {

  # Grab dataframe and/or variable name
  if(!missing(data)){
    var <- data[,deparse(substitute(x))]
  } else {
    var <- x
  }

  # Grab variable name for use in ouput
  var.name <- substitute(x)

  # Omit observations with missing data
  var.clean <- na.omit(var)

  # Number of nonmissing observations
  n <- length(var.clean)

  # Grab levels of variable
  levels <- sort(unique(var.clean))

  # Create an empty data frame to store values
  out <- data.frame(variable = NA,
                    level = NA,
                    ci.95 = NA)

  # Estimate prop, se, and ci for each level of the variable
  for(i in seq_along(levels)) {
    prop <- paste0("prop", i)
    se <- paste0("se", i)
    log.prop <- paste0("log.trans", i)
    log.se <- paste0("log.se", i)
    log.l <- paste0("log.l", i)
    log.u <- paste0("log.u", i)
    lcl <- paste0("lcl", i)
    ucl <- paste0("ucl", i)

    # Find the proportion for each level of the variable
    assign(prop, sum(var.clean == levels[i]) / n)

    # Find the standard error for each level of the variable
    assign(se, sd(var.clean == levels[i]) /
             sqrt(length(var.clean == levels[i])))

    # Perform a logit transformation of the original percentage estimate
    assign(log.prop, log(get(prop)) - log(1 - get(prop)))

    # Transform the standard error of the percentage to a standard error of its
    # logit transformation
    assign(log.se, get(se) / (get(prop) * (1 - get(prop))))

    # Calculate the lower and upper confidence bounds of the logit
    # transformation
    assign(log.l,
           get(log.prop) -
           qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
    assign(log.u,
           get(log.prop) +
           qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))

    # Finally, perform inverse logit transformations to get the confidence bounds
    assign(lcl, exp(get(log.l)) / (1 + exp(get(log.l))))
    assign(ucl, exp(get(log.u)) / (1 + exp(get(log.u))))

    # Create a combined 95% CI variable for easy copy/paste into Word tables
    ci.95 <- paste0(round(get(prop) * 100, 2), " ",
                "(", sprintf("%.2f", round(get(lcl) * 100, 2)), ",", " ",
                round(get(ucl) * 100, 2), ")")

    # Populate the "out" data frame with values
    out <- rbind(out, c(as.character(var.name), levels[i], ci.95))
  }

  # Remove first (empty) row from out
  # But only in the first iteration
  if (is.na(out[1,1])) {
    out <- out[-1, ]
    rownames(out) <- 1:nrow(out)
  }
  out
}

data(mtcars)
t1.props(cyl, mtcars)

I appreciate any help or advice you have to offer. 感谢您提供的任何帮助或建议。

The nice thing about all the functions you're using is that they are already vectorized (except sd and qt , but you can easily vectorize them for specific arguments with Vectorize ). 关于您正在使用的所有函数的好处是它们已经被矢量化了( sdqt除外,但是您可以使用Vectorize轻松地对特定参数进行Vectorize )。 This means you can pass vectors to them without needing to write a single loop. 这意味着您可以将向量传递给它们,而无需编写单个循环。 I left out the parts of your function that deal with preparing the input and prettying up the output. 我忽略了函数的准备输入和调整输出的部分。

t1.props <- function(var, data=mtcars) {
    N <- nrow(data)
    levels <- names(table(data[,var]))
    count <- unclass(table(data[,var]))        # counts
    prop <- count / N                          # proportions
    se <- sqrt(prop * (1-prop)/(N-1))          # standard errors of props.
    lprop <- log(prop) - log(1-prop)           # logged prop
    lse <- se / (prop*(1-prop))                # logged se
    stat <- Vectorize(qt, "df")(0.975, N-1)    # tstats
    llower <- lprop - stat*lse                 # log lower 
    lupper <- lprop + stat*lse                 # log upper
    lower <- exp(llower) / (1 + exp(llower))   # lower ci
    upper <- exp(lupper) / (1 + exp(lupper))   # upper ci

    data.frame(variable=var,
               level=levels,
               perc=100*prop,
               lower=100*lower,
               upper=100*upper)
}

So, the only explicit applying/looping comes when you apply the function to multiple variables as follows 因此,将函数应用于多个变量时,唯一的显式应用/循环会如下

## Apply your function to two variables
do.call(rbind, lapply(c("cyl", "am"), t1.props))
#   variable level   perc    lower    upper
# 4      cyl     4 34.375 19.49961 53.11130
# 6      cyl     6 21.875 10.34883 40.44691
# 8      cyl     8 43.750 27.09672 61.94211
# 0       am     0 59.375 40.94225 75.49765
# 1       am     1 40.625 24.50235 59.05775

As far as the loop in your code, it's not like that is particularly important in terms of efficiency, but you can see how much easier code can be to read when its concise - and apply functions offer a lot of simple one-line solutions. 就代码中的循环而言,就效率而言,并不是特别重要,但您可以看到简明扼要的代码阅读起来容易得多,并且应用函数提供了许多简单的单行解决方案。

I think the most important thing to change in your code is the use of assign and get . 我认为更改代码中最重要的事情是使用assignget Instead, you can store variables in lists or another data structure, and use setNames , names<- , or names(...) <- to name the components when needed. 相反,您可以将变量存储在列表或其他数据结构中,并在需要时使用setNamesnames<-names(...) <-来命名组件。

You can also keep the function mainly intact and use lapply over it: 您还可以使函数主要保持完整,并对其应用lapply

vars <- c("cyl", "am")
lapply(vars, t1.props, data=mtcars)
[[1]]
  variable level                ci.95
1      cyl     4 34.38 (19.50, 53.11)
2      cyl     6 21.88 (10.35, 40.45)
3      cyl     8 43.75 (27.10, 61.94)

[[2]]
  variable level                ci.95
1       am     0  59.38 (40.94, 75.5)
2       am     1 40.62 (24.50, 59.06)

And combine them all into one data frame with: 并将它们全部合并为一个数据框,其中包括:

lst <- lapply(vars, t1.props, data=mtcars)
do.call(rbind,lst)

Data 数据

You must simplify the var and var.name assignments to: 您必须将varvar.name分配简化为:

t1.props <- function(x, data = NULL) {

  # Grab dataframe and/or variable name
  if(!missing(data)){
    var <- data[,x]
  } else {
    var <- x
  }

  # Grab variable name for use in ouput
  var.name <- x

  # Omit observations with missing data
  var.clean <- na.omit(var)

  # Number of nonmissing observations
  n <- length(var.clean)

  # Grab levels of variable
  levels <- sort(unique(var.clean))

  # Create an empty data frame to store values
  out <- data.frame(variable = NA,
                    level = NA,
                    ci.95 = NA)

  # Estimate prop, se, and ci for each level of the variable
  for(i in seq_along(levels)) {
    prop <- paste0("prop", i)
    se <- paste0("se", i)
    log.prop <- paste0("log.trans", i)
    log.se <- paste0("log.se", i)
    log.l <- paste0("log.l", i)
    log.u <- paste0("log.u", i)
    lcl <- paste0("lcl", i)
    ucl <- paste0("ucl", i)

    # Find the proportion for each level of the variable
    assign(prop, sum(var.clean == levels[i]) / n)

    # Find the standard error for each level of the variable
    assign(se, sd(var.clean == levels[i]) /
             sqrt(length(var.clean == levels[i])))

    # Perform a logit transformation of the original percentage estimate
    assign(log.prop, log(get(prop)) - log(1 - get(prop)))

    # Transform the standard error of the percentage to a standard error of its
    # logit transformation
    assign(log.se, get(se) / (get(prop) * (1 - get(prop))))

    # Calculate the lower and upper confidence bounds of the logit
    # transformation
    assign(log.l,
           get(log.prop) -
             qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
    assign(log.u,
           get(log.prop) +
             qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))

    # Finally, perform inverse logit transformations to get the confidence bounds
    assign(lcl, exp(get(log.l)) / (1 + exp(get(log.l))))
    assign(ucl, exp(get(log.u)) / (1 + exp(get(log.u))))

    # Create a combined 95% CI variable for easy copy/paste into Word tables
    ci.95 <- paste0(round(get(prop) * 100, 2), " ",
                    "(", sprintf("%.2f", round(get(lcl) * 100, 2)), ",", " ",
                    round(get(ucl) * 100, 2), ")")

    # Populate the "out" data frame with values
    out <- rbind(out, c(as.character(var.name), levels[i], ci.95))
  }

  # Remove first (empty) row from out
  # But only in the first iteration
  if (is.na(out[1,1])) {
    out <- out[-1, ]
    rownames(out) <- 1:nrow(out)
  }
  out
}

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM