R-迭代地應用變量列表的函數

Question

我的目標是創建一個函數，當該函數在數據框的多個變量上循環時，將返回一個新的數據框，其中包含每個變量每個級別的百分比和95％置信區間。

例如，如果我將此功能應用於mtcars數據幀中的“ cyl”和“ am”，我希望將其作為最終結果：

  variable level                ci.95
1      cyl     4 34.38 (19.50, 53.11)
2      cyl     6 21.88 (10.35, 40.45)
3      cyl     8 43.75 (27.10, 61.94)
4       am     0  59.38 (40.94, 75.5)
5       am     1 40.62 (24.50, 59.06)

因此，到目前為止，我的功能似乎對單個變量有效； 但是，我有兩個問題希望社區能夠為我提供幫助：

一般R修改我的代碼。 我仍然是R新手。 我已經閱讀了足夠多的文章，知道R愛好者通常不鼓勵使用for循環，但是我仍然在使用apply函數方面仍然感到困難（在大多數情況下，這似乎是for循環的替代方法）。
將此函數應用於變量列表-生成一個單個數據幀，其中包含該函數針對每個變量的每個級別返回的值。

到目前為止，這是我處理代碼的地方：

t1.props <- function(x, data = NULL) {

  # Grab dataframe and/or variable name
  if(!missing(data)){
    var <- data[,deparse(substitute(x))]
  } else {
    var <- x
  }

  # Grab variable name for use in ouput
  var.name <- substitute(x)

  # Omit observations with missing data
  var.clean <- na.omit(var)

  # Number of nonmissing observations
  n <- length(var.clean)

  # Grab levels of variable
  levels <- sort(unique(var.clean))

  # Create an empty data frame to store values
  out <- data.frame(variable = NA,
                    level = NA,
                    ci.95 = NA)

  # Estimate prop, se, and ci for each level of the variable
  for(i in seq_along(levels)) {
    prop <- paste0("prop", i)
    se <- paste0("se", i)
    log.prop <- paste0("log.trans", i)
    log.se <- paste0("log.se", i)
    log.l <- paste0("log.l", i)
    log.u <- paste0("log.u", i)
    lcl <- paste0("lcl", i)
    ucl <- paste0("ucl", i)

    # Find the proportion for each level of the variable
    assign(prop, sum(var.clean == levels[i]) / n)

    # Find the standard error for each level of the variable
    assign(se, sd(var.clean == levels[i]) /
             sqrt(length(var.clean == levels[i])))

    # Perform a logit transformation of the original percentage estimate
    assign(log.prop, log(get(prop)) - log(1 - get(prop)))

    # Transform the standard error of the percentage to a standard error of its
    # logit transformation
    assign(log.se, get(se) / (get(prop) * (1 - get(prop))))

    # Calculate the lower and upper confidence bounds of the logit
    # transformation
    assign(log.l,
           get(log.prop) -
           qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
    assign(log.u,
           get(log.prop) +
           qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))

    # Finally, perform inverse logit transformations to get the confidence bounds
    assign(lcl, exp(get(log.l)) / (1 + exp(get(log.l))))
    assign(ucl, exp(get(log.u)) / (1 + exp(get(log.u))))

    # Create a combined 95% CI variable for easy copy/paste into Word tables
    ci.95 <- paste0(round(get(prop) * 100, 2), " ",
                "(", sprintf("%.2f", round(get(lcl) * 100, 2)), ",", " ",
                round(get(ucl) * 100, 2), ")")

    # Populate the "out" data frame with values
    out <- rbind(out, c(as.character(var.name), levels[i], ci.95))
  }

  # Remove first (empty) row from out
  # But only in the first iteration
  if (is.na(out[1,1])) {
    out <- out[-1, ]
    rownames(out) <- 1:nrow(out)
  }
  out
}

data(mtcars)
t1.props(cyl, mtcars)

感謝您提供的任何幫助或建議。

Answer 1

關於您正在使用的所有函數的好處是它們已經被矢量化了（ sd和qt除外，但是您可以使用Vectorize輕松地對特定參數進行Vectorize ）。 這意味着您可以將向量傳遞給它們，而無需編寫單個循環。 我忽略了函數的准備輸入和調整輸出的部分。

t1.props <- function(var, data=mtcars) {
    N <- nrow(data)
    levels <- names(table(data[,var]))
    count <- unclass(table(data[,var]))        # counts
    prop <- count / N                          # proportions
    se <- sqrt(prop * (1-prop)/(N-1))          # standard errors of props.
    lprop <- log(prop) - log(1-prop)           # logged prop
    lse <- se / (prop*(1-prop))                # logged se
    stat <- Vectorize(qt, "df")(0.975, N-1)    # tstats
    llower <- lprop - stat*lse                 # log lower 
    lupper <- lprop + stat*lse                 # log upper
    lower <- exp(llower) / (1 + exp(llower))   # lower ci
    upper <- exp(lupper) / (1 + exp(lupper))   # upper ci

    data.frame(variable=var,
               level=levels,
               perc=100*prop,
               lower=100*lower,
               upper=100*upper)
}

因此，將函數應用於多個變量時，唯一的顯式應用/循環會如下

## Apply your function to two variables
do.call(rbind, lapply(c("cyl", "am"), t1.props))
#   variable level   perc    lower    upper
# 4      cyl     4 34.375 19.49961 53.11130
# 6      cyl     6 21.875 10.34883 40.44691
# 8      cyl     8 43.750 27.09672 61.94211
# 0       am     0 59.375 40.94225 75.49765
# 1       am     1 40.625 24.50235 59.05775

就代碼中的循環而言，就效率而言，並不是特別重要，但您可以看到簡明扼要的代碼閱讀起來容易得多，並且應用函數提供了許多簡單的單行解決方案。

我認為更改代碼中最重要的事情是使用assign和get 。 相反，您可以將變量存儲在列表或其他數據結構中，並在需要時使用setNames ， names<-或names(...) <-來命名組件。

Answer 2

您還可以使函數主要保持完整，並對其應用lapply ：

vars <- c("cyl", "am")
lapply(vars, t1.props, data=mtcars)
[[1]]
  variable level                ci.95
1      cyl     4 34.38 (19.50, 53.11)
2      cyl     6 21.88 (10.35, 40.45)
3      cyl     8 43.75 (27.10, 61.94)

[[2]]
  variable level                ci.95
1       am     0  59.38 (40.94, 75.5)
2       am     1 40.62 (24.50, 59.06)

並將它們全部合並為一個數據框，其中包括：

lst <- lapply(vars, t1.props, data=mtcars)
do.call(rbind,lst)

數據

您必須將var和var.name分配簡化為：

t1.props <- function(x, data = NULL) {

  # Grab dataframe and/or variable name
  if(!missing(data)){
    var <- data[,x]
  } else {
    var <- x
  }

  # Grab variable name for use in ouput
  var.name <- x

  # Omit observations with missing data
  var.clean <- na.omit(var)

  # Number of nonmissing observations
  n <- length(var.clean)

  # Grab levels of variable
  levels <- sort(unique(var.clean))

  # Create an empty data frame to store values
  out <- data.frame(variable = NA,
                    level = NA,
                    ci.95 = NA)

  # Estimate prop, se, and ci for each level of the variable
  for(i in seq_along(levels)) {
    prop <- paste0("prop", i)
    se <- paste0("se", i)
    log.prop <- paste0("log.trans", i)
    log.se <- paste0("log.se", i)
    log.l <- paste0("log.l", i)
    log.u <- paste0("log.u", i)
    lcl <- paste0("lcl", i)
    ucl <- paste0("ucl", i)

    # Find the proportion for each level of the variable
    assign(prop, sum(var.clean == levels[i]) / n)

    # Find the standard error for each level of the variable
    assign(se, sd(var.clean == levels[i]) /
             sqrt(length(var.clean == levels[i])))

    # Perform a logit transformation of the original percentage estimate
    assign(log.prop, log(get(prop)) - log(1 - get(prop)))

    # Transform the standard error of the percentage to a standard error of its
    # logit transformation
    assign(log.se, get(se) / (get(prop) * (1 - get(prop))))

    # Calculate the lower and upper confidence bounds of the logit
    # transformation
    assign(log.l,
           get(log.prop) -
             qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))
    assign(log.u,
           get(log.prop) +
             qt(.975, (length(var.clean == levels[i]) - 1)) * get(log.se))

    # Finally, perform inverse logit transformations to get the confidence bounds
    assign(lcl, exp(get(log.l)) / (1 + exp(get(log.l))))
    assign(ucl, exp(get(log.u)) / (1 + exp(get(log.u))))

    # Create a combined 95% CI variable for easy copy/paste into Word tables
    ci.95 <- paste0(round(get(prop) * 100, 2), " ",
                    "(", sprintf("%.2f", round(get(lcl) * 100, 2)), ",", " ",
                    round(get(ucl) * 100, 2), ")")

    # Populate the "out" data frame with values
    out <- rbind(out, c(as.character(var.name), levels[i], ci.95))
  }

  # Remove first (empty) row from out
  # But only in the first iteration
  if (is.na(out[1,1])) {
    out <- out[-1, ]
    rownames(out) <- 1:nrow(out)
  }
  out
}

R-迭代地應用變量列表的函數

問題描述

2 個解決方案

解決方案1
0 2015-08-04 06:48:04

解決方案2
0 已采納 2015-08-04 07:07:48

R-迭代地應用變量列表的函數

問題描述

2 個解決方案

解決方案1 0 2015-08-04 06:48:04

解決方案2 0 已采納 2015-08-04 07:07:48

解決方案1
0 2015-08-04 06:48:04

解決方案2
0 已采納 2015-08-04 07:07:48