R - ddply function over a looping variable

Question

I need to loop over the columns of a data frame and, for each one, calculate functions over the variable that is currently being looped.

An example table:

    table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20), 
    char1=c(rep('a',10), rep('b',10)), 
    target=c(rep(1,10), rep(0,10)))

I create a list of variables:

nums<-colnames(table)[sapply(table, class)=='numeric']
nums<-nums[nums!='target']

And the table that I will populate:

planF<-data.frame(deciles=c(1), min=c(1), max=c(1), pos=c(1))
planF<-planF[-1,]

And the loop:

library(plyr)

for (i in 1:length(nums)){ 
table$deciles<-ntile(table[,nums[i]],5)
plan<-ddply(table, 'deciles', summarize, min=min(nums[i]),
        max=max(nums[i]),pos=sum(target))
planF<-rbind(planF,plan)
}

I need to get the min and max of the variable for each decile. But instead I get:

   deciles  min  max pos
1        1 num1 num1   4
2        2 num2 num2   4
3        3 <NA> <NA>   2
4        4 <NA> <NA>   0
5        5 <NA> <NA>   0
6        1 num1 num1   4
7        2 num2 num2   4
8        3 <NA> <NA>   2
9        4 <NA> <NA>   0
10       5 <NA> <NA>   0

For variable num1 I need to get the result of:

ddply(table, 'deciles', summarize, min=min(num1),
        max=max(num1),pos=sum(target))


  deciles      min       max pos
       1 5.736842  7.157895   0
       2 7.631579  9.052632   0
       3 1.000000 10.000000   2
       4 1.947368  3.368421   4
       5 3.842105  5.263158   4

And below it, the result of doing the same with num2.

I understand that I need to pass the variable in the following form:

num1 num1

but the code is writing

'num1' 'num1'

I tried:

min=min(as.name(nums[i]))

But I get an error:

Error in min(as.name(nums[i])) : 'type' (symbol) not valid argument

How can I calculate a function over the variable that is being looped?

Answer 1

The gist of your question is applying a list of functions in a split-apply-combine workflow, so here is one way to do this in base R.

## your data
library(dplyr)  # provides ntile()

table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20), 
                  char1=c(rep('a',10), rep('b',10)), 
                  target=c(rep(1,10), rep(0,10)))
nums<-colnames(table)[sapply(table, class)=='numeric']
nums<-nums[nums!='target']
## bin the rows into 5 groups by the first numeric column
table$deciles <- ntile(table[, nums[1]], 5)

FUNS <- list(min = min, max = max, mean = mean)

## split the variable num1 by deciles
## apply each function to each piece
x <- with(table, tapply(num1, deciles, function(x)
  setNames(sapply(FUNS, function(y) y(x)), names(FUNS))))

## combine results
do.call('rbind', x)
#        min       max     mean
# 1 1.000000  2.421053 1.710526
# 2 2.894737  4.315789 3.605263
# 3 4.789474  6.210526 5.500000
# 4 6.684211  8.105263 7.394737
# 5 8.578947 10.000000 9.289474
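The same three steps can also be written out one at a time, which may make the split-apply-combine idea easier to see. This is only a sketch: the `ceiling(rank(...))` line is my base-R stand-in for `ntile(x, 5)` (an assumption, so the snippet runs without dplyr), and `pieces`, `applied` and `res` are illustrative names.

```r
# Same example data as above
table <- data.frame(num1 = seq(1, 10, len = 20), num2 = seq(20, 30, len = 20),
                    char1 = c(rep('a', 10), rep('b', 10)),
                    target = c(rep(1, 10), rep(0, 10)))

# Assumption: stand-in for ntile(x, 5); it gives five equal-sized rank groups
# here because 20 rows divide evenly by 5, but may differ from ntile() otherwise
table$deciles <- ceiling(5 * rank(table$num1, ties.method = "first") / nrow(table))

FUNS <- list(min = min, max = max, mean = mean)

pieces  <- split(table$num1, table$deciles)                            # split
applied <- lapply(pieces, function(x) sapply(FUNS, function(f) f(x)))  # apply
res     <- do.call(rbind, applied)                                     # combine
res
```

`res` is a matrix with one row per group and one column per function in `FUNS`, the same shape as the `tapply` result above.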

Since the above works and is fairly simple, you can skip the loop and wrap it in a function like the one below:

f <- function(num, data = table) {
  FUNS <- list(min = min, max = max, mean = mean)

  x <- tapply(data[, num], data[, 'deciles'], function(x)
    setNames(sapply(FUNS, function(y) y(x)), names(FUNS)))

  cbind(deciles = as.numeric(names(x)), do.call('rbind', x))
}

This generalizes the method so it works with any column of any data frame that has a deciles column. You can call it for individual columns like

f('num1')
f('num2')

Or use a loop to get everything at once

lapply(c('num1','num2'), f)

# [[1]]
#   deciles      min       max     mean
# 1       1 1.000000  2.421053 1.710526
# 2       2 2.894737  4.315789 3.605263
# 3       3 4.789474  6.210526 5.500000
# 4       4 6.684211  8.105263 7.394737
# 5       5 8.578947 10.000000 9.289474
# 
# [[2]]
#   deciles      min      max     mean
# 1       1 20.00000 21.57895 20.78947
# 2       2 22.10526 23.68421 22.89474
# 3       3 24.21053 25.78947 25.00000
# 4       4 26.31579 27.89474 27.10526
# 5       5 28.42105 30.00000 29.21053
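If you want a single table like the planF in the question rather than a list, the pieces can be stacked with a column naming the variable. Note that f uses whatever deciles column the data already has, so the sketch below recomputes the bins for each variable before calling it. `bins5` is a hypothetical helper (my base-R stand-in for `ntile(x, 5)`), and the `variable` column is my addition.

```r
table <- data.frame(num1 = seq(1, 10, len = 20), num2 = seq(20, 30, len = 20),
                    char1 = c(rep('a', 10), rep('b', 10)),
                    target = c(rep(1, 10), rep(0, 10)))

# Hypothetical helper: stand-in for ntile(x, 5); matches it when the number
# of rows divides evenly by 5, which is the case here
bins5 <- function(x) ceiling(5 * rank(x, ties.method = "first") / length(x))

# f() as defined above, repeated so the snippet runs on its own
f <- function(num, data = table) {
  FUNS <- list(min = min, max = max, mean = mean)
  x <- tapply(data[, num], data[, 'deciles'], function(x)
    setNames(sapply(FUNS, function(y) y(x)), names(FUNS)))
  cbind(deciles = as.numeric(names(x)), do.call('rbind', x))
}

res <- lapply(c('num1', 'num2'), function(v) {
  d <- table
  d$deciles <- bins5(d[[v]])            # bin by the variable being summarized
  data.frame(variable = v, f(v, data = d), row.names = NULL)
})

planF <- do.call(rbind, res)            # one data frame, 5 rows per variable
planF
```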

If you don't like lapply, you can Vectorize the function to make it a little easier:

Vectorize(f, SIMPLIFY = FALSE)(c('num1', 'num2'))

Which you would more commonly use like this (SIMPLIFY = FALSE retains the list structure):

v <- Vectorize(f, SIMPLIFY = FALSE)
v(c('num1','num2'))

# $num1
#   deciles      min       max     mean
# 1       1 1.000000  2.421053 1.710526
# 2       2 2.894737  4.315789 3.605263
# 3       3 4.789474  6.210526 5.500000
# 4       4 6.684211  8.105263 7.394737
# 5       5 8.578947 10.000000 9.289474
# 
# $num2
#   deciles      min      max     mean
# 1       1 20.00000 21.57895 20.78947
# 2       2 22.10526 23.68421 22.89474
# 3       3 24.21053 25.78947 25.00000
# 4       4 26.31579 27.89474 27.10526
# 5       5 28.42105 30.00000 29.21053
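If you would rather keep the original plyr loop, one option is to give ddply an anonymous function instead of summarize. Inside summarize, `min(nums[i])` is just `min('num1')`, which returns the string itself; indexing each piece with `d[[v]]` looks the column up by name instead. A minimal sketch, assuming both plyr and dplyr are installed; the `variable` column and the namespace-qualified calls (used so the two packages do not mask each other) are my additions.

```r
table <- data.frame(num1 = seq(1, 10, len = 20), num2 = seq(20, 30, len = 20),
                    char1 = c(rep('a', 10), rep('b', 10)),
                    target = c(rep(1, 10), rep(0, 10)))

nums <- setdiff(names(table)[sapply(table, is.numeric)], 'target')

planF <- NULL
for (v in nums) {
  # bin the rows into 5 groups by the current variable
  table$deciles <- dplyr::ntile(table[[v]], 5)

  # d is the subset of rows for one group; d[[v]] is the looped column
  plan <- plyr::ddply(table, 'deciles', function(d)
    data.frame(variable = v,
               min = min(d[[v]]),
               max = max(d[[v]]),
               pos = sum(d$target)))

  planF <- rbind(planF, plan)
}
planF
```

Starting from `planF <- NULL` also avoids building a one-row placeholder data frame and deleting the row afterwards.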

Answer 2

I would strongly prefer dplyr for this, even though handling string variable names in the call to summarize_ (note the trailing _) is somewhat ugly:

library(lazyeval)
library(dplyr)

# create the data.frame
dfX = data.frame(num1=seq(1,10,len=20),
                 num2=seq(20,30,len=20),
                 char1=c(rep('a',10), rep('b',10)),
                 target=c(rep(1,10), rep(0,10))
)

# select the numeric columns
numericCols = names(dfX)[sapply(dfX, is.numeric)]
numericCols = setdiff(numericCols, "target")

# cycle over numeric columns, creating summary data.frames
liDFY = setNames(
  lapply(
    numericCols, function(x) {
      # compute the quantiles
      quantiles = quantile(dfX[[x]], probs = seq(0, 1, 0.2))

      # create quantile membership
      dfX[["quantile_membership"]] =
        findInterval(dfX[[x]], vec = quantiles,
                     rightmost.closed = TRUE,
                     all.inside = TRUE)

      # summarize variables by decile
      dfX %>%
        group_by(quantile_membership)   %>%
        summarize_(min = interp( ~ min(x_name), x_name = as.name(x)),
                   max = interp( ~ max(x_name), x_name = as.name(x)),
                   mean = interp( ~ mean(x_name), x_name = as.name(x)))
    }),
  numericCols
)

# inspect the output
liDFY[[numericCols[1]]]
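In current dplyr releases the underscored verbs such as summarize_ are deprecated, and a column held in a string can be reached with the `.data` pronoun instead of lazyeval. The following is a sketch of the same idea under that assumption; it also swaps the quantile/findInterval binning for ntile (the function used in the question), so group boundaries may differ from the code above on other data.

```r
library(dplyr)

dfX <- data.frame(num1 = seq(1, 10, len = 20),
                  num2 = seq(20, 30, len = 20),
                  char1 = c(rep('a', 10), rep('b', 10)),
                  target = c(rep(1, 10), rep(0, 10)))

numericCols <- setdiff(names(dfX)[sapply(dfX, is.numeric)], "target")

liDFY <- setNames(
  lapply(numericCols, function(x) {
    dfX %>%
      # .data[[x]] looks up the column whose name is stored in x
      mutate(quantile_membership = ntile(.data[[x]], 5)) %>%
      group_by(quantile_membership) %>%
      summarise(min  = min(.data[[x]]),
                max  = max(.data[[x]]),
                mean = mean(.data[[x]]))
  }),
  numericCols
)

# inspect the output for the first numeric column
liDFY[[numericCols[1]]]
```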

2 answers

Answer 1: 1, 2015-11-09 15:22:21

Answer 2: 0, accepted, 2015-11-09 14:20:53
