简体   繁体   中英

R-ddply function over loop variable

I need to loop over a data frame and calculate functions over the variable that is being looped.

A table example:

    table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20), 
    char1=c(rep('a',10), rep('b',10)), 
    target=c(rep(1,10), rep(0,10)))

I create a list of variables:

nums<-colnames(table)[sapply(table, class)=='numeric']
nums<-nums[nums!='target']

And the table that I will populate:

planF<-data.frame(deciles=c(1), min=c(1), max=c(1), pos=c(1))
planF<-planF[-1,]

And the loop:

library(plyr)

for (i in 1:length(nums)){ 
table$deciles<-ntile(table[,nums[i]],5)
plan<-ddply(table, 'deciles', summarize, min=min(nums[i]),
        max=max(nums[i]),pos=sum(target))
planF<-rbind(planF,plan)
}

I need to get the min and max of the variable por each decile. But instead I get:

   deciles  min  max pos
1        1 num1 num1   4
2        2 num2 num2   4
3        3 <NA> <NA>   2
4        4 <NA> <NA>   0
5        5 <NA> <NA>   0
6        1 num1 num1   4
7        2 num2 num2   4
8        3 <NA> <NA>   2
9        4 <NA> <NA>   0
10       5 <NA> <NA>   0

For variable num1 I need to get the result of:

ddply(table, 'deciles', summarize, min=min(num1),
        max=max(num1),pos=sum(target))


  deciles      min       max pos
       1 5.736842  7.157895   0
       2 7.631579  9.052632   0
       3 1.000000 10.000000   2
       4 1.947368  3.368421   4
       5 3.842105  5.263158   4

And below the result of doing the same with num2.

I understand that I need to introduce the variable with the following form:

num1

but the code is writing

'num1'

I tried with:

min=min(as.name(nums[i]))

But I get an error:

Error in min(as.name(nums[i])) : 'type' (symbol) not valid argument

how can I calculate a function over the variable that is being looped?

The gist of your question is to apply a list of functions over the split-apply-combine method, so here is one way you can do this in base r.

## your data
table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20), 
                  char1=c(rep('a',10), rep('b',10)), 
                  target=c(rep(1,10), rep(0,10)))
nums<-colnames(table)[sapply(table, class)=='numeric']
nums<-nums[nums!='target']
table$deciles <- ntile(table[, nums[1]], 5)

FUNS <- list(min = min, max = max, mean = mean)

## split the variable num1 by deciles
## apply each function to each piece
x <- with(table, tapply(num1, deciles, function(x)
  setNames(sapply(FUNS, function(y) y(x)), names(FUNS))))

## combine results
do.call('rbind', x)
#        min       max     mean
# 1 1.000000  2.421053 1.710526
# 2 2.894737  4.315789 3.605263
# 3 4.789474  6.210526 5.500000
# 4 6.684211  8.105263 7.394737
# 5 8.578947 10.000000 9.289474

Instead of using a loop, since we have the above which works and is fairly simple, put it into a function like below

f <- function(num, data = table) {
  FUNS <- list(min = min, max = max, mean = mean)

  x <- tapply(data[, num], data[, 'deciles'], function(x)
    setNames(sapply(FUNS, function(y) y(x)), names(FUNS)))

  cbind(deciles = as.numeric(names(x)), do.call('rbind', x))
}

This way we have the method generalized so it can use any column you have with any data you have. You can call it for individual columns like

f('num1')
f('num2')

Or use a loop to get everything at once

lapply(c('num1','num2'), f)

# [[1]]
#   deciles      min       max     mean
# 1       1 1.000000  2.421053 1.710526
# 2       2 2.894737  4.315789 3.605263
# 3       3 4.789474  6.210526 5.500000
# 4       4 6.684211  8.105263 7.394737
# 5       5 8.578947 10.000000 9.289474
# 
# [[2]]
#   deciles      min      max     mean
# 1       1 20.00000 21.57895 20.78947
# 2       2 22.10526 23.68421 22.89474
# 3       3 24.21053 25.78947 25.00000
# 4       4 26.31579 27.89474 27.10526
# 5       5 28.42105 30.00000 29.21053

If you don't like lapply , you can Vectorize the function to make it a little easier:

Vectorize(f, SIMPLIFY = FALSE)(c('num1', 'num2'))

Which you would more commonly use like this ( SIMPLIFY = FALSE to retain the list structures)

v <- Vectorize(f, SIMPLIFY = FALSE)
v(c('num1','num1'))

# $num1
#   deciles      min       max     mean
# 1       1 1.000000  2.421053 1.710526
# 2       2 2.894737  4.315789 3.605263
# 3       3 4.789474  6.210526 5.500000
# 4       4 6.684211  8.105263 7.394737
# 5       5 8.578947 10.000000 9.289474
# 
# $num1
#   deciles      min       max     mean
# 1       1 1.000000  2.421053 1.710526
# 2       2 2.894737  4.315789 3.605263
# 3       3 4.789474  6.210526 5.500000
# 4       4 6.684211  8.105263 7.394737
# 5       5 8.578947 10.000000 9.289474

I would strictly prefer to use dplyr for this, even though there is some ugliness in handling string variable names in the call to summarize_ (note the trailing _ ):

library(lazyeval)
library(dplyr)

# create the data.frame
dfX = data.frame(num1=seq(1,10,len=20),
                 num2=seq(20,30,len=20),
                 char1=c(rep('a',10), rep('b',10)),
                 target=c(rep(1,10), rep(0,10))
)

# select the numeric columns
numericCols = names(dfX)[sapply(dfX, is.numeric)]
numericCols = setdiff(numericCols, "target")

# cycle over numeric columns, creating summary data.frames
liDFY = setNames(
  lapply(
    numericCols, function(x) {
      # compute the quantiles
      quantiles = quantile(dfX[[x]], probs = seq(0, 1, 0.2))

      # create quantile membership
      dfX[["quantile_membership"]] =
        findInterval(dfX[[x]], vec = quantiles,
                     rightmost.closed = TRUE,
                     all.inside = TRUE)

      # summarize variables by decile
      dfX %>%
        group_by(quantile_membership)   %>%
        summarize_(min = interp( ~ min(x_name), x_name = as.name(x)),
                   max = interp( ~ max(x_name), x_name = as.name(x)),
                   mean = interp( ~ mean(x_name), x_name = as.name(x)))
    }),
  numericCols
)

# inspect the output
liDFY[[numericCols[1]]]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM