R: group by with custom functions

I have managed to aggregate data successfully using the following pattern:

newdf <- setDT(df)[, list(X=sum(x),Y=max(y)), by=Z]

However, as soon as I try anything more complicated, the code still runs but no longer aggregates by Z: the result has the same number of rows as the original df, so no grouping is actually happening.

The custom function I would like to apply finds the n-quantile of the current group's values and then does some other stuff with it. I saw .SDcols used in another SO answer and tried something like:

customfunc <- function(dt){
  q = unname(quantile(dt$column,0.25))
  n = nrow(dt[dt$column <= q])
  return(n/dt$someOtherColumn)
}
#fails to group anything!!! also rather slow...
newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]

Can someone please help me figure out what is wrong with the way I'm trying to use group by and custom functions? Thank you very much.

Literal example as requested:

> df <- data.frame(Z=c("abc","abc","def","abc"), column=c(1,2,3,4), someOtherColumn=c(5,6,7,8))
> df
    Z column someOtherColumn
1 abc      1               5
2 abc      2               6
3 def      3               7
4 abc      4               8
> newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]
> newdf
     Z        V1
1: abc 0.2000000
2: abc 0.1666667
3: abc 0.1250000
4: def 0.1428571
> 

As you can see, the result is not collapsed per group. There should be just two rows, one for "abc" and another for "def", since I am trying to group by Z.

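As eddi pointed out above, the basic problem is assuming that the custom function is called inside a loop over rows, so that 'dt$column' would mysteriously give you the 'current value at the current row'. It does not. With by=Z the function is called once per group and is passed that group's rows as a data table, so 'dt$column' is a whole vector. That makes n/dt$someOtherColumn a vector too, with one element per row in the group, and data.table returns one output row for each element. This is why the result has as many rows as the input.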
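To make that concrete, here is a minimal sketch using the same example data. It assumes the data.table package is installed; the names group_sizes, per_row and per_group are only illustrative. It shows that the j expression sees a whole vector per group, and that the length of whatever j returns decides how many rows each group contributes.

library(data.table)

df <- data.frame(Z = c("abc", "abc", "def", "abc"),
                 column = c(1, 2, 3, 4),
                 someOtherColumn = c(5, 6, 7, 8))
dt <- as.data.table(df)

# .SD is the subset of rows for the current group, so .SD$column
# is a vector with one element per row of that group.
group_sizes <- dt[, .(rows_seen = length(.SD$column)), by = Z,
                  .SDcols = c("column", "someOtherColumn")]
print(group_sizes)  # one row per group: abc sees 3 values, def sees 1

# A j expression returning a vector of length k yields k rows for that group.
per_row   <- dt[, .(value = 1 / someOtherColumn), by = Z]  # 4 rows in total
per_group <- dt[, .(value = 1 / .N), by = Z]               # 2 rows in total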

So the fix is to make the return statement produce a single value per group. Example:

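customfunc <- function(dt){
  # 25th percentile of 'column' within the current group
  q = unname(quantile(dt$column,0.25))
  # number of rows in the group at or below that quantile
  n = nrow(dt[dt$column <= q])
  # length() is a single number, so each group returns exactly one value
  return(n/length(dt$someOtherColumn))
}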

> df <- data.frame(Z=c("abc","abc","def","abc"), column=c(1,2,3,4), someOtherColumn=c(5,6,7,8))
> df
    Z column someOtherColumn
1 abc      1               5
2 abc      2               6
3 def      3               7
4 abc      4               8
> newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]
> newdf
     Z        V1
1: abc 0.3333333
2: def 1.0000000

Now the data is aggregated correctly.
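For what it's worth, the same aggregate can probably be written without passing .SD to a function at all, by referring to the columns directly in j. This is only a sketch under the same example data, not a benchmarked recommendation: result is an illustrative name, .N is data.table's per-group row count, and names = FALSE just drops the "25%" label from quantile(). Skipping .SD may also help with the slowness mentioned in the question, though that is not measured here.

library(data.table)

df <- data.frame(Z = c("abc", "abc", "def", "abc"),
                 column = c(1, 2, 3, 4),
                 someOtherColumn = c(5, 6, 7, 8))
dt <- as.data.table(df)

# Share of rows in each group at or below the group's 25th percentile.
# sum() over a logical vector counts the TRUEs; .N is the group size.
result <- dt[, .(V1 = sum(column <= quantile(column, 0.25, names = FALSE)) / .N),
             by = Z]
print(result)  # two rows: abc 0.3333333, def 1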
