
R: group by with custom functions

I have managed to aggregate data successfully using the following pattern:

newdf <- setDT(df)[, list(X=sum(x),Y=max(y)), by=Z]

However, as soon as I try anything more complicated, the code still runs but no longer aggregates by Z: the result has the same number of observations as the original df, so I know that no grouping is actually occurring.

The custom function I would like to apply finds the n-quantile of the current list of values and then does some further computation with it. I saw .SDcols used in another SO answer and tried something like:

customfunc <- function(dt){
  q = unname(quantile(dt$column,0.25))
  n = nrow(dt[dt$column <= q])
  return(n/dt$someOtherColumn)
}
#fails to group anything!!! also rather slow...
newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]

Can someone please help me figure out what is wrong with the way I am using group by with custom functions? Thank you very much.

Literal example, as requested:

> df <- data.frame(Z=c("abc","abc","def","abc"), column=c(1,2,3,4), someOtherColumn=c(5,6,7,8))
> df
    Z column someOtherColumn
1 abc      1               5
2 abc      2               6
3 def      3               7
4 abc      4               8
> newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]
> newdf
     Z        V1
1: abc 0.2000000
2: abc 0.1666667
3: abc 0.1250000
4: def 0.1428571
> 

As you can see, the result is not grouped. There should be just two rows, one for "abc" and one for "def", since I am trying to group by Z.

As eddi's point above suggests, the basic problem is assuming that your custom function is called inside a loop and that 'dt$column' will somehow give you the current value at the current row. Instead it gives you the entire column (a vector). The function is passed the whole subset of the data table for each group (.SD), not one row at a time. Because the original return statement divides by that vector, it returns one value per row of the group, and every one of those values becomes a row in the result.
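To make the behaviour concrete, here is a minimal sketch using the same data as the question. It assumes data.table is installed; the variable names per_element, per_group and rows_seen are only illustrative. It compares a j expression that returns a vector with one that returns a single value, and checks how many rows .SD holds in each group.

```r
library(data.table)

dt <- data.table(Z = c("abc", "abc", "def", "abc"),
                 column = c(1, 2, 3, 4),
                 someOtherColumn = c(5, 6, 7, 8))

# j returns a vector as long as the group: one output row per element
per_element <- dt[, 1 / someOtherColumn, by = Z]
nrow(per_element)

# j returns a single value: one output row per group
per_group <- dt[, 1 / length(someOtherColumn), by = Z]
nrow(per_group)

# .SD contains only the rows of the current group, not a single row
rows_seen <- dt[, .(rows = nrow(.SD)), by = Z]
rows_seen
```

The first call should give as many rows as the input, the second one row per distinct Z, which matches the difference between the two versions of customfunc.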

So replacing the expression in the return statement with something that evaluates to a single value works. Example:

customfunc <- function(dt){
  q = unname(quantile(dt$column,0.25))
  n = nrow(dt[dt$column <= q])
  return(n/length(dt$someOtherColumn))
}

> df <- data.frame(Z=c("abc","abc","def","abc"), column=c(1,2,3,4), someOtherColumn=c(5,6,7,8))
> df
    Z column someOtherColumn
1 abc      1               5
2 abc      2               6
3 def      3               7
4 abc      4               8
> newdf <- setDT(df)[, customfunc(.SD), by=Z, .SDcols=c("column", "someOtherColumn")]
> newdf
     Z        V1
1: abc 0.3333333
2: def 1.0000000

Now the data is aggregated correctly: one row per value of Z.
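As a possible variation, the same result can be computed without passing .SD at all, by giving the function the column vector directly. This is only a sketch: share_below_q1 and the output column name share are hypothetical names, and it relies on length(dt$someOtherColumn) being equal to the number of rows in the group, so dividing by length(x) gives the same number. Referring to columns directly instead of building .SD may also help with the slowness mentioned in the question, though that is not measured here.

```r
library(data.table)

df <- data.frame(Z = c("abc", "abc", "def", "abc"),
                 column = c(1, 2, 3, 4),
                 someOtherColumn = c(5, 6, 7, 8))

# Hypothetical helper: share of values at or below the 25% quantile.
# It takes a plain vector and always returns a single number.
share_below_q1 <- function(x) {
  q <- quantile(x, 0.25, names = FALSE)
  sum(x <= q) / length(x)
}

# One named output column, one row per group
newdf <- setDT(df)[, .(share = share_below_q1(column)), by = Z]
newdf
```

Wrapping the result in .( ) also names the output column, instead of leaving it as the default V1.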
