Using column numbers rather than names for grouping in data.table in R

My code needs to be flexible, so I cannot hard-code column names when I group. Instead, I want to specify the grouping columns by number, since a range of positions is easy to adjust (columns 1 through X, rather than the names of columns 1, 2, ..., X).

Example data set:

set.seed(007)
DF <- data.frame(X=1:20, Y=sample(c(0,1), 20, TRUE), Z=sample(0:5, 20, TRUE), Q=sample(0:5, 20, TRUE))

> DF
    X Y Z Q
1   1 1 3 4
2   2 0 1 2
3   3 0 5 4
4   4 0 5 2
5   5 0 5 5
6   6 1 0 1
7   7 0 3 0
8   8 1 2 4
9   9 0 5 5
10 10 0 2 5
11 11 0 4 3
12 12 0 1 4
13 13 1 1 4
14 14 0 1 3
15 15 0 2 4
16 16 0 5 2
17 17 1 2 0
18 18 0 4 1
19 19 1 5 2
20 20 0 2 1

A grouping by Z and Q that, for each group, finds the maximum of Y and the X at which it occurs, and returns both:

    library(data.table)
    DF <- data.table(DF)
    DF[, list(Y=max(Y),X=X[which.max(Y)]), by=list(Z, Q)]

Result:

    > DF[, list(Y=max(Y),X=X[which.max(Y)]), by=list(Z, Q)]
    Z Q Y  X
 1: 3 4 1  1
 2: 1 2 0  2
 3: 5 4 0  3
 4: 5 2 1 19
 5: 5 5 0  5
 6: 0 1 1  6
 7: 3 0 0  7
 8: 2 4 1  8
 9: 2 5 0 10
10: 4 3 0 11
11: 1 4 1 13
12: 1 3 0 14
13: 2 0 1 17
14: 4 1 0 18
15: 2 1 0 20

I want to do this purely with column numbers, because of the nature of my code. Additionally, if there were another column, I might want to group by it as well, and I might also want to return a second argmax column in the first part.

Maybe just look up names(DF) by column number and combine that with eval(parse(...))?

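# a: column numbers of (column to maximise, column to report at the maximum)
# b: column numbers to group by
useColNums <- function(data, a, b) {
  n <- names(data)
  y <- n[a[1]]   # column whose maximum is taken
  x <- n[a[2]]   # column reported at the position of that maximum
  # Build the by= and j= expressions as text, then parse and evaluate them
  groupby <- sprintf("list(%s)", paste(n[b], collapse=","))
  argmax <- sprintf("list(%1$s=max(%1$s),%2$s=%2$s[which.max(%1$s)])", y, x)
  data[, eval(parse(text=argmax)), by=eval(parse(text=groupby))]
}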

x <- useColNums(DF, 2:1, 3:4)
y <- DF[, list(Y=max(Y),X=X[which.max(Y)]), by=list(Z, Q)]
identical(x, y)
# [1] TRUE
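As a possible alternative that avoids building code from strings, here is a minimal sketch that relies on two documented data.table features: by= accepts a character vector of column names, and .SDcols accepts column positions. The variable names (DT, group_names, pos, res) are illustrative, and the output names Y and X are still written out by hand here.

```r
library(data.table)

set.seed(7)
DT <- data.table(X = 1:20, Y = sample(c(0, 1), 20, TRUE),
                 Z = sample(0:5, 20, TRUE), Q = sample(0:5, 20, TRUE))

# Translate the grouping positions into names once; by= takes a character vector.
group_names <- names(DT)[3:4]

# .SDcols = c(2, 1) makes .SD hold column 2 first and column 1 second,
# so .SD[[1L]] is the column to maximise and .SD[[2L]] the one to report.
res <- DT[, {
  pos <- which.max(.SD[[1L]])
  list(Y = .SD[[1L]][pos], X = .SD[[2L]][pos])
}, by = group_names, .SDcols = c(2, 1)]
```

Changing the grouping then only means changing the positions in names(DT)[3:4], for example to names(DT)[3:5] if a further column were added.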

Did you find an answer that works for you? Something like this is possible, but it is not pretty, which may make it hard to maintain:

DF[, list(Y=max(eval(as.symbol(colnames(DF)[2]))),
          X=eval(as.symbol(colnames(DF)[1]))[which.max(eval(as.symbol(colnames(DF)[2])))]),
          by=list(Z=eval(as.symbol(colnames(DF)[3])),
                  Q=eval(as.symbol(colnames(DF)[4])))]

You could wrap the as.symbol(colnames()) lookup in a small function to make this easier to read:

cn <- function( dt, col ) { as.symbol(colnames(dt)[col]) }

DF[, list(Y=max(eval(cn(DF,2))),
          X=eval(cn(DF,1))[which.max(eval(cn(DF,2)))]),
          by=list(Z=eval(cn(DF,3)), Q=eval(cn(DF,4)))]

Does this solve the problem of grouping by column numbers for you?
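The question also asks about grouping by additional columns and returning more than one column at the maximum. One way this might be sketched, assuming the column to maximise and the columns to return are all passed as positions, is shown below. The helper name row_at_max, its arguments, and the extra column W are hypothetical and only serve the illustration.

```r
library(data.table)

set.seed(7)
DT <- data.table(X = 1:20, Y = sample(c(0, 1), 20, TRUE),
                 Z = sample(0:5, 20, TRUE), Q = sample(0:5, 20, TRUE),
                 W = sample(letters[1:3], 20, TRUE))  # W: hypothetical extra column

# Hypothetical helper: per group, take the row where column `y_col` is largest
# and return `y_col` plus every column listed in `keep_cols` from that row.
row_at_max <- function(dt, y_col, keep_cols, group_cols) {
  by_names <- names(dt)[group_cols]
  dt[, {
    pos <- which.max(.SD[[1L]])   # .SD[[1L]] is y_col because it is listed first
    lapply(.SD, `[`, pos)         # keeps the original column names
  }, by = by_names, .SDcols = c(y_col, keep_cols)]
}

res <- row_at_max(DT, y_col = 2, keep_cols = c(1, 5), group_cols = 3:4)
```

Because the result columns come from .SD, no column name appears anywhere in the call; adding a grouping column or another returned column is a matter of extending group_cols or keep_cols.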

You could combine grep with your code to look up the grouping columns:

> set.seed(007) 
> DF <- data.frame(X=1:20, Y=sample(c(0,1), 20, TRUE), Z=sample(0:5, 20, TRUE), Q =sample(0:5, 20, TRUE))
> library(data.table)
> DF <- data.table(DF)
> DF[, list(Y=max(Y),X=X[which.max(Y)]), by=c(names(DF)[grep("Q", colnames(DF))], names(DF)[grep("Z", colnames(DF))])]
    Q Z Y  X
 1: 4 3 1  1
 2: 2 1 0  2
 3: 4 5 0  3
 4: 2 5 1 19
 5: 5 5 0  5
 6: 1 0 1  6
 7: 0 3 0  7
 8: 4 2 1  8
 9: 5 2 0 10
10: 3 4 0 11
11: 4 1 1 13
12: 3 1 0 14
13: 0 2 1 17
14: 1 4 0 18
15: 1 2 0 20 
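One caveat with the grep approach is that the pattern is a regular expression, so it matches every name that contains it. The short base R sketch below contrasts this with match(), which only returns exact matches; the name "Q2" is a hypothetical addition to show the difference.

```r
cols <- c("X", "Y", "Z", "Q", "Q2")   # "Q2" is hypothetical, added to show the pitfall

partial <- grep("Q", cols)            # every name containing "Q"
exact   <- match(c("Z", "Q"), cols)   # exact names only, in the order requested
back    <- cols[exact]                # positions back to names, usable in by=
```

Here partial picks up both "Q" and "Q2", while exact gives only the positions of "Z" and "Q".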
