apply function to all values in data.table subset

I have a pairwise table of values, and I'm trying to find the fastest way to apply some function to various subsets of this table. I'm experimenting with data.table to see if it will suit my needs.

For example, I start with this vector of data points, which I convert to a pairwise distance matrix.

dat <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA

It looks like this:

> pdist
    spA spB spC spD spE spF
spA  NA  NA  NA  NA  NA  NA
spB   6  NA  NA  NA  NA  NA
spC   4   2  NA  NA  NA  NA
spD   3   9   7  NA  NA  NA
spE   1   5   3   4  NA  NA
spF   5   1   1   8   4  NA

Converting this table to a data.table

library(data.table)
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)
> pdist
    rn spA spB spC spD spE spF
1: spA  NA  NA  NA  NA  NA  NA
2: spB   6  NA  NA  NA  NA  NA
3: spC   4   2  NA  NA  NA  NA
4: spD   3   9   7  NA  NA  NA
5: spE   1   5   3   4  NA  NA
6: spF   5   1   1   8   4  NA

If I have some subset that I want to extract the values for,

sub <- c('spB', 'spF', 'spD')

I can do the following, which yields the submatrix that I am interested in:

> pdist[.(sub), sub, with=FALSE]
   spB spF spD
1:  NA  NA  NA
2:   1  NA   8
3:   9  NA  NA

Now, how can I apply a function (for example the mean, but potentially a custom function) to all values in this subset? I can do it this way, but I wonder whether there are better ways more in line with idiomatic data.table manipulation.

> mean(unlist(pdist[.(sub), sub, with=FALSE]), na.rm=TRUE)
[1] 6
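For reference, the keyed join and the aggregation can also be combined into a single data.table call, using .SDcols to restrict the columns (a sketch over the same example data):

```r
library(data.table)

# rebuild the keyed table from above
dat <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
pdist <- as.data.table(pdist, keep.rownames = TRUE)
setkey(pdist, rn)

sub <- c('spB', 'spF', 'spD')

# the keyed join selects the rows, .SDcols restricts the columns,
# and j computes the mean over the flattened subset in one call
pdist[.(sub), mean(unlist(.SD), na.rm = TRUE), .SDcols = sub]
# [1] 6
```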

UPDATE

Following up on this, I decided to compare the performance of the matrix and data.table approaches:

dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)

# a list of 10000 random subsets of 100 species each
spSub <- replicate(10000, sample(names(dat), 100), simplify=FALSE)

# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA

# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
setkey(pdistDT, sp)

matMethod <- function(pdist, sub) {
    return(mean(pdist[sub, sub], na.rm=TRUE))
}

dtMethod <- function(pdistDT, sub) {
    return(mean(unlist(pdistDT[.(sub), sub, with=FALSE]), na.rm=TRUE))
}


> system.time(q1 <- lapply(spSub, function(x) matMethod(pdist, x)))
   user  system elapsed 
 18.116   0.154  18.317 

> system.time(q2 <- lapply(spSub, function(x) dtMethod(pdistDT, x)))
   user  system elapsed 
795.456  13.357 806.820 

It appears that going through the data.table step here is leading to a big performance cost.

Please see the solution posted here for an even more general solution. It may also help: data.table: transforming subset of columns with a function, row by row

To apply the function, you can do the following:

Part 1. A Step-by-Step Solution

(1.a) Get the data into Data.Table format:

library(data.table)
library(magrittr) #for access to pipe operator
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)

(1.b) Then, Get the list of Column Names:

# Get the list of names
sub <- c('spB', 'spF', 'spD')

(1.c) Define the function you want to apply

# Define the function you wish to apply.
# Here, normalize is just an example custom function
# (the question used mean, but any function works the same way):

normalize <- function(X, X.mean = mean(X, na.rm = TRUE), X.sd = sd(X, na.rm = TRUE)) {
    X <- (X - X.mean) / X.sd
    return(X)
}

(1.d) Apply the function:

# Voila: 
pdist[, unlist(.SD, use.names = FALSE), .SDcols = sub] %>% normalize() 

#Or, you can apply the function inside the [], as below: 
pdist[, unlist(.SD, use.names = FALSE) %>% normalize(), .SDcols = sub] 

# Or, if you prefer to do it without the pipe operator:
pdist[, normalize(unlist(.SD, use.names = FALSE)), .SDcols = sub] 

Part 2. Some Advantages for Data.Table approach

Since you seem familiar with the matrix approach, I just wanted to point out some advantages of keeping the data.table approach.

(2.a) Apply functions within groups using "by ="

One advantage over a matrix is that you can still apply functions within groups by using the "by =" argument.

In the example here, I assume you have a variable called "Grp."

With by=Grp, the normalization is now computed separately within each group.

pdist[, unlist(.SD) %>% normalize(), .SDcols = sub, by=Grp]
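Since the question's pdist has no grouping column, here is a minimal toy sketch of the same pattern (the table DT, the Grp column, and the value columns v1/v2 are made up purely for illustration):

```r
library(data.table)
library(magrittr)

normalize <- function(X, X.mean = mean(X, na.rm = TRUE), X.sd = sd(X, na.rm = TRUE)) {
    (X - X.mean) / X.sd
}

# hypothetical table with a grouping column
DT <- data.table(Grp = rep(c("a", "b"), each = 3),
                 v1  = c(1, 2, 3, 10, 20, 30),
                 v2  = c(4, 5, 6, 40, 50, 60))
sub <- c("v1", "v2")

# the v1/v2 values are pooled and normalized within each level of Grp,
# so each group is centered on its own mean
res <- DT[, unlist(.SD) %>% normalize(), .SDcols = sub, by = Grp]
res
```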

(2.b) Another advantage is that you can keep other identifying information, for example, if each row has a "participant identifier" P.Id that you wish to keep and repeat:

pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id][order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]

In the first step, done in this portion of the code: pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id]

  1. First, we create a new column called Combined.Data for data in all three columns identified in "sub"
  2. Next, the appropriate participant identifier is repeated alongside each value of the combined data in column P.Id

In the second step, done in this portion of the code: [order(P.Id), .(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]

  1. We create a new column called Transformed to store the normalized values that result from applying the function normalize()
  2. In addition, we can also include the Combined.Data column as well

So, with this single line: pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id][order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]

  • we subset columns,
  • collapse data across the subset,
  • keep track of the identifier for each datum (P.Id) even when collapsed,
  • apply a transformation on the entire collapsed data, and
  • end up with a neat output in the form of a data table with 3 columns: (1) P.Id, (2) Transformed, and (3) Combined.Data (the original values),
  • and, the order(P.Id) allows the output to appear meaningfully ordered.
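The steps above can be sketched on a small toy table (the P.Id column and the measurement columns m1/m2/m3 are hypothetical):

```r
library(data.table)

normalize <- function(X, X.mean = mean(X, na.rm = TRUE), X.sd = sd(X, na.rm = TRUE)) {
    (X - X.mean) / X.sd
}

# hypothetical table: one row per participant, three measurement columns
DT <- data.table(P.Id = c("p1", "p2"),
                 m1 = c(1, 4), m2 = c(2, 5), m3 = c(3, 6))
sub <- c("m1", "m2", "m3")

# step 1: collapse the sub columns into Combined.Data, repeating P.Id;
# step 2: transform the entire collapsed data while keeping both the
# identifier and the original values
res <- DT[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by = P.Id][
    order(P.Id), .(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
res
```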

The same would be possible with the matrix approach, but it would be much more cumbersome and take more lines of code.

Data table allows for powerful manipulation and management of data, especially when you start chaining operations together.

(2.c) Finally, if you just wish to keep row information as simple row numbers, you can use the .I feature of the data.table package:

pdist[, .(.I, Normalized = normalize(unlist(.SD))), .SDcols = sub]

This feature can be quite helpful, especially if you don't have a participant or row identifier that is inherently meaningful.
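A self-contained sketch on the question's data (note that .I has one entry per source row, so it is recycled to match the length of the stacked unlist(.SD) vector):

```r
library(data.table)

# rebuild the example table from the question
dat <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
pdist <- as.data.table(pdist, keep.rownames = TRUE)
setkey(pdist, rn)
sub <- c('spB', 'spF', 'spD')

normalize <- function(X, X.mean = mean(X, na.rm = TRUE), X.sd = sd(X, na.rm = TRUE)) {
    (X - X.mean) / X.sd
}

# .I (length 6) is recycled against the 18 stacked values, cycling
# through the row numbers once per column in sub
res <- pdist[, .(row = .I, Normalized = normalize(unlist(.SD))), .SDcols = sub]
res
```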

Part 3. Disadvantage: Time Cost

I re-ran the benchmark above with corrected timing code, and the data.table solutions do take significantly longer:

dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)

spSub <- replicate(10000, sample(names(dat), 100), simplify=TRUE)

# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA

# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
setkey(pdistDT, sp)


matMethod <- function(pdist, sub) {
  return(mean(pdist[sub, sub], na.rm=TRUE))
}


dtMethod <- function(pdistDT, sub) {
  return(pdistDT[sub, sub, with = FALSE] %>% 
           unlist(., recursive = FALSE, use.names = FALSE) %>% 
           mean(., na.rm = TRUE))
}


dtMethod1 <- function(pdistDT, sub) {
  return(pdistDT[sub, sub, with = FALSE] %>% 
           melt.data.table(., measure.vars = sub, na.rm=TRUE) %$% 
           mean(value))
}


system.time(q1 <- apply(spSub, MARGIN = 2, function(x) matMethod(pdist, x)))
# user  system elapsed 
# 2.86    0.00    3.27 

system.time(q2 <- apply(spSub, MARGIN = 2, function(x) dtMethod(pdistDT, x)))
# user  system elapsed 
# 57.20    0.02   57.23 

system.time(q3 <- apply(spSub, MARGIN = 2, function(x) dtMethod1(pdistDT, x)))
# user  system elapsed 
# 62.78    0.06   62.91 
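If the repeated subsetting dominates, one option (not from the original post) is to convert the keyed table back to a plain matrix once and reuse fast matrix indexing by name; as.matrix.data.table() accepts a rownames argument for this:

```r
library(data.table)

dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)

pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
pdistDT <- as.data.table(pdist, keep.rownames = 'sp')
setkey(pdistDT, sp)

# one-off conversion back to a matrix, with the key as rownames
pdistM <- as.matrix(pdistDT, rownames = 'sp')

sub <- sample(names(dat), 100)
res <- mean(pdistM[sub, sub], na.rm = TRUE)
res
```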
