I have a pairwise table of values, and I'm trying to find the fastest way to apply some function to various subsets of this table. I'm experimenting with data.table to see if it will suit my needs.
For example, I start with this vector of data points, which I convert to a pairwise distance matrix.
dat <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
It looks like this:
> pdist
spA spB spC spD spE spF
spA NA NA NA NA NA NA
spB 6 NA NA NA NA NA
spC 4 2 NA NA NA NA
spD 3 9 7 NA NA NA
spE 1 5 3 4 NA NA
spF 5 1 1 8 4 NA
Converting this table to a data.table
library(data.table)
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)
> pdist
rn spA spB spC spD spE spF
1: spA NA NA NA NA NA NA
2: spB 6 NA NA NA NA NA
3: spC 4 2 NA NA NA NA
4: spD 3 9 7 NA NA NA
5: spE 1 5 3 4 NA NA
6: spF 5 1 1 8 4 NA
If I have some subset that I want to extract the values for,
sub <- c('spB', 'spF', 'spD')
I can do the following, which yields the submatrix that I am interested in:
> pdist[.(sub), sub, with=FALSE]
spB spF spD
1: NA NA NA
2: 1 NA 8
3: 9 NA NA
Now, how can I apply a function, for example taking the mean (but potentially a custom function), of all values in this subset? I can do it this way, but I wonder if there are better ways in line with data.table manipulation.
> mean(unlist(pdist[.(sub), sub, with=FALSE]), na.rm=TRUE)
[1] 6
UPDATE
Following up on this, I decided to see how different in performance a matrix vs a data.table approach would be:
dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)
spSub <- replicate(10000, sample(names(dat), 100), simplify=TRUE)
# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
setkey(pdistDT, sp)
matMethod <- function(pdist, sub) {
return(mean(pdist[sub, sub], na.rm=TRUE))
}
dtMethod <- function(pdistDT, sub) {
return(mean(unlist(pdistDT[.(sub), sub, with=FALSE]), na.rm=TRUE))
}
> system.time(q1 <- lapply(spSub, function(x) matMethod(pdist, x)))
user system elapsed
18.116 0.154 18.317
> system.time(q2 <- lapply(spSub, function(x) dtMethod(pdistDT, x)))
user system elapsed
795.456 13.357 806.820
It appears that going through the data.table step here is leading to a big performance cost.
Please see the solution posted here for an every more general solution. It may also help: data.table: transforming subset of columns with a function, row by row
To apply the function, you can do the following:
library(data.table)
library(magrittr) #for access to pipe operator
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)
# Get the list of names
sub <- c('spB', 'spF', 'spD')
#Define the function you wish to apply
# Where, normalize is just a function as defined in the question:
normalize <- function(X, X.mean = mean(X, na.rm=T), X.sd = sd(X, na.rm=T)){
X <- (X - X.mean) / X.sd
return(X)}
# Voila:
pdist[, unlist(.SD, use.names = FALSE), .SDcols = sub] %>% normalize()
#Or, you can apply the function inside the [], as below:
pdist[, unlist(.SD, use.names = FALSE) %>% normalize(), .SDcols = sub]
# Or, if you prefer to do it without the pipe operator:
pdist[, normalize(unlist(.SD, use.names = FALSE)), .SDcols = sub]
Since you seem familiar with matrix approach, I just wanted to point out some advantages of keeping the data.table approach
One advantage over matrix is that you can still apply functions within group by using the "by =" argument.
In the example here, I assume you have a variable called "Grp."
With the by=Grp
line, the normalization is within group now.
pdist[, unlist(.SD) %>% normalize(), .SDcols = sub, by=Grp]
pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id][order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
In the first step, done in this portion of the code: pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id]
In the second step, done in this portion of the code: [,.(P.Id, Normalized = normalize(Combined.Data), Combined.Data)]
normalize()
So, with this single line: pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id][order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
order(P.Id)
allows the output to appear meaningfully ordered.The same would be possible with matrix approach, but would be much more cumbersome and take more lines of code.
Data table allows for powerful manipulation and management of data, especially when you start chaining operations together.
pdist[, .(.I, normalize(unlist(.SD)), .SDcols = sub]
This feature can be quite helpful, especially if you dont have a participant or row identifier that is inherently meaningful.
I recreated the corrected time cost shown above and the solution for Data Table does take significantly longer
dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)
spSub <- replicate(10000, sample(names(dat), 100), simplify=TRUE)
# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
# pdistDT$sp %<>% as.factor()
setkey(pdistDT, sp)
matMethod <- function(pdist, sub) {
return(mean(pdist[sub, sub], na.rm=TRUE))
}
dtMethod <- function(pdistDT, sub) {
return(pdistDT[sub, sub, with = FALSE] %>%
unlist(., recursive = FALSE, use.names = FALSE) %>%
mean(., na.rm = TRUE))
}
dtMethod1 <- function(pdistDT, sub) {
return(pdistDT[sub, sub, with = FALSE] %>%
melt.data.table(., measure.vars = sub, na.rm=TRUE) %$%
mean(value))
}
system.time(q1 <- apply(spSub, MARGIN = 2, function(x) matMethod(pdist, x)))
# user system elapsed
# 2.86 0.00 3.27
system.time(q2 <- apply(spSub, MARGIN = 2, function(x) dtMethod(pdistDT, x)))
# user system elapsed
# 57.20 0.02 57.23
system.time(q3 <- apply(spSub, MARGIN = 2, function(x) dtMethod1(pdistDT, x)))
# user system elapsed
# 62.78 0.06 62.91
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.