Apply function to all values in data.table subset
I have a pairwise table of values, and I'm trying to find the fastest way to apply some function to various subsets of this table. I'm experimenting with data.table to see if it will suit my needs.
For example, I start with this vector of data points, which I convert to a pairwise distance matrix.
dat <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
It looks like this:
> pdist
spA spB spC spD spE spF
spA NA NA NA NA NA NA
spB 6 NA NA NA NA NA
spC 4 2 NA NA NA NA
spD 3 9 7 NA NA NA
spE 1 5 3 4 NA NA
spF 5 1 1 8 4 NA
Converting this table to a data.table:
library(data.table)
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)
> pdist
rn spA spB spC spD spE spF
1: spA NA NA NA NA NA NA
2: spB 6 NA NA NA NA NA
3: spC 4 2 NA NA NA NA
4: spD 3 9 7 NA NA NA
5: spE 1 5 3 4 NA NA
6: spF 5 1 1 8 4 NA
If I have some subset that I want to extract the values for,
sub <- c('spB', 'spF', 'spD')
I can do the following, which yields the submatrix that I am interested in:
> pdist[.(sub), sub, with=FALSE]
spB spF spD
1: NA NA NA
2: 1 NA 8
3: 9 NA NA
Now, how can I apply a function, for example taking the mean (but potentially a custom function), to all values in this subset? I can do it this way, but I wonder if there are better ways that are more in line with data.table manipulation.
> mean(unlist(pdist[.(sub), sub, with=FALSE]), na.rm=TRUE)
[1] 6
UPDATE
Following up on this, I decided to see how different in performance a matrix vs a data.table approach would be:
dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)
spSub <- replicate(10000, sample(names(dat), 100), simplify=TRUE)
# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
setkey(pdistDT, sp)
matMethod <- function(pdist, sub) {
return(mean(pdist[sub, sub], na.rm=TRUE))
}
dtMethod <- function(pdistDT, sub) {
return(mean(unlist(pdistDT[.(sub), sub, with=FALSE]), na.rm=TRUE))
}
> system.time(q1 <- lapply(spSub, function(x) matMethod(pdist, x)))
user system elapsed
18.116 0.154 18.317
> system.time(q2 <- lapply(spSub, function(x) dtMethod(pdistDT, x)))
user system elapsed
795.456 13.357 806.820
It appears that going through the data.table step here is leading to a big performance cost.
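One detail worth checking before reading too much into these timings: with simplify=TRUE, replicate() returns a character matrix (100 x 10000 here), and lapply() over a matrix visits every element rather than every column, so each call receives a single name instead of a whole subset. A small sketch of the difference, using toy sizes (the names nms, m, perElement and perColumn are only for illustration):

```r
# Toy version of spSub: 3 names per subset, 4 subsets, stored as a matrix
set.seed(1)
nms <- paste0("sp", 1:10)
m <- replicate(4, sample(nms, 3), simplify = TRUE)

# lapply() treats the matrix as a plain vector: one call per element
perElement <- lapply(m, function(x) length(x))

# apply() with MARGIN = 2 passes one whole column (one subset) per call
perColumn <- apply(m, MARGIN = 2, function(x) length(x))

length(perElement)  # number of calls made by lapply()
perColumn           # length of the subset seen by each apply() call
```

Passing simplify = FALSE to replicate() would instead give a list, which lapply() can walk subset by subset.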
Please see the solution posted here for an even more general solution. It may also help: data.table: transforming subset of columns with a function, row by row
To apply the function, you can do the following:
library(data.table)
library(magrittr) #for access to pipe operator
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)
# Get the list of names
sub <- c('spB', 'spF', 'spD')
# Define the function you wish to apply
# Here, normalize is just a function as defined in the question:
normalize <- function(X, X.mean = mean(X, na.rm = TRUE), X.sd = sd(X, na.rm = TRUE)) {
  X <- (X - X.mean) / X.sd
  return(X)
}
# Voila:
pdist[, unlist(.SD, use.names = FALSE), .SDcols = sub] %>% normalize()
#Or, you can apply the function inside the [], as below:
pdist[, unlist(.SD, use.names = FALSE) %>% normalize(), .SDcols = sub]
# Or, if you prefer to do it without the pipe operator:
pdist[, normalize(unlist(.SD, use.names = FALSE)), .SDcols = sub]
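Note that the calls above use every row of pdist. If the goal is the mean over the sub-by-sub block from the question, the keyed row lookup can be combined with .SDcols in a single call. A minimal sketch, assuming the small six-species example from the question (pd and subMean are illustrative names):

```r
library(data.table)

# Rebuild the small example from the question
dat <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pd <- as.matrix(dist(dat))
pd[upper.tri(pd, diag = TRUE)] <- NA
pdist <- as.data.table(pd, keep.rownames = TRUE)
setkey(pdist, rn)

sub <- c('spB', 'spF', 'spD')

# i selects the rows by key, .SDcols selects the columns,
# and j applies the function to all remaining values at once
subMean <- pdist[.(sub), mean(unlist(.SD), na.rm = TRUE), .SDcols = sub]
subMean
```

Any function of a numeric vector can take the place of mean() in j.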
Since you seem familiar with the matrix approach, I just wanted to point out some advantages of keeping the data.table approach.
One advantage over a matrix is that you can still apply functions within groups by using the "by =" argument.
In the example here, I assume you have a variable called "Grp".
With the by=Grp line, the normalization is now done within each group.
pdist[, unlist(.SD) %>% normalize(), .SDcols = sub, by=Grp]
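To make the by= behaviour concrete, here is a small sketch on made-up data. The table toy, its columns a and b, and the Grp values are all hypothetical; the pipe is dropped so that only data.table is needed.

```r
library(data.table)

normalize <- function(X, X.mean = mean(X, na.rm = TRUE), X.sd = sd(X, na.rm = TRUE)) {
  (X - X.mean) / X.sd
}

# Hypothetical data: two groups, two value columns
toy <- data.table(Grp = c("g1", "g1", "g2", "g2"),
                  a = c(1, 2, 10, 20),
                  b = c(3, 4, 30, 40))
cols <- c("a", "b")

# Within each Grp, pool the values of a and b, then normalize the pooled vector
res <- toy[, .(value = unlist(.SD, use.names = FALSE),
               z = normalize(unlist(.SD, use.names = FALSE))),
           .SDcols = cols, by = Grp]
res
```

Because the mean and standard deviation are computed per group, the z values in each group are centred on that group's own mean rather than on the overall mean.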
Operations can also be chained. Assuming an identifier column called P.Id (like Grp, it is not part of the example data):
pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id][order(P.Id), .(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
In the first step, done in this portion of the code:
pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id]
the columns named in sub are combined into a single column, Combined.Data, within each P.Id.
In the second step, done in this portion of the code:
[order(P.Id), .(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
the normalized values produced by normalize() are returned next to the original values.
So, with this single line, both steps run in one chained call, and order(P.Id) allows the output to appear meaningfully ordered. The same would be possible with the matrix approach, but it would be much more cumbersome and take more lines of code.
data.table allows for powerful manipulation and management of data, especially when you start chaining operations together.
pdist[, .(.I, normalize(unlist(.SD))), .SDcols = sub]
This feature (.I, the row number) can be quite helpful, especially if you don't have a participant or row identifier that is inherently meaningful.
I recreated the corrected time cost shown above, and the data.table solution does take significantly longer:
dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)
spSub <- replicate(10000, sample(names(dat), 100), simplify=TRUE)
# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA
# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
# pdistDT$sp %<>% as.factor()
setkey(pdistDT, sp)
matMethod <- function(pdist, sub) {
return(mean(pdist[sub, sub], na.rm=TRUE))
}
dtMethod <- function(pdistDT, sub) {
return(pdistDT[sub, sub, with = FALSE] %>%
unlist(., recursive = FALSE, use.names = FALSE) %>%
mean(., na.rm = TRUE))
}
dtMethod1 <- function(pdistDT, sub) {
return(pdistDT[sub, sub, with = FALSE] %>%
melt.data.table(., measure.vars = sub, na.rm=TRUE) %$%
mean(value))
}
system.time(q1 <- apply(spSub, MARGIN = 2, function(x) matMethod(pdist, x)))
# user system elapsed
# 2.86 0.00 3.27
system.time(q2 <- apply(spSub, MARGIN = 2, function(x) dtMethod(pdistDT, x)))
# user system elapsed
# 57.20 0.02 57.23
system.time(q3 <- apply(spSub, MARGIN = 2, function(x) dtMethod1(pdistDT, x)))
# user system elapsed
# 62.78 0.06 62.91
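If the matrix route is kept, one thing that might be worth trying is to translate the names to integer positions once, so that each call indexes the matrix by position rather than by character name. This is only a sketch with toy sizes and no timing claims; matMethodIdx, idxSub and q1idx are illustrative names.

```r
set.seed(42)
n <- 50
dat <- runif(n)
names(dat) <- paste0('sp', 1:n)
spSub <- replicate(20, sample(names(dat), 10), simplify = TRUE)

# Pairwise distance matrix, lower triangle only
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA

# Original name-based version, kept for comparison
matMethod <- function(pdist, sub) mean(pdist[sub, sub], na.rm = TRUE)

# Position-based version: idx is an integer vector of row/column positions
matMethodIdx <- function(pdist, idx) mean(pdist[idx, idx], na.rm = TRUE)

# Convert all names to positions in one vectorised step, keeping the matrix shape
idxSub <- matrix(match(spSub, rownames(pdist)), nrow = nrow(spSub))

q1 <- apply(spSub, MARGIN = 2, function(x) matMethod(pdist, x))
q1idx <- apply(idxSub, MARGIN = 2, function(i) matMethodIdx(pdist, i))

# Both versions should agree
all.equal(q1, q1idx)
```

Whether this helps in practice would need to be measured with system.time() on the full-size data.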