Apply function to all values in data.table subset

I have a pairwise table of values, and I'm trying to find the fastest way to apply some function to various subsets of this table. I'm experimenting with data.table to see if it will suit my needs.

For example, I start with this vector of data points, which I convert to a pairwise distance matrix.

dat <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA

It looks like this:

> pdist
    spA spB spC spD spE spF
spA  NA  NA  NA  NA  NA  NA
spB   6  NA  NA  NA  NA  NA
spC   4   2  NA  NA  NA  NA
spD   3   9   7  NA  NA  NA
spE   1   5   3   4  NA  NA
spF   5   1   1   8   4  NA

Converting this table to a data.table:

library(data.table)
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)
> pdist
    rn spA spB spC spD spE spF
1: spA  NA  NA  NA  NA  NA  NA
2: spB   6  NA  NA  NA  NA  NA
3: spC   4   2  NA  NA  NA  NA
4: spD   3   9   7  NA  NA  NA
5: spE   1   5   3   4  NA  NA
6: spF   5   1   1   8   4  NA

If I have some subset that I want to extract the values for,

sub <- c('spB', 'spF', 'spD')

I can do the following, which yields the submatrix that I am interested in:

> pdist[.(sub), sub, with=FALSE]
       spB spF spD
    1:  NA  NA  NA
    2:   1  NA   8
    3:   9  NA  NA

Now, how can I apply a function (for example the mean, but potentially a custom function) to all values in this subset? I can do it this way, but I wonder if there are better ways that are more in line with data.table manipulation.

> mean(unlist(pdist[.(sub), sub, with=FALSE]), na.rm=TRUE)
[1] 6

UPDATE

Following up on this, I decided to see how much a matrix approach and a data.table approach differ in performance:

dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)

spSub <- replicate(10000, sample(names(dat), 100), simplify=TRUE)

# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA

# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
setkey(pdistDT, sp)

matMethod <- function(pdist, sub) {
    return(mean(pdist[sub, sub], na.rm=TRUE))
}

dtMethod <- function(pdistDT, sub) {
    return(mean(unlist(pdistDT[.(sub), sub, with=FALSE]), na.rm=TRUE))
}


> system.time(q1 <- lapply(spSub, function(x) matMethod(pdist, x)))
   user  system elapsed 
 18.116   0.154  18.317 

> system.time(q2 <- lapply(spSub, function(x) dtMethod(pdistDT, x)))
   user  system elapsed 
795.456  13.357 806.820 

It appears that going through the data.table step here carries a big performance cost.

Please see the solution posted here for an even more general solution. This may also help: data.table: transforming subset of columns with a function, row by row

To apply the function, you can do the following:

Part 1. A Step-by-Step Solution

(1.a) Get the data into data.table format:

library(data.table)
library(magrittr) #for access to pipe operator
pdist <- as.data.table(pdist, keep.rownames=TRUE)
setkey(pdist, rn)

(1.b) Then, get the list of column names:

# Get the list of names
sub <- c('spB', 'spF', 'spD')

(1.c) Define the function you want to apply:

# Define the function you wish to apply.
# Here, normalize is just a function as defined in the question:

normalize <- function(X, X.mean = mean(X, na.rm = TRUE), X.sd = sd(X, na.rm = TRUE)) {
  X <- (X - X.mean) / X.sd
  return(X)
}

(1.d) Apply the function:

# Voila: 
pdist[, unlist(.SD, use.names = FALSE), .SDcols = sub] %>% normalize() 

#Or, you can apply the function inside the [], as below: 
pdist[, unlist(.SD, use.names = FALSE) %>% normalize(), .SDcols = sub] 

# Or, if you prefer to do it without the pipe operator:
pdist[, normalize(unlist(.SD, use.names = FALSE)), .SDcols = sub] 
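As a minimal sketch of the same pattern on the question's toy data: the keyed join on `rn` picks the rows and `.SDcols` picks the columns, so the function only sees the values of the subset. The name `pdistDT` is my own choice, used so that the matrix and the data.table stay distinct, and `mean` stands in for whatever function you want to apply.

```r
library(data.table)

# Toy data from the question
dat <- c(spA = 4, spB = 10, spC = 8, spD = 1, spE = 5, spF = 9)
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA

# pdistDT is an assumed name for the data.table copy of the matrix
pdistDT <- as.data.table(pdist, keep.rownames = TRUE)
setkey(pdistDT, rn)

sub <- c("spB", "spF", "spD")

# Rows are selected by the keyed join, columns by .SDcols;
# unlist() flattens the selected cells into one vector for the function
sub_mean <- pdistDT[.(sub),
                    mean(unlist(.SD, use.names = FALSE), na.rm = TRUE),
                    .SDcols = sub]
print(sub_mean)
```

This gives the same value (6) as the `mean(unlist(...))` call in the question, without building the intermediate subset table by hand.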

Part 2. Some Advantages of the data.table Approach

Since you seem familiar with the matrix approach, I just wanted to point out some advantages of keeping the data.table approach.

(2.a) Apply functions within groups by using "by ="

One advantage over a matrix is that you can still apply functions within groups by using the "by =" argument.

In the example here, I assume you have a variable called "Grp".

With by=Grp, the normalization is now done within each group.

pdist[, unlist(.SD) %>% normalize(), .SDcols = sub, by=Grp]
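To try the grouped version without the question's data, here is a small sketch on a made-up table. The table `dt`, its columns `x` and `y`, and the grouping column `Grp` are all hypothetical; the output column is named `normalized` only for readability.

```r
library(data.table)

normalize <- function(X, X.mean = mean(X, na.rm = TRUE), X.sd = sd(X, na.rm = TRUE)) {
  (X - X.mean) / X.sd
}

# Hypothetical toy table; Grp is an assumed grouping column
dt <- data.table(
  Grp = c("a", "a", "b", "b"),
  x   = c(1, 3, 10, 30),
  y   = c(5, 7, 50, 70)
)
cols <- c("x", "y")

# Within each group, flatten the selected columns and normalize them together
res <- dt[, .(normalized = normalize(unlist(.SD, use.names = FALSE))),
          by = Grp, .SDcols = cols]
print(res)
```

Each group is normalized against its own mean and standard deviation, so the values in group "b" are not affected by the smaller values in group "a".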

(2.b) Another advantage is that you can keep other identifying information. For example, if each row has a "participant identifier" P.Id that you wish to keep and repeat:

pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id][order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]

In the first step, done in this portion of the code: pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id]

  1. First, we create a new column called Combined.Data for data in all three columns identified in "sub"
  2. Next to each row of the combined data, the appropriate participant ID will repeat in column P.Id

In the second step, done in this portion of the code: [order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]

  1. We create a new column called Transformed to store the normalized values that result from applying the function normalize()
  2. In addition, we can also include the Combined.Data column

So, with this single line: pdist[, .(Combined.Data = unlist(.SD)), .SDcols = sub, by=P.Id][order(P.Id),.(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]

  • we subset columns,
  • collapse data across the subset,
  • keep track of the identifier for each datum (P.Id) even when collapsed,
  • apply a transformation on the entire collapsed data, and
  • end up with a neat output in the form of a data table with 3 columns: (1) P.Id, (2) Transformed, and (3) Combined.Data (original values).
  • In addition, order(P.Id) makes the output appear in a meaningful order.

The same would be possible with the matrix approach, but it would be much more cumbersome and take more lines of code.

data.table allows for powerful manipulation and management of data, especially when you start chaining operations together.
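The chained call from (2.b) can be tried on a small made-up table. This is only a sketch: the table `dt`, its value columns `a` and `b`, and the identifier column `P.Id` are hypothetical.

```r
library(data.table)

normalize <- function(X, X.mean = mean(X, na.rm = TRUE), X.sd = sd(X, na.rm = TRUE)) {
  (X - X.mean) / X.sd
}

# Hypothetical table with one row per participant (deliberately unsorted)
dt <- data.table(P.Id = c("p2", "p1"), a = c(1, 2), b = c(3, 4))
cols <- c("a", "b")

out <- dt[, .(Combined.Data = unlist(.SD, use.names = FALSE)),
          by = P.Id, .SDcols = cols][
            order(P.Id),
            .(P.Id, Transformed = normalize(Combined.Data), Combined.Data)]
print(out)
```

The first bracket collapses `a` and `b` into one long column while repeating `P.Id`; the second bracket sorts by `P.Id` and normalizes across all collapsed values at once.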

(2.c) Finally, if you just wish to keep row information as simple row numbers, you can use the .I feature of the data.table package:

pdist[, .(.I, normalize(unlist(.SD))), .SDcols = sub]

This feature can be quite helpful, especially if you don't have a participant or row identifier that is inherently meaningful.

Part 3. Disadvantage: Time Cost

I re-ran a corrected version of the benchmark shown above (iterating over the columns of spSub with apply(), since lapply() on a matrix visits each element separately), and the data.table solution does take significantly longer:

dat <- runif(1000)
names(dat) <- paste0('sp', 1:1000)

spSub <- replicate(10000, sample(names(dat), 100), simplify=TRUE)

# calculate pairwise distance matrix
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA

# convert to data.table
pdistDT <- as.data.table(pdist, keep.rownames='sp')
# pdistDT$sp %<>% as.factor()
setkey(pdistDT, sp)


matMethod <- function(pdist, sub) {
  return(mean(pdist[sub, sub], na.rm=TRUE))
}


dtMethod <- function(pdistDT, sub) {
  return(pdistDT[sub, sub, with = FALSE] %>% 
           unlist(., recursive = FALSE, use.names = FALSE) %>% 
           mean(., na.rm = TRUE))
}


dtMethod1 <- function(pdistDT, sub) {
  return(pdistDT[sub, sub, with = FALSE] %>% 
           melt.data.table(., measure.vars = sub, na.rm=TRUE) %$% 
           mean(value))
}


system.time(q1 <- apply(spSub, MARGIN = 2, function(x) matMethod(pdist, x)))
# user  system elapsed 
# 2.86    0.00    3.27 

system.time(q2 <- apply(spSub, MARGIN = 2, function(x) dtMethod(pdistDT, x)))
# user  system elapsed 
# 57.20    0.02   57.23 

system.time(q3 <- apply(spSub, MARGIN = 2, function(x) dtMethod1(pdistDT, x)))
# user  system elapsed 
# 62.78    0.06   62.91 
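Before comparing speed, it is worth confirming that the two approaches return the same number. The sketch below does that on a much smaller problem than the benchmark; the sizes (50 points, a subset of 10) and the seed are arbitrary choices of mine.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only for reproducibility
dat <- runif(50)
names(dat) <- paste0("sp", 1:50)
sub <- sample(names(dat), 10)

# Lower-triangular pairwise distance matrix, as in the benchmark
pdist <- as.matrix(dist(dat))
pdist[upper.tri(pdist, diag = TRUE)] <- NA

# Keyed data.table copy of the same matrix
pdistDT <- as.data.table(pdist, keep.rownames = "sp")
setkey(pdistDT, sp)

# Matrix approach: index rows and columns by name
mat_mean <- mean(pdist[sub, sub], na.rm = TRUE)

# data.table approach: keyed join for rows, character vector for columns
dt_mean <- mean(unlist(pdistDT[.(sub), sub, with = FALSE], use.names = FALSE),
                na.rm = TRUE)

print(all.equal(mat_mean, dt_mean))
```

Both calls average the same 45 pairwise distances, so the timing difference above reflects the overhead of the data.table subsetting step and not a difference in what is computed.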
