使用列作为参数在data.table中逐行应用函数

Question

I am trying to apply a function by row using data.table with columns as arguments. 我正在尝试使用data.table以列作为参数逐行应用函数。 I am currently using apply as suggested here 我目前正在按照这里的建议使用Apply

However, my data.table is 27 million rows with 7 columns so the apply operation takes a very long time when I run it recursively on many input files, the job takes up all available RAM (32Gb). 但是，我的data.table是2千7百万行，有7列，因此当我在许多输入文件上递归运行它时，应用操作将花费很长时间，该作业占用了所有可用的RAM（32Gb）。 It's likely that I am copying the data.table multiple times, though I'm not sure about that. 尽管我不确定，但我可能多次复制了data.table。

I would like help making this code more memory efficient given that each input file will be ~30 million rows by 7 columns and there are 30 input files to process. 鉴于每个输入文件将由7列组成约3000万行，并且有30个输入文件需要处理，因此我希望帮助提高此代码的存储效率。 I am fairly sure that the lines using apply are slowing down the whole code so alternatives that are more memory efficient or use vectorized functions would probably be better options. 我相当确定使用apply的行会减慢整个代码的速度，因此内存效率更高或使用矢量化函数的替代方法可能是更好的选择。

I've had a lot of trouble trying to write a vectorized function that takes in 4 columns as arguments and operates on a row by row basis, using data.table. 我在编写向量化函数时遇到很多麻烦，该函数采用4列作为参数，并使用data.table在逐行的基础上进行操作。 The apply solution in my example code works but it's very slow. 我的示例代码中的套用解决方案有效，但是速度很慢。 One alternative I tried is: 我尝试过的一种替代方法是：

cols=c("C","T","A","G")
func1<-function(x)x[max1(x)]
datU[,high1a:=func1(cols),by=1:nrow(datU)]

but the first 6 rows of the datU data.table output look like this: 但是datU data.table输出的前6行如下所示：

    Cycle   Tab ID  colA    colB    colC    colG    high1   high1a
1   0   45513   -233.781    -84.087 -3.141  3740.916    3740.916    colC
2   0   45513   -103.561    -347.382    2900.866    357.071 2900.866    colC
3   0   45513   153.383 4036.636    353.479 -42.736 4036.636    colC
4   0   45513   -147.941    28.994  4354.994    384.945 4354.994    colC
5   0   45513   -89.719 -504.643    1298.476    131.32  1298.476    colC
6   0   45513   -250.11 -30.862 1877.049    -184.772    1877.049    colC

Here is my code using apply that works (it produced the high1 column above), but is too slow and memory intensive: 这是我使用的代码，它可以正常工作（它产生了上面的high1列），但是太慢且占用大量内存：

#Get input files from top directory, searching through all subdirectories
    file_list <- list.files(pattern = "*.test.txt", recursive=TRUE, full.names=TRUE)

#Make a loop to recursively read files from subdirectories, determine highest and second highest values in specified columns, create new column with those values

    savelist=NULL
    for (i in file_list) {

    datU <- fread(i)
    name=dirname(i)

    #Compute highest and second highest for each row (cols 4,5,6,7) and the difference between highest and second highest values
    maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
    max1 <- maxn(1)
    max2 <- maxn(2)
    colNum=c(4,5,6,7)
    datU[,high1:=apply(datU[,colNum,with=FALSE],1,function(x)x[max1(x)])])
    datU[,high2:=apply(datU[,colNum,with=FALSE],1,function(x)x[max2(x)])]
    datU[,difference:=high1-high2,by=1:nrow(datU)]
    datU[,folder:=name]
    savelist[[i]]<-datU

}

#Create loop to iterate over folders and output data

sigout=NULL
for (i in savelist) {

   # Do some stuff to manipulate data frames, then merge them for output
setkey(i,Cycle,folder)
Sums1<-i[,sum(colA,colB,colC,colD),by=list(Cycle,folder)]
MeanTot<-Sums[,round(mean(V1),3),by=list(Cycle,folder)]
MeanTotsd<-Sums[,round(sd(V1),3),by=list(Cycle,folder)]
Meandiff<-i[,list(meandiff=mean(difference)),by=list(Cycle,folder)]
Meandiffsd<-i[,list(meandiff=sd(difference)),by=list(Cycle,folder)]

df1out<-merge(MeanTot,MeanTotsd,by=list(Cycle,folder))
df2out<-merge(Meandiff,Meandiffsd,by=list(Cycle,folder))
sigout<-merge(df1out,df2out)

#Output values 
write.table(sigout,"Sigout.txt",append=TRUE,quote=FALSE,sep=",",row.names=FALSE,col.names=TRUE)
}

I would love some examples concerning alternative functions to apply that will give me the highest and second highest values for each row for columns 4,5,6,7 which can be identified by index or alternatively by column name. 我喜欢一些与替代功能有关的示例，这些示例将为我提供第4、5、6、7列每行的最高值和第二最高值，这些值可以通过索引或列名来标识。

Thank you! 谢谢！

Answer 1

You could do something like this: 您可以执行以下操作：

DF <- read.table(text = "    Cycle   Tab ID  colA    colB    colC    colG    high1   high1a
1   0   45513   -233.781    -84.087 -3.141  3740.916    3740.916    colC
                 2   0   45513   -103.561    -347.382    2900.866    357.071 2900.866    colC
                 3   0   45513   153.383 4036.636    353.479 -42.736 4036.636    colC
                 4   0   45513   -147.941    28.994  4354.994    384.945 4354.994    colC
                 5   0   45513   -89.719 -504.643    1298.476    131.32  1298.476    colC
                 6   0   45513   -250.11 -30.862 1877.049    -184.772    1877.049    colC", header = TRUE)

library(data.table)
setDT(DF)

maxTwo <- function(x) {
  ind <- length(x) - (1:0) #the index is equal for all rows,
                           #so it could be made a function parameter
                           #for better efficiency
  as.list(sort.int(x, partial = ind)[ind]) #partial sorting
}

DF[, paste0("max", 1:2) := maxTwo(unlist(.SD)), 
    by = seq_len(nrow(DF)), .SDcols = 4:7]
DF[, diffMax := max2 - max1]

#   Cycle Tab    ID     colA     colB     colC     colG    high1 high1a    max1     max2  diffMax
#1:     1   0 45513 -233.781  -84.087   -3.141 3740.916 3740.916   colC  -3.141 3740.916 3744.057
#2:     2   0 45513 -103.561 -347.382 2900.866  357.071 2900.866   colC 357.071 2900.866 2543.795
#3:     3   0 45513  153.383 4036.636  353.479  -42.736 4036.636   colC 353.479 4036.636 3683.157
#4:     4   0 45513 -147.941   28.994 4354.994  384.945 4354.994   colC 384.945 4354.994 3970.049
#5:     5   0 45513  -89.719 -504.643 1298.476  131.320 1298.476   colC 131.320 1298.476 1167.156
#6:     6   0 45513 -250.110  -30.862 1877.049 -184.772 1877.049   colC -30.862 1877.049 1907.911

However, you'd still be looping over the rows, which means nrow calls to the function. 但是，您仍将循环遍历所有行，这意味着nrow函数进行nrow调用。 You could try Rcpp to do the looping in compiled code. 您可以尝试使用Rcpp在已编译的代码中进行循环。

Answer 2

Depending on how you want to deal with duplicates, eg if you don't have them or want to group them together, you could do: 根据您要如何处理重复项，例如，如果您没有重复项或要将它们分组在一起，则可以执行以下操作：

d = data.table(a = 1:4, b = 4:1, c = c(2,1,1,4))
#   a b c
#1: 1 4 2
#2: 2 3 1
#3: 3 2 1
#4: 4 1 4

high1 = do.call(pmax, d)
#[1] 4 3 3 4
high2 = do.call(pmax, d * (d != high1))
#[1] 2 2 2 1

Otherwise, you could just add some jitter out of the scope of your precision (I chose a large amount to keep it visible): 否则，您可以在精度范围之外添加一些抖动（我选择了很多抖动使其可见）：

d.jitter = d + runif(nrow(d) * ncol(d), 0, 1e-4)
#          a        b        c
#1: 1.000044 4.000090 2.000008
#2: 2.000076 3.000029 1.000034
#3: 3.000007 2.000029 1.000036
#4: 4.000001 1.000069 4.000041

high1.j = do.call(pmax, d.jitter)
high2 = do.call(pmax, d * (d.jitter != high1.j))
#[1] 2 2 2 4

Translation to relevant .SD and .SDcols semantics is left as a simple exercise to the reader. .SD .SDcols将相关的.SD和.SDcols语义进行翻译。

使用列作为参数在data.table中逐行应用函数

问题描述

2 个解决方案

解决方案1
2 已采纳 2015-06-08 19:35:01

解决方案2
1 2015-06-08 20:09:58

使用列作为参数在data.table中逐行应用函数

问题描述

2 个解决方案

解决方案1 2 已采纳 2015-06-08 19:35:01

解决方案2 1 2015-06-08 20:09:58

解决方案1
2 已采纳 2015-06-08 19:35:01

解决方案2
1 2015-06-08 20:09:58