Apply function by row in data.table using columns as arguments
I am trying to apply a function by row in data.table, using columns as arguments. I am currently using apply, as suggested here.
However, my data.table has 27 million rows and 7 columns, so the apply operation takes a very long time. When I run it in a loop over many input files, the job takes up all available RAM (32 GB). It's likely that I am copying the data.table multiple times, though I'm not sure about that.
I would like help making this code more memory efficient, given that each input file will be ~30 million rows by 7 columns and there are 30 input files to process. I am fairly sure that the lines using apply are slowing down the whole code, so alternatives that are more memory efficient or use vectorized functions would probably be better options.
I've had a lot of trouble trying to write a vectorized function that takes 4 columns as arguments and operates row by row in data.table. The apply solution in my example code works, but it's very slow. One alternative I tried is:
cols=c("C","T","A","G")
func1<-function(x)x[max1(x)]
datU[,high1a:=func1(cols),by=1:nrow(datU)]
but the first 6 rows of the datU data.table output look like this:
Cycle Tab ID colA colB colC colG high1 high1a
1 0 45513 -233.781 -84.087 -3.141 3740.916 3740.916 colC
2 0 45513 -103.561 -347.382 2900.866 357.071 2900.866 colC
3 0 45513 153.383 4036.636 353.479 -42.736 4036.636 colC
4 0 45513 -147.941 28.994 4354.994 384.945 4354.994 colC
5 0 45513 -89.719 -504.643 1298.476 131.32 1298.476 colC
6 0 45513 -250.11 -30.862 1877.049 -184.772 1877.049 colC
Here is my code using apply that works (it produced the high1 column above), but it is too slow and memory intensive:
#Get input files from top directory, searching through all subdirectories
file_list <- list.files(pattern = "*.test.txt", recursive=TRUE, full.names=TRUE)
#Make a loop to recursively read files from subdirectories, determine highest and second highest values in specified columns, create new column with those values
savelist=NULL
for (i in file_list) {
datU <- fread(i)
name=dirname(i)
#Compute highest and second highest for each row (cols 4,5,6,7) and the difference between highest and second highest values
maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
max1 <- maxn(1)
max2 <- maxn(2)
colNum=c(4,5,6,7)
datU[,high1:=apply(datU[,colNum,with=FALSE],1,function(x)x[max1(x)])]
datU[,high2:=apply(datU[,colNum,with=FALSE],1,function(x)x[max2(x)])]
datU[,difference:=high1-high2,by=1:nrow(datU)]
datU[,folder:=name]
savelist[[i]]<-datU
}
#Create loop to iterate over folders and output data
sigout=NULL
for (i in savelist) {
# Do some stuff to manipulate data frames, then merge them for output
setkey(i,Cycle,folder)
Sums<-i[,sum(colA,colB,colC,colD),by=list(Cycle,folder)]
MeanTot<-Sums[,round(mean(V1),3),by=list(Cycle,folder)]
MeanTotsd<-Sums[,round(sd(V1),3),by=list(Cycle,folder)]
Meandiff<-i[,list(meandiff=mean(difference)),by=list(Cycle,folder)]
Meandiffsd<-i[,list(meandiff=sd(difference)),by=list(Cycle,folder)]
df1out<-merge(MeanTot,MeanTotsd,by=c("Cycle","folder"))
df2out<-merge(Meandiff,Meandiffsd,by=c("Cycle","folder"))
sigout<-merge(df1out,df2out)
#Output values
write.table(sigout,"Sigout.txt",append=TRUE,quote=FALSE,sep=",",row.names=FALSE,col.names=TRUE)
}
I would love some examples of alternatives to apply that give me the highest and second-highest values in each row for columns 4, 5, 6 and 7, identified either by index or by column name.
Thank you!
You could do something like this:
DF <- read.table(text = " Cycle Tab ID colA colB colC colG high1 high1a
1 0 45513 -233.781 -84.087 -3.141 3740.916 3740.916 colC
2 0 45513 -103.561 -347.382 2900.866 357.071 2900.866 colC
3 0 45513 153.383 4036.636 353.479 -42.736 4036.636 colC
4 0 45513 -147.941 28.994 4354.994 384.945 4354.994 colC
5 0 45513 -89.719 -504.643 1298.476 131.32 1298.476 colC
6 0 45513 -250.11 -30.862 1877.049 -184.772 1877.049 colC", header = TRUE)
library(data.table)
setDT(DF)
maxTwo <- function(x) {
ind <- length(x) - (1:0) #the index is equal for all rows,
#so it could be made a function parameter
#for better efficiency
as.list(sort.int(x, partial = ind)[ind]) #partial sorting
}
DF[, paste0("max", 1:2) := maxTwo(unlist(.SD)),
by = seq_len(nrow(DF)), .SDcols = 4:7]
DF[, diffMax := max2 - max1]
# Cycle Tab ID colA colB colC colG high1 high1a max1 max2 diffMax
#1: 1 0 45513 -233.781 -84.087 -3.141 3740.916 3740.916 colC -3.141 3740.916 3744.057
#2: 2 0 45513 -103.561 -347.382 2900.866 357.071 2900.866 colC 357.071 2900.866 2543.795
#3: 3 0 45513 153.383 4036.636 353.479 -42.736 4036.636 colC 353.479 4036.636 3683.157
#4: 4 0 45513 -147.941 28.994 4354.994 384.945 4354.994 colC 384.945 4354.994 3970.049
#5: 5 0 45513 -89.719 -504.643 1298.476 131.320 1298.476 colC 131.320 1298.476 1167.156
#6: 6 0 45513 -250.110 -30.862 1877.049 -184.772 1877.049 colC -30.862 1877.049 1907.911
However, you'd still be looping over the rows, which means nrow calls to the function. You could try Rcpp to do the looping in compiled code.
Depending on how you want to deal with duplicates, e.g. if you don't have them or want to group them together, you could do:
d = data.table(a = 1:4, b = 4:1, c = c(2,1,1,4))
# a b c
#1: 1 4 2
#2: 2 3 1
#3: 3 2 1
#4: 4 1 4
high1 = do.call(pmax, d)
#[1] 4 3 3 4
high2 = do.call(pmax, d * (d != high1))
#[1] 2 2 2 1
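One caveat: `d * (d != high1)` replaces each row maximum with 0, so it only yields the intended second-highest value when the remaining values are non-negative. The data in the question contains negative values, so a variant that masks the maximum with NA instead of 0 may be safer. The following is a minimal sketch under that assumption; the sample values are made up for illustration, and a plain data.frame is used only to keep the example dependency-free (a data.table is also a list of columns, so the same calls should apply).

```r
# Hypothetical sample data with negative values (not from the original post)
d <- data.frame(a = c(-1, 2, 3, 4), b = c(-4, 3, 2, 1), c = c(-2, 1, 1, 4))

# Row-wise maximum: pmax is vectorized over rows, so there is no per-row function call
high1 <- do.call(pmax, d)

# Replace every cell equal to its row maximum with NA,
# then take the row-wise maximum of what is left, ignoring the NAs
masked <- lapply(d, function(col) replace(col, col == high1, NA))
high2 <- do.call(pmax, c(masked, na.rm = TRUE))

high1
high2
```

As in the snippet above, tied maxima are grouped together: if the highest value occurs twice in a row, high2 is the next distinct value. A row whose values are all identical would give NA for high2.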
Otherwise, you could add some jitter below the precision of your data (I chose a large amount to keep it visible):
d.jitter = d + runif(nrow(d) * ncol(d), 0, 1e-4)
# a b c
#1: 1.000044 4.000090 2.000008
#2: 2.000076 3.000029 1.000034
#3: 3.000007 2.000029 1.000036
#4: 4.000001 1.000069 4.000041
high1.j = do.call(pmax, d.jitter)
high2 = do.call(pmax, d * (d.jitter != high1.j))
#[1] 2 2 2 4
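If adding random noise is undesirable, a possible alternative that also counts duplicates separately is to locate the position of each row maximum with base R's `max.col`, mask that single cell, and take the maximum again. This is only a sketch, not part of the original answer; it converts the columns to a matrix, which costs an extra copy of those columns in memory. The data mirrors `d` above but is built as a data.frame to keep the example self-contained.

```r
d <- data.frame(a = 1:4, b = 4:1, c = c(2, 1, 1, 4))

m <- as.matrix(d)          # numeric matrix copy of the value columns
rows <- seq_len(nrow(m))

# Column index of the first maximum in each row
idx1 <- max.col(m, ties.method = "first")
high1 <- m[cbind(rows, idx1)]

# Mask only that one cell per row, so a duplicated maximum survives
m[cbind(rows, idx1)] <- -Inf
high2 <- m[cbind(rows, max.col(m, ties.method = "first"))]

high1
high2
```

Here the last row, which contains 4 twice, reports 4 as both the highest and the second-highest value, matching the jitter result.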
Translation to the relevant .SD and .SDcols semantics is left as a simple exercise for the reader.
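For reference, one way that translation might look is sketched below. The table name `DT` is hypothetical, the rows are the first three from the question's sample output, and the NA-masking step is a substitution for the multiply-by-mask step (an assumption made here because the question's data contains negative values). Columns are added by reference with `:=`, which avoids copying the whole table.

```r
library(data.table)

# First three rows of the question's sample data (DT is a hypothetical name)
DT <- data.table(
  Cycle = 1:3,
  colA = c(-233.781, -103.561, 153.383),
  colB = c(-84.087, -347.382, 4036.636),
  colC = c(-3.141, 2900.866, 353.479),
  colG = c(3740.916, 357.071, -42.736)
)

cols <- c("colA", "colB", "colC", "colG")  # could also be column positions

# Highest value per row, vectorized over rows via pmax
DT[, high1 := do.call(pmax, .SD), .SDcols = cols]

# Second-highest: mask cells equal to the row maximum with NA, then pmax again
DT[, high2 := do.call(pmax, c(
  lapply(.SD, function(col) replace(col, col == high1, NA)),
  na.rm = TRUE
)), .SDcols = cols]

# No by = 1:nrow(DT) needed: subtraction is already vectorized
DT[, difference := high1 - high2]
```

Because no grouping by row is involved, each statement makes a handful of vectorized passes over the columns rather than one function call per row.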