R data.table operations with multiple groups in single data.table and outside function with lapply

slow function by groups in data.table r
My experimental design has trees measured in several forests, with measurements repeated over multiple years.
DT <- data.table(forest = rep(c("a", "b"), each = 6),
                 year   = rep(c("2000", "2010"), each = 3),
                 id     = c("1", "2", "3"),
                 size   = 1:12)
DT[, id := paste0(forest, id)]
> DT
forest year id size
1: a 2000 a1 1
2: a 2000 a2 2
3: a 2000 a3 3
4: a 2010 a1 4
5: a 2010 a2 5
6: a 2010 a3 6
7: b 2000 b1 7
8: b 2000 b2 8
9: b 2000 b3 9
10: b 2010 b1 10
11: b 2010 b2 11
12: b 2010 b3 12
For each tree i, I want to compute a new variable equal to the sum of the sizes of all other individuals in the same forest/year group that are larger than tree i.

I wrote the following function:
f.new <- function(i, n){
  DT[forest == DT[id == i, unique(forest)] & year == n  # same forest & year as tree i
     & size > DT[id == i & year == n, size],            # keep only trees larger than tree i
     sum(size, na.rm = TRUE)]                           # sum the sizes of those trees
}
Applied within the data.table, it gives the correct result:

DT[, new := f.new(id, year), by = .(id, year)]
> DT
forest year id size new
1: a 2000 a1 1 5
2: a 2000 a2 2 3
3: a 2000 a3 3 0
4: a 2010 a1 4 11
5: a 2010 a2 5 6
6: a 2010 a3 6 0
7: b 2000 b1 7 17
8: b 2000 b2 8 9
9: b 2000 b3 9 0
10: b 2010 b1 10 23
11: b 2010 b2 11 12
12: b 2010 b3 12 0
Note that my real dataset is large: several forests (40), repeated years (6), and many individuals (20,000), for roughly 50,000 measurements in total. Running the function above takes 8-10 minutes (Windows 7, i5-6300U CPU @ 2.40 GHz, 8 GB RAM). I need to rerun it frequently with slight modifications, so this costs a lot of time.
Simply sorting the data first makes this very fast:
setorder(DT, forest, year, -size)
DT[, new := cumsum(size) - size, by = .(forest, year)]
setorder(DT, forest, year, id)
DT
# forest year id size new
# 1: a 2000 a1 1 5
# 2: a 2000 a2 2 3
# 3: a 2000 a3 3 0
# 4: a 2010 a1 4 11
# 5: a 2010 a2 5 6
# 6: a 2010 a3 6 0
# 7: b 2000 b1 7 17
# 8: b 2000 b2 8 9
# 9: b 2000 b3 9 0
#10: b 2010 b1 10 23
#11: b 2010 b2 11 12
#12: b 2010 b3 12 0
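The sort-based answer works because, within each forest-year group ordered by descending size, the running total up to (and excluding) a given tree is exactly the sum of all larger trees, hence `cumsum(size) - size`. A minimal sketch, using the same toy data as above, that checks the fast version against a straightforward per-tree reference computed with `sapply` (the `ref` column is my own naming, not from the original post):

```r
library(data.table)

DT <- data.table(forest = rep(c("a", "b"), each = 6),
                 year   = rep(c("2000", "2010"), each = 3),
                 id     = c("1", "2", "3"),
                 size   = 1:12)
DT[, id := paste0(forest, id)]

# Slow reference: for each tree, sum the sizes of strictly larger
# trees within the same forest-year group.
DT[, ref := sapply(size, function(s) sum(size[size > s])), by = .(forest, year)]

# Fast version: after sorting by descending size within each group,
# the cumulative sum minus the tree's own size is the sum of all
# larger trees in that group.
setorder(DT, forest, year, -size)
DT[, new := cumsum(size) - size, by = .(forest, year)]
setorder(DT, forest, year, id)

stopifnot(all(DT$new == DT$ref))
```

The key design point is that the fast version replaces one subsetting scan of the whole table per tree (quadratic work overall) with a single sort plus one grouped `cumsum`, which is why it scales to tens of thousands of rows.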