R: efficiently identifying highest N values of variable Z by group X
I have a data table that looks like this.
library(data.table)
ID <- c(rep("ABC",4), rep("DEF",4), rep("GHI",5))
X <- c(rep(c(1,2,3,4),3),5)
set.seed(1234)
Z <- runif(13,min=0, max =1)
a <- data.table(ID, X, Z)
a
     ID X           Z
 1: ABC 1 0.113703411
 2: ABC 2 0.622299405
 3: ABC 3 0.609274733
 4: ABC 4 0.623379442
 5: DEF 1 0.860915384
 6: DEF 2 0.640310605
 7: DEF 3 0.009495756
 8: DEF 4 0.232550506
 9: GHI 1 0.666083758
10: GHI 2 0.514251141
11: GHI 3 0.693591292
12: GHI 4 0.544974836
13: GHI 5 0.282733584
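I want to produce a data frame containing the N highest values of Z within each X subgroup. So, say N is 2. I would like to get a dataset that looks like this:
   X  ID         Z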
1: 1 DEF 0.8609154
2: 1 GHI 0.6660838
3: 2 DEF 0.6403106
4: 2 ABC 0.6222994
5: 3 GHI 0.6935913
6: 3 ABC 0.6092747
7: 4 ABC 0.6233794
8: 4 GHI 0.5449748
9: 5 GHI 0.2827336
I have been using the lines below to do this, but I find them particularly slow when the data table is large (i.e. 1,500,000 rows or more).
top_n <- 2
a <- a[order(a$X, -a$Z),]
a_2 <- a[, head(.SD, top_n), by=X]
a_2
   X  ID         Z
1: 1 DEF 0.8609154
2: 1 GHI 0.6660838
3: 2 DEF 0.6403106
4: 2 ABC 0.6222994
5: 3 GHI 0.6935913
6: 3 ABC 0.6092747
7: 4 ABC 0.6233794
8: 4 GHI 0.5449748
9: 5 GHI 0.2827336
Any help would be greatly appreciated!
Thanks!
This should be faster than .SD:
n <- 2
indx <- a[order(-Z), .I[seq_len(n)], by = X]$V1
a[indx]
# ID X Z
# 1: DEF 1 0.8609154
# 2: GHI 1 0.6660838
# 3: GHI 3 0.6935913
# 4: ABC 3 0.6092747
# 5: DEF 2 0.6403106
# 6: ABC 2 0.6222994
# 7: ABC 4 0.6233794
# 8: GHI 4 0.5449748
# 9: GHI 5 0.2827336
# 10: NA NA NA
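The trailing NA row above appears because X = 5 has only one row, so `.I[seq_len(n)]` indexes past the end of that group. A minimal sketch of one way to avoid it, assuming it is acceptable to return fewer than n rows for small groups, caps the index at the group size with `min(n, .N)`. The name `res` is only illustrative and not part of the original answer.

```r
library(data.table)

# Same example data as in the question
ID <- c(rep("ABC", 4), rep("DEF", 4), rep("GHI", 5))
X <- c(rep(c(1, 2, 3, 4), 3), 5)
set.seed(1234)
Z <- runif(13, min = 0, max = 1)
a <- data.table(ID, X, Z)

n <- 2
# min(n, .N) caps the index at the group size, so a group with fewer
# than n rows (X == 5 here) no longer contributes an NA index
indx <- a[order(-Z), .I[seq_len(min(n, .N))], by = X]$V1
res <- a[indx]
res
```

With this change the result has 9 rows and no NA row; groups still appear in the order in which they are first met after sorting by descending Z.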
If you need the result ordered, this should also be fast (note that setorder() reorders a by reference):
setorder(a, X, -Z)
indx <- a[, .I[seq_len(n)], by = X]$V1
a[indx]
# ID X Z
# 1: DEF 1 0.8609154
# 2: GHI 1 0.6660838
# 3: DEF 2 0.6403106
# 4: ABC 2 0.6222994
# 5: GHI 3 0.6935913
# 6: ABC 3 0.6092747
# 7: ABC 4 0.6233794
# 8: GHI 4 0.5449748
# 9: GHI 5 0.2827336
# 10: NA NA NA
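To check that the index-based approach returns the same rows as `head(.SD, top_n)`, and to get a rough feel for the speed difference, something like the following could be run. The synthetic table `big`, its size, and the names `res_sd` and `res_i` are assumptions made for illustration and do not come from the original post. The sketch also caps the index with `min(top_n, .N)` so that small groups produce no NA rows.

```r
library(data.table)

# Hypothetical synthetic data: 100,000 rows spread over up to 20,000 groups
set.seed(1)
n_rows <- 1e5
big <- data.table(
  ID = sample(c("ABC", "DEF", "GHI"), n_rows, replace = TRUE),
  X  = sample.int(2e4, n_rows, replace = TRUE),
  Z  = runif(n_rows)
)
top_n <- 2

# Approach from the question: sort, then head(.SD) within each group
t_sd <- system.time({
  s1 <- big[order(X, -Z)]
  res_sd <- s1[, head(.SD, top_n), by = X]
})

# Approach from the answer: sort by reference, then subset by row index
t_i <- system.time({
  s2 <- copy(big)            # copy so that `big` itself is not reordered
  setorder(s2, X, -Z)
  res_i <- s2[s2[, .I[seq_len(min(top_n, .N))], by = X]$V1]
})

# Both approaches should pick the same rows in the same order
stopifnot(
  identical(res_sd$X, res_i$X),
  identical(res_sd$ID, res_i$ID),
  identical(res_sd$Z, res_i$Z)
)

# Elapsed seconds for each approach
c(head_SD = t_sd[["elapsed"]], dot_I = t_i[["elapsed"]])
```

Timings depend on the machine, the number of groups, and the data.table version, so the numbers are only indicative.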