Find the 2 largest values for each factor in R

Question

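I have a question about finding the two largest values of column C for each unique ID in column A, and then computing the mean of column B. A sample of my data is shown below: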

ID  layer   weight
1   0.6843629   0.35
1   0.6360772   0.70
1   0.6392318   0.14
2   0.3848640   0.05
2   0.3882660   0.30
2   0.3877026   0.10
2   0.3964194   0.60
2   0.4273218   0.02
2   0.3869507   0.12
3   0.4748541   0.07
3   0.5853659   0.42
3   0.5383678   0.10
3   0.6060287   0.60
4   0.4859274   0.08
4   0.4720740   0.48
4   0.5126481   0.08
4   0.5280899   0.48
5   0.7492097   0.07
5   0.7220433   0.35
5   0.8750000   0.10
5   0.8302752   0.50
6   0.4306283   0.10
6   0.4890895   0.25
6   0.3790714   0.20
6   0.5139686   0.50
6   0.3885678   0.02
6   0.4706815   0.05
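
For anyone who wants to run the code in this thread, the sample table above could be loaded roughly as follows. This is only a sketch: the object name `dat` is an assumption, chosen because two of the answers call the data `dat` (the question's own code calls it `index1`, and one answer calls it `df`).

```r
# Rebuild the sample table from the question as a data frame.
# The name `dat` is an assumption; rename it to match whichever snippet you run.
dat <- read.table(header = TRUE, text = "
ID  layer      weight
1   0.6843629  0.35
1   0.6360772  0.70
1   0.6392318  0.14
2   0.3848640  0.05
2   0.3882660  0.30
2   0.3877026  0.10
2   0.3964194  0.60
2   0.4273218  0.02
2   0.3869507  0.12
3   0.4748541  0.07
3   0.5853659  0.42
3   0.5383678  0.10
3   0.6060287  0.60
4   0.4859274  0.08
4   0.4720740  0.48
4   0.5126481  0.08
4   0.5280899  0.48
5   0.7492097  0.07
5   0.7220433  0.35
5   0.8750000  0.10
5   0.8302752  0.50
6   0.4306283  0.10
6   0.4890895  0.25
6   0.3790714  0.20
6   0.5139686  0.50
6   0.3885678  0.02
6   0.4706815  0.05
")

str(dat)  # three columns: ID, layer, weight
```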

For each ID, I want to compute the mean of layer using only the two rows with the highest weight.

I can do this in R with the following code:

library(plyr)        # ddply
library(data.table)

ind.max1 <- ddply(index1, "ID", function(x) x[which.max(x$weight), ])  # highest-weight row per ID
dt1 <- data.table(index1, key = c("layer"))
dt2 <- data.table(ind.max1, key = c("layer"))
index2 <- dt1[!dt2]  # anti-join on layer: drop the rows already selected
ind.max2 <- ddply(index2, "ID", function(x) x[which.max(x$weight), ])  # next-highest row per ID
ind.max.all <- merge(ind.max1, ind.max2, all = TRUE)
ind.ndvi.mean <- as.data.frame(tapply(ind.max.all$layer, list(ind.max.all$ID), mean))

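This uses ddply to select the row with the highest weight for each ID and puts it in a data frame together with layer. data.table is then used to remove those highest-weight rows from the original data. I then repeat the ddply step to select the next maximum and merge the two data frames of maximum-weight rows into one. Finally, tapply computes the mean. There must be a more efficient way to do this. Does anyone have any insight? Cheers.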

Answer 1

You can use data.table:

 library(data.table)
 setDT(dat)[, list(Meanlayer = mean(layer[order(-weight)[1:2]])), by=ID]
 #   ID Meanlayer
 #1:  1 0.6602200
 #2:  2 0.3923427
 #3:  3 0.5956973
 #4:  4 0.5000819
 #5:  5 0.7761593
 #6:  6 0.5015291

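Order the weight column in descending order: order(-weight)
Select the first two entries of that ordering, [1:2], within each group defined by ID
Subset the corresponding layer values with that index: layer[order..]
Take the mean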
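
The four steps above can be traced by hand for a single group. The sketch below uses the three rows for ID 1 from the question; the intermediate names `idx` and `top2` are introduced here purely for illustration and do not appear in the answer.

```r
# Values for ID 1, copied from the sample data in the question.
layer  <- c(0.6843629, 0.6360772, 0.6392318)
weight <- c(0.35, 0.70, 0.14)

idx  <- order(-weight)  # row positions by descending weight: 2 1 3
top2 <- idx[1:2]        # positions of the two largest weights: 2 1
layer[top2]             # 0.6360772 0.6843629
mean(layer[top2])       # about 0.66022, in line with the ID 1 row of the output
```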

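Alternatively, in 1.9.3 (the current development version), or from the next release onward, a function setorder is exported for reordering data.tables in any order, by reference: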

require(data.table) ## 1.9.3+
setorder(setDT(dat), ID, -weight) ## dat is now reordered as we require
dat[, mean(layer[1:min(.N, 2L)]), by=ID]

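By sorting first, we avoid calling order() once for each group (each unique value of ID). The more groups there are, the bigger the benefit. setorder() is also more efficient than order() because it does not need to create a copy of the data.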
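
To illustrate the two points in that snippet, here is a small sketch on a made-up table. The table `toy` and its values are hypothetical and not part of the question's data; ID 2 is given a single row on purpose, to show what `min(.N, 2L)` guards against.

```r
library(data.table)

# Hypothetical toy table: ID 2 has only one row.
toy <- data.table(ID     = c(1, 1, 1, 2),
                  layer  = c(10, 20, 30, 40),
                  weight = c(0.1, 0.9, 0.5, 0.3))

setorder(toy, ID, -weight)  # reorders toy in place; no assignment is needed
toy$weight                  # 0.9 0.5 0.1 0.3

res <- toy[, list(Meanlayer = mean(layer[1:min(.N, 2L)])), by = ID]
res$Meanlayer               # 25 40

# With a plain 1:2, the single-row group would index past its end:
# layer[1:2] would be c(40, NA), and its mean would be NA.
```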

Answer 2

This is really a StackOverflow question anyway! I'm not sure whether the version below is efficient enough for you...

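# df is the data frame from the question (columns ID, layer, weight)
s.ind <- tapply(df$weight, df$ID, function(x) order(x, decreasing = TRUE))  # per-ID positions by descending weight
val <- tapply(df$layer, df$ID, function(x) x)                               # per-ID layer values

foo <- function(x, y) list(x[y][1:2])  # layer values of the two highest-weight rows
lapply(mapply(foo, val, s.ind), mean)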

Answer 3

I think this will do it. Assuming the data is called dat,

sapply(split(dat, dat$ID), function(x) {
    with(x, {
        mean(layer[ weight %in% rev(sort(weight))[1:2] ])
    })
})
#         1         2         3         4         5         6 
# 0.6602200 0.3923427 0.5956973 0.5000819 0.7761593 0.5015291

You may need to pass na.rm = TRUE as the second argument to mean to account for any rows containing NA values.
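
As a rough illustration of that point, the sketch below applies the same selection expression to a made-up group in which one of the two highest-weight rows has a missing layer. The vectors are hypothetical and not taken from the question's data.

```r
# Hypothetical group: the row with the largest weight has a missing layer.
layer  <- c(0.5, NA, 0.7)
weight <- c(0.2, 0.9, 0.6)

top2 <- layer[weight %in% rev(sort(weight))[1:2]]  # NA 0.7
mean(top2)                 # NA
mean(top2, na.rm = TRUE)   # 0.7
```

Note that na.rm = TRUE drops the NA after the two rows have been picked, so in this case the mean is taken over a single remaining value.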

Alternatively, mapply might be faster, and uses exactly the same code, just in a different order,

mapply(function(x) { 
      with(x, {
          mean(layer[ weight %in% rev(sort(weight))[1:2] ])
          })
      }, split(dat, dat$ID))
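
One thing to keep in mind with the `%in%` selection is how it treats tied weights: it matches on values rather than positions, so it can return more than two rows when the second-largest weight is repeated. The sample data in the question does not trigger this, which is why the results agree with the `order()`-based answer. The sketch below uses a made-up group to show the difference; the vectors are hypothetical.

```r
# Hypothetical group where the second-largest weight (0.2) occurs twice.
layer  <- c(1, 2, 3, 4)
weight <- c(0.5, 0.2, 0.2, 0.1)

top <- rev(sort(weight))[1:2]  # 0.5 0.2
layer[weight %in% top]         # 1 2 3  (three rows match)
layer[order(-weight)[1:2]]     # 1 2    (always exactly two positions)
```

Which behaviour is preferable depends on whether tied rows should all count toward the mean.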

Answer 1
3 2014-08-09 06:40:08

Answer 2
1 2014-08-09 01:58:35

Answer 3
0 2014-08-09 02:58:16