Subset a data frame based on column entry (or rank)

Question

I have a simple data.frame like this one:

id group idu  value
1  1     1_1  34
2  1     2_1  23
3  1     3_1  67
4  2     4_2  6
5  2     5_2  24
6  2     6_2  45
1  3     1_3  34
2  3     2_3  67
3  3     3_3  76

From it I want to retrieve a subset containing the first entry of each group, something like:

id group idu value
1  1     1_1 34
4  2     4_2 6
1  3     1_3 34

id is not unique, so the approach should not rely on it.

Can I achieve this without using loops?

dput() of the data:

structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L), group = c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), idu = structure(c(1L, 3L, 5L, 
7L, 8L, 9L, 2L, 4L, 6L), .Label = c("1_1", "1_3", "2_1", "2_3", 
"3_1", "3_3", "4_2", "5_2", "6_2"), class = "factor"), value = c(34L, 
23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L)), .Names = c("id", "group", 
"idu", "value"), class = "data.frame", row.names = c(NA, -9L))

Answer 1

Using Gavin's million-row df:

DF3 <- data.frame(id = sample(1000, 1000000, replace = TRUE),
                  group = factor(rep(1:1000, each = 1000)),
                  value = runif(1000000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))

I think the fastest way is to reorder the data frame and then use duplicated():

system.time({
  DF4 <- DF3[order(DF3$group), ]
  out2 <- DF4[!duplicated(DF4$group), ]
})
# user  system elapsed 
# 0.335   0.107   0.441

By comparison, Gavin's fastest lapply + split method takes 7 seconds on my computer.

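In general, when working with data frames, the fastest approach is usually to generate all the indices first and then do a single subset.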
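As a small illustration of the "build the indices, then subset once" idea, here is a minimal sketch on the nine-row data from the question. The names DF, idx and first_rows are chosen for this example only, and idu is stored as a character column rather than a factor to keep the sketch short.

```r
# Example data from the question (idu kept as character for brevity)
DF <- data.frame(
  id    = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L),
  group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  idu   = c("1_1", "2_1", "3_1", "4_2", "5_2", "6_2", "1_3", "2_3", "3_3"),
  value = c(34L, 23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L),
  stringsAsFactors = FALSE
)

# Step 1: compute the row indices of the first occurrence of each group
idx <- which(!duplicated(DF$group))

# Step 2: subset the data frame once, using all indices at the same time
first_rows <- DF[idx, ]
print(first_rows)
```

Because duplicated() marks everything after the first occurrence of a value, this picks the first row per group in the existing row order; sort with order() beforehand if the output should be ordered by group.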

Answer 2

Update in light of the OP's comments

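If you are doing this on more than a million rows, all of the options supplied so far will be slow. Here are some comparison timings on a dummy data set of 100,000 rows: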

set.seed(12)
DF3 <- data.frame(id = sample(1000, 100000, replace = TRUE),
                  group = factor(rep(1:100, each = 1000)),
                  value = runif(100000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))

> system.time(out1 <- do.call(rbind, lapply(split(DF3, DF3["group"]), `[`, 1, )))
   user  system elapsed 
 19.594   0.053  19.984 
> system.time(out3 <- aggregate(DF3[,-2], DF3["group"], function (x) x[1]))
   user  system elapsed 
 12.419   0.141  12.788

I gave up on running these with a million rows. Believe it or not, the following is far faster:

out2 <- matrix(unlist(lapply(split(DF3[, -4], DF3["group"]), `[`, 1,)),
               byrow = TRUE, nrow = (lev <- length(levels(DF3$group))))
colnames(out2) <- names(DF3)[-4]
rownames(out2) <- seq_len(lev)
out2 <- as.data.frame(out2)
out2$group <- factor(out2$group)
out2$idu <- factor(paste(out2$id, out2$group, sep = "_"),
                   levels = levels(DF3$idu))

The outputs are (effectively) the same:

> all.equal(out1, out2)
[1] TRUE
> all.equal(out1, out3[, c(2,1,3,4)])
[1] "Attributes: < Component 2: Modes: character, numeric >"              
[2] "Attributes: < Component 2: target is character, current is numeric >"

(The difference between out1 (or out2) and out3 (the aggregate() version) is only in the row names of the components.)

The timing for the matrix version (out2) is:

   user  system elapsed 
  0.163   0.001   0.168

That timing is for the 100,000-row problem. For this million-row problem:

set.seed(12)
DF3 <- data.frame(id = sample(1000, 1000000, replace = TRUE),
                  group = factor(rep(1:1000, each = 1000)),
                  value = runif(1000000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))

the timing is:

   user  system elapsed 
 11.916   0.000  11.925

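The matrix version (the one that produces out2) gets through the million rows faster than either of the other versions handles the 100,000-row problem. This just shows that working with matrices is very quick indeed, and that the bottleneck in my do.call() version is rbind()-ing the results together.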
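To make the matrix idea easier to follow, here is a minimal sketch of the same steps on a tiny all-integer data frame. The names DF, pieces, firsts, m and out are made up for this example. A matrix can hold only one type, which is presumably why the code above drops the idu column before unlist() and rebuilds it afterwards.

```r
# Small all-integer data frame (assumption: every column shares one type)
DF <- data.frame(
  id    = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L),
  group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  value = c(34L, 23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L)
)

# Split by group and take the first row of each piece
pieces <- split(DF, DF$group)
firsts <- lapply(pieces, function(d) d[1, ])

# Flatten to one vector and fill a matrix row by row,
# instead of rbind()-ing many one-row data frames
m <- matrix(unlist(firsts, use.names = FALSE),
            nrow = length(pieces), byrow = TRUE)
colnames(m) <- names(DF)

# Convert back to a data frame once, at the end
out <- as.data.frame(m)
print(out)
```

The single matrix() call replaces the repeated data frame binding, which is the step the timings above point to as the bottleneck.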

The timing for the million-row problem was obtained with:

system.time({out4 <- matrix(unlist(lapply(split(DF3[, -4], DF3["group"]),
                                          `[`, 1,)),
                            byrow = TRUE,
                            nrow = (lev <- length(levels(DF3$group))))
             colnames(out4) <- names(DF3)[-4]
             rownames(out4) <- seq_len(lev)
             out4 <- as.data.frame(out4)
             out4$group <- factor(out4$group)
             out4$idu <- factor(paste(out4$id, out4$group, sep = "_"),
                                levels = levels(DF3$idu))})

Original

If your data are in DF, then:

do.call(rbind, lapply(with(DF, split(DF, group)), head, 1))

will do what you want:

> do.call(rbind, lapply(with(DF, split(DF, group)), head, 1))
  idu group
1   1     1
2   4     2
3   7     3

If the new data are in DF2, then we get:

> do.call(rbind, lapply(with(DF2, split(DF2, group)), head, 1))
  id group idu value
1  1     1 1_1    34
2  4     2 4_2     6
3  1     3 1_3    34

For speed, though, we probably want to subset directly instead of using head(), and we can gain a little more by not using with(), e.g.:

do.call(rbind, lapply(split(DF2, DF2$group), `[`, 1, ))

> system.time(replicate(1000, do.call(rbind, lapply(split(DF2, DF2$group), `[`, 1, ))))
   user  system elapsed 
  3.847   0.040   4.044
> system.time(replicate(1000, do.call(rbind, lapply(split(DF2, DF2$group), head, 1))))
   user  system elapsed 
  4.058   0.038   4.111
> system.time(replicate(1000, aggregate(DF2[,-2], DF2["group"], function (x) x[1])))
   user  system elapsed 
  3.902   0.042   4.106

Answer 3

I think this will do the trick:

aggregate(data["idu"], data["group"], function (x) x[1])

For your updated question, I would recommend using ddply from the plyr package:

ddply(data, .(group), function (x) x[1,])

Answer 4

One solution uses plyr, assuming your data are in an object named zzz:

ddply(zzz, "group", function(x) x[1 ,])

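Another option takes the difference between consecutive rows and should prove faster, but it relies on the object being ordered beforehand. It also assumes that none of your group values is 0: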

zzz <- zzz[order(zzz$group) ,]

zzz[ diff(c(0,zzz$group)) != 0, ]
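The restriction on a group value of 0 comes from padding the vector with 0 before calling diff(). A possible variant, sketched here with made-up data and the hypothetical names is_first and first_rows, marks the first row explicitly instead. It still assumes a numeric (non-factor) group column and at least one row.

```r
# Made-up data that includes a group value of 0
zzz <- data.frame(
  group = c(2L, 0L, 0L, 1L, 2L, 1L),
  value = c(10, 20, 30, 40, 50, 60)
)

# The diff() trick needs rows of the same group to be adjacent
zzz <- zzz[order(zzz$group), ]

# A row starts a new group when its group differs from the previous row;
# the first row always starts a group, so prepend TRUE rather than padding with 0
is_first <- c(TRUE, diff(zzz$group) != 0)

first_rows <- zzz[is_first, ]
print(first_rows)
```

Since order() keeps tied rows in their original relative order, the row kept for each group is the one that appeared first in the unsorted data.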

4 answers

Answer 1: 10, accepted, 2011-04-28 14:37:25
Answer 2: 5, 2011-04-27 14:00:20
Answer 3: 1, 2011-04-27 13:58:05
Answer 4: 1, 2011-04-27 14:07:25
