
Subset a data frame based on column entry (or rank)

I have a data.frame as simple as this one:

id group idu  value
1  1     1_1  34
2  1     2_1  23
3  1     3_1  67
4  2     4_2  6
5  2     5_2  24
6  2     6_2  45
1  3     1_3  34
2  3     2_3  67
3  3     3_3  76

from where I want to retrieve a subset with the first entries of each group; something like:

id group idu value
1  1     1_1 34
4  2     4_2 6
1  3     1_3 34

id is not unique so the approach should not rely on it.

Can I achieve this while avoiding loops?

dput() of data:

structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L), group = c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), idu = structure(c(1L, 3L, 5L, 
7L, 8L, 9L, 2L, 4L, 6L), .Label = c("1_1", "1_3", "2_1", "2_3", 
"3_1", "3_3", "4_2", "5_2", "6_2"), class = "factor"), value = c(34L, 
23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L)), .Names = c("id", "group", 
"idu", "value"), class = "data.frame", row.names = c(NA, -9L))

Using Gavin's million row df:

DF3 <- data.frame(id = sample(1000, 1000000, replace = TRUE),
                  group = factor(rep(1:1000, each = 1000)),
                  value = runif(1000000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))

I think the fastest way is to reorder the data frame and then use duplicated():

system.time({
  DF4 <- DF3[order(DF3$group), ]
  out2 <- DF4[!duplicated(DF4$group), ]
})
# user  system elapsed 
# 0.335   0.107   0.441

This compares to 7 seconds for Gavin's fastest lapply + split method on my computer.
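The same two lines answer the question's original small example directly (a quick sketch, rebuilding the data frame by hand rather than from the dput() above):

```r
DF <- data.frame(id = c(1, 2, 3, 4, 5, 6, 1, 2, 3),
                 group = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 idu = c("1_1", "2_1", "3_1", "4_2", "5_2", "6_2",
                         "1_3", "2_3", "3_3"),
                 value = c(34, 23, 67, 6, 24, 45, 34, 67, 76))

DF <- DF[order(DF$group), ]    # ensure the groups are contiguous
DF[!duplicated(DF$group), ]    # keep only the first row of each group
#   id group idu value
# 1  1     1 1_1    34
# 4  4     2 4_2     6
# 7  1     3 1_3    34
```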

Generally, when working with data frames, the fastest approach is usually to generate all the indices and then do a single subset.
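For instance, all the first-of-group row positions can be computed in one pass with match() and used in a single subset (a sketch on the question's data, rebuilt by hand here; match() does not require the rows to be sorted):

```r
DF <- data.frame(id = c(1, 2, 3, 4, 5, 6, 1, 2, 3),
                 group = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 value = c(34, 23, 67, 6, 24, 45, 34, 67, 76))

idx <- match(unique(DF$group), DF$group)  # first index of each group: 1, 4, 7
DF[idx, ]                                 # one subset, no loop
```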

Update in light of OP's comment

If doing this on million+ rows, all options thus far supplied will be slow. Here are some comparison timings on a dummy data set of 100,000 rows:

set.seed(12)
DF3 <- data.frame(id = sample(1000, 100000, replace = TRUE),
                  group = factor(rep(1:100, each = 1000)),
                  value = runif(100000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))

> system.time(out1 <- do.call(rbind, lapply(split(DF3, DF3["group"]), `[`, 1, )))
   user  system elapsed 
 19.594   0.053  19.984 
> system.time(out3 <- aggregate(DF3[,-2], DF3["group"], function (x) x[1]))
   user  system elapsed 
 12.419   0.141  12.788 

I gave up doing them with a million rows. Far faster, believe it or not, is:

out2 <- matrix(unlist(lapply(split(DF3[, -4], DF3["group"]), `[`, 1,)),
               byrow = TRUE, nrow = (lev <- length(levels(DF3$group))))
colnames(out2) <- names(DF3)[-4]
rownames(out2) <- seq_len(lev)
out2 <- as.data.frame(out2)
out2$group <- factor(out2$group)
out2$idu <- factor(paste(out2$id, out2$group, sep = "_"),
                   levels = levels(DF3$idu))

The outputs are (effectively) the same:

> all.equal(out1, out2)
[1] TRUE
> all.equal(out1, out3[, c(2,1,3,4)])
[1] "Attributes: < Component 2: Modes: character, numeric >"              
[2] "Attributes: < Component 2: target is character, current is numeric >"

(the difference between out1 (or out2) and out3 (the aggregate() version) is just in the rownames of the components.)

with a timing of:

   user  system elapsed 
  0.163   0.001   0.168

on the 100,000 row problem, and on this million row problem:

set.seed(12)
DF3 <- data.frame(id = sample(1000, 1000000, replace = TRUE),
                  group = factor(rep(1:1000, each = 1000)),
                  value = runif(1000000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))

with a timing of:

   user  system elapsed 
 11.916   0.000  11.925

Working with the matrix version (that produces out2) is quicker at doing the million-row problem than the other versions are at doing the 100,000-row problem. This just shows that working with matrices is very quick indeed, and that the bottleneck in my do.call() version is rbind()-ing the result together.

The million row problem timing was done with:

system.time({out4 <- matrix(unlist(lapply(split(DF3[, -4], DF3["group"]),
                                          `[`, 1,)),
                            byrow = TRUE,
                            nrow = (lev <- length(levels(DF3$group))))
             colnames(out4) <- names(DF3)[-4]
             rownames(out4) <- seq_len(lev)
             out4 <- as.data.frame(out4)
             out4$group <- factor(out4$group)
             out4$idu <- factor(paste(out4$id, out4$group, sep = "_"),
                                levels = levels(DF3$idu))})

Original

If your data are in DF, say, then:

do.call(rbind, lapply(with(DF, split(DF, group)), head, 1))

will do what you want:

> do.call(rbind, lapply(with(DF, split(DF, group)), head, 1))
  idu group
1   1     1
2   4     2
3   7     3

If the new data are in DF2 then we get:

> do.call(rbind, lapply(with(DF2, split(DF2, group)), head, 1))
  id group idu value
1  1     1 1_1    34
2  4     2 4_2     6
3  1     3 1_3    34

But for speed, we probably want to subset instead of using head(), and we can gain a bit by not using with(), e.g.:

do.call(rbind, lapply(split(DF2, DF2$group), `[`, 1, ))

> system.time(replicate(1000, do.call(rbind, lapply(split(DF2, DF2$group), `[`, 1, ))))
   user  system elapsed 
  3.847   0.040   4.044
> system.time(replicate(1000, do.call(rbind, lapply(split(DF2, DF2$group), head, 1))))
   user  system elapsed 
  4.058   0.038   4.111
> system.time(replicate(1000, aggregate(DF2[,-2], DF2["group"], function (x) x[1])))
   user  system elapsed 
  3.902   0.042   4.106

I think this will do the trick:

aggregate(data["idu"], data["group"], function (x) x[1])
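For illustration, on the question's data (rebuilt here by hand, with the data frame called data as in this answer) it returns one idu per group:

```r
data <- data.frame(id = c(1, 2, 3, 4, 5, 6, 1, 2, 3),
                   group = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                   idu = c("1_1", "2_1", "3_1", "4_2", "5_2", "6_2",
                           "1_3", "2_3", "3_3"),
                   value = c(34, 23, 67, 6, 24, 45, 34, 67, 76))

# First idu of each group; the by-variable comes first in the result
first_per_group <- aggregate(data["idu"], data["group"], function(x) x[1])
first_per_group
#   group idu
# 1     1 1_1
# 2     2 4_2
# 3     3 1_3
```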

For your updated question, I'd recommend using ddply from the plyr package:

ddply(data, .(group), function (x) x[1,])

One solution using plyr, assuming your data is in an object named zzz:

ddply(zzz, "group", function(x) x[1 ,])

Another option takes the difference between rows and should prove faster, but it relies on the object being ordered beforehand. This also assumes you don't have a group value of 0:

zzz <- zzz[order(zzz$group) ,]

zzz[ diff(c(0,zzz$group)) != 0, ]
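A quick check of the diff() trick on the question's data (rebuilt here as a hypothetical zzz; note the trick works only because group is numeric):

```r
zzz <- data.frame(id = c(1, 2, 3, 4, 5, 6, 1, 2, 3),
                  group = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                  value = c(34, 23, 67, 6, 24, 45, 34, 67, 76))

zzz <- zzz[order(zzz$group), ]                    # must be sorted by group first
firsts <- zzz[diff(c(0, zzz$group)) != 0, ]       # rows where group changes
firsts
#   id group value
# 1  1     1    34
# 4  4     2     6
# 7  1     3    34
```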
