How to remove unique entries and keep duplicates in R
ID Cat1 Cat2 Cat3 Cat4
A0001 358 11.25 37428 0
A0001 279 14.6875 38605 0
A0013 367 5.125 40152 1
A0014 337 16.3125 38624 0
A0020 367 8.875 37797 0
A0020 339 9.625 39324 0
I need help learning how to remove the unique rows in my file while keeping the duplicates or triplicates. For example, the output should look like this:
ID Cat1 Cat2 Cat3 Cat4
A0001 358 11.25 37428 0
A0001 279 14.6875 38605 0
A0020 367 8.875 37797 0
A0020 339 9.625 39324 0
If you can give me advice on how to approach this problem, it would be much appreciated.
Thanks for everyone's suggestions. I wanted to calculate the difference in value in the different categories (i.e. Cat2, Cat3) between the repeated measures (by unique ID). I would appreciate any suggestions.
Another option in base R, using duplicated:
dx[dx$ID %in% dx$ID[duplicated(dx$ID)],]
# ID Cat1 Cat2 Cat3 Cat4
# 1 A0001 358 11.2500 37428 0
# 2 A0001 279 14.6875 38605 0
# 5 A0020 367 8.8750 37797 0
# 6 A0020 339 9.6250 39324 0
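For anyone who wants to run the snippets in this thread, here is a minimal sketch that rebuilds the sample data from the question and applies the same filter step by step. The object name dx follows the answer above; dup_ids and kept are names I made up for illustration, and stringsAsFactors = FALSE is my own choice rather than something the original post specifies.

```r
# Rebuild the sample data from the question (values copied from the post)
dx <- data.frame(
  ID   = c("A0001", "A0001", "A0013", "A0014", "A0020", "A0020"),
  Cat1 = c(358, 279, 367, 337, 367, 339),
  Cat2 = c(11.25, 14.6875, 5.125, 16.3125, 8.875, 9.625),
  Cat3 = c(37428, 38605, 40152, 38624, 37797, 39324),
  Cat4 = c(0, 0, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# IDs that occur more than once (hypothetical helper name)
dup_ids <- unique(dx$ID[duplicated(dx$ID)])

# Keep every row whose ID is in that set, including the first occurrence
kept <- dx[dx$ID %in% dup_ids, ]
```

Splitting the one-liner this way makes it easier to see that duplicated only flags the second and later occurrences, which is why the %in% lookup is needed to pull the first occurrence back in.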
Using duplicated together with its fromLast version, you get:
library(data.table)
setkey(setDT(dx),ID) # or with data.table 1.9.5+: setDT(dx,key="ID")
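# by="ID" is needed in data.table 1.9.8+, where duplicated() compares all columns by default
dx[duplicated(dx, by="ID") | duplicated(dx, by="ID", fromLast=TRUE)]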
# ID Cat1 Cat2 Cat3 Cat4
# 1: A0001 358 11.2500 37428 0
# 2: A0001 279 14.6875 38605 0
# 3: A0020 367 8.8750 37797 0
# 4: A0020 339 9.6250 39324 0
This can be applied to base R also, but I prefer data.table here for the syntactic sugar.
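As a rough sketch of the base R equivalent mentioned above: the same two duplicated calls can be applied to the ID column of a plain data frame. The small dx below is a cut-down copy of the question's data, and is_dup and res are illustrative names only.

```r
# Cut-down copy of the question's data (ID and Cat1 only)
dx <- data.frame(
  ID   = c("A0001", "A0001", "A0013", "A0014", "A0020", "A0020"),
  Cat1 = c(358, 279, 367, 337, 367, 339),
  stringsAsFactors = FALSE
)

# Flag a row if its ID was seen earlier (forward scan)
# or will be seen later (scan from the end)
is_dup <- duplicated(dx$ID) | duplicated(dx$ID, fromLast = TRUE)

# Logical row index; the trailing comma keeps all columns
res <- dx[is_dup, ]
```

The forward scan misses the first row of each repeated ID and the backward scan misses the last one, so the two need to be combined with | to keep every row of a repeated group.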
General comments.
The ave approach is the only one here that preserves the data's initial row ordering.
The by approach should be very slow.
I suspect that data.table and dplyr are not much faster than ave and tapply (yet) at selecting groups. Benchmarks to prove me wrong are welcome!
base R (thanks to @thelatemail for both of the first two approaches)
1) Each row is assigned the length of its df$ID group, and we filter based on the vector of lengths.
df[ ave(1:nrow(df), df$ID, FUN=length) > 1 , ]
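To make the "vector of lengths" concrete, here is a small sketch that stores the intermediate result before filtering. The df below is a cut-down copy of the question's data, and group_size and res are names I picked for illustration.

```r
# Cut-down copy of the question's data (ID and Cat1 only)
df <- data.frame(
  ID   = c("A0001", "A0001", "A0013", "A0014", "A0020", "A0020"),
  Cat1 = c(358, 279, 367, 337, 367, 339),
  stringsAsFactors = FALSE
)

# ave() returns one value per row: the size of that row's ID group
group_size <- ave(seq_len(nrow(df)), df$ID, FUN = length)

# Rows stay in their original order because we only subset with a logical mask
res <- df[group_size > 1, ]
```

Here group_size lines up with the rows of df, so the comparison group_size > 1 is a plain logical mask over the original rows.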
2) Alternately, we split row names or numbers by df$ID, selecting which groups' rows to keep. tapply returns a list of groups of rows, so we must unlist them into a single vector of rows.
df[ unlist(tapply(1:nrow(df), df$ID, function(x) if (length(x) > 1) x)) , ]
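The same idea, unrolled into separate steps so the list returned by tapply can be inspected. This is only a sketch on a cut-down copy of the question's data; rows_by_id, keep_rows and res are hypothetical names.

```r
# Cut-down copy of the question's data (ID and Cat1 only)
df <- data.frame(
  ID   = c("A0001", "A0001", "A0013", "A0014", "A0020", "A0020"),
  Cat1 = c(358, 279, 367, 337, 367, 339),
  stringsAsFactors = FALSE
)

# One list element per ID: its row numbers, or NULL for single-row groups
rows_by_id <- tapply(seq_len(nrow(df)), df$ID,
                     function(x) if (length(x) > 1) x)

# unlist() silently drops the NULL entries and flattens the rest
keep_rows <- unname(unlist(rows_by_id))

res <- df[keep_rows, ]
```

Note that the groups come back in the sorted order of the ID values, so on data that is not already sorted by ID the result may be reordered.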
What follows is a worse approach, but it better parallels what you see with data.table and dplyr:
3) The data is split by df$ID, keeping each subset of data, SD, if it has more than one row. by returns a list, so we must rbind them back together.
do.call( rbind, c(list(make.row.names = FALSE),
by(df, df$ID, FUN=function(SD) if (nrow(SD) > 1) SD )))
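If the by call is hard to read, the same split, filter and rbind sequence can be sketched with split and Filter instead. This is a swapped-in variant for illustration, not the code from the answer above, and pieces, kept and res are made-up names.

```r
# Cut-down copy of the question's data (ID and Cat1 only)
df <- data.frame(
  ID   = c("A0001", "A0001", "A0013", "A0014", "A0020", "A0020"),
  Cat1 = c(358, 279, 367, 337, 367, 339),
  stringsAsFactors = FALSE
)

# Split into a list of per-ID data frames
pieces <- split(df, df$ID)

# Keep only the subsets with more than one row
kept <- Filter(function(SD) nrow(SD) > 1, pieces)

# Bind the surviving subsets back into one data frame
res <- do.call(rbind, c(kept, list(make.row.names = FALSE)))
```

Like the by version, this builds a separate data frame for every ID, which is likely why this style is slower than filtering with a row index.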
data.table
.N corresponds to nrow within a by=ID group, and .SD is the subset of data.
library(data.table)
setDT(df)[, if (.N>1) .SD, by=ID]
# ID Cat1 Cat2 Cat3 Cat4
# 1: A0001 358 11.2500 37428 0
# 2: A0001 279 14.6875 38605 0
# 3: A0020 367 8.8750 37797 0
# 4: A0020 339 9.6250 39324 0
dplyr
n() corresponds to nrow within a group_by(ID) group.
library(dplyr)
df %>% group_by(ID) %>% filter( n() > 1 )
# Source: local data frame [4 x 5]
# Groups: ID
#
# ID Cat1 Cat2 Cat3 Cat4
# 1 A0001 358 11.2500 37428 0
# 2 A0001 279 14.6875 38605 0
# 3 A0020 367 8.8750 37797 0
# 4 A0020 339 9.6250 39324 0
I know this is an old question, but I ran into the same problem and found this solution the simplest:
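# duplicated() alone drops the first row of each repeated ID; the comma selects rows, not columns
data <- data[duplicated(data$ID) | duplicated(data$ID, fromLast = TRUE), ]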