
How to remove unique entries and keep duplicates in R

ID     Cat1  Cat2    Cat3   Cat4
A0001   358 11.25   37428   0
A0001   279 14.6875 38605   0
A0013   367 5.125   40152   1
A0014   337 16.3125 38624   0
A0020   367 8.875   37797   0
A0020   339 9.625   39324   0

I need help learning how to remove the unique rows in my file while keeping the duplicates or triplicates. For example, the output should look like this:

ID     Cat1  Cat2    Cat3   Cat4
A0001   358 11.25   37428   0
A0001   279 14.6875 38605   0
A0020   367 8.875   37797   0
A0020   339 9.625   39324   0

If you can give me advice on how to approach this problem, it would be much appreciated.

Thanks for everyone's suggestions. I want to calculate the difference in value in the different categories (i.e. Cat2, Cat3) between the repeated measures (by unique ID). I would appreciate any suggestions.

Another option in base R, using duplicated:

dx[dx$ID %in% dx$ID[duplicated(dx$ID)],]

#      ID Cat1    Cat2  Cat3 Cat4
# 1 A0001  358 11.2500 37428    0
# 2 A0001  279 14.6875 38605    0
# 5 A0020  367  8.8750 37797    0
# 6 A0020  339  9.6250 39324    0
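
To see why the %in% step is needed, here is a small sketch that rebuilds the question's table as dx. The data.frame() call and the helper names is_repeat, dup_ids and kept are my own reconstruction for illustration, not code from the original post. duplicated() on its own flags only the second and later occurrences of an ID, so the flagged IDs are looked up again to recover the first occurrences as well.

```r
# Reconstruction of the question's table (column types are assumed).
dx <- data.frame(
  ID   = c("A0001", "A0001", "A0013", "A0014", "A0020", "A0020"),
  Cat1 = c(358, 279, 367, 337, 367, 339),
  Cat2 = c(11.25, 14.6875, 5.125, 16.3125, 8.875, 9.625),
  Cat3 = c(37428, 38605, 40152, 38624, 37797, 39324),
  Cat4 = c(0, 0, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# duplicated() marks only repeats, never the first occurrence of an ID.
is_repeat <- duplicated(dx$ID)

# IDs that occur more than once.
dup_ids <- unique(dx$ID[is_repeat])

# Keep every row whose ID is in that set, first occurrences included.
kept <- dx[dx$ID %in% dup_ids, ]
```

Indexing with is_repeat directly would return only rows 2 and 6, which is why the one-liner above goes through %in%.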

data.table, using duplicated

Using duplicated together with its fromLast variant, you get:

library(data.table)
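setkey(setDT(dx),ID) # or with data.table 1.9.5+: setDT(dx,key="ID")
# by="ID" is spelled out because newer data.table versions compare all columns by default
dx[duplicated(dx, by="ID") | duplicated(dx, by="ID", fromLast=TRUE)]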

#       ID Cat1    Cat2  Cat3 Cat4
# 1: A0001  358 11.2500 37428    0
# 2: A0001  279 14.6875 38605    0
# 3: A0020  367  8.8750 37797    0
# 4: A0020  339  9.6250 39324    0

This can also be done in base R, but I prefer data.table here for its syntactic sugar.
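
A minimal sketch of what the base R counterpart of the fromLast idea might look like. The toy data frame and the names in_dup_group and res are hypothetical, chosen so that one ID appears three times.

```r
# Hypothetical data: "a" is triplicated, "c" is duplicated, "b" is unique.
df <- data.frame(ID = c("a", "b", "a", "c", "a", "c"),
                 value = 1:6,
                 stringsAsFactors = FALSE)

# A forward scan misses the first occurrence of each ID;
# a backward scan (fromLast = TRUE) misses the last one.
# Their union flags every row that belongs to a repeated ID.
in_dup_group <- duplicated(df$ID) | duplicated(df$ID, fromLast = TRUE)

res <- df[in_dup_group, ]
```

Only the row with ID "b" is dropped, and the remaining rows keep their original order.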

General comments.

  • The ave approach is the only one here that preserves the data's initial row ordering.
  • The by approach should be very slow. I suspect that data.table and dplyr are not (yet) much faster than ave and tapply at selecting groups. Benchmarks to prove me wrong are welcome!
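
For anyone who wants to take up the benchmark invitation, here is a rough base R timing sketch for the ave and tapply approaches. The synthetic data, its sizes, the seed and the wrapper names keep_ave and keep_tapply are all assumptions made for illustration; a single system.time() run is noisy, so treat the numbers as indicative only.

```r
set.seed(1)  # arbitrary seed, only to make the synthetic data repeatable

# Synthetic data: 50,000 rows drawn from 30,000 possible IDs (assumed sizes).
n  <- 5e4
df <- data.frame(ID = sample(3e4, n, replace = TRUE), x = rnorm(n))

# Wrapper around the ave approach.
keep_ave <- function(d) {
  d[ave(seq_len(nrow(d)), d$ID, FUN = length) > 1, ]
}

# Wrapper around the tapply approach.
keep_tapply <- function(d) {
  d[unlist(tapply(seq_len(nrow(d)), d$ID,
                  function(i) if (length(i) > 1) i)), ]
}

t_ave    <- system.time(r_ave    <- keep_ave(df))
t_tapply <- system.time(r_tapply <- keep_tapply(df))

# Both should select the same rows, possibly in a different order.
same_rows <- setequal(rownames(r_ave), rownames(r_tapply))

# Elapsed seconds for each approach.
print(c(ave = t_ave[["elapsed"]], tapply = t_tapply[["elapsed"]]))
```

The same wrappers could be extended with the by, data.table and dplyr variants if those packages are installed.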

base R (thanks to @thelatemail for the first two approaches)

1) Each row is assigned the length of its df$ID group, and we filter based on the vector of lengths.

df[ ave(1:nrow(df), df$ID, FUN=length) > 1 , ]

2) Alternatively, we split the row numbers by df$ID and select which groups' rows to keep. tapply returns a list of groups of rows, so we must unlist them into a single vector of rows.

df[ unlist(tapply(1:nrow(df), df$ID, function(x) if (length(x) > 1) x)) , ]
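
The difference in row ordering between approaches 1 and 2 is easier to see on unsorted input. The sketch below uses a hypothetical five-row data frame whose IDs are deliberately out of order; the names rows, by_ave and by_tapply are illustrative.

```r
# Hypothetical data with IDs in non-sorted order.
df <- data.frame(ID = c("B", "A", "B", "C", "A"),
                 value = c(10, 20, 30, 40, 50),
                 stringsAsFactors = FALSE)
rows <- seq_len(nrow(df))

# Approach 1: logical filter, so rows stay in their original order.
by_ave <- df[ave(rows, df$ID, FUN = length) > 1, ]

# Approach 2: row numbers come back grouped by sorted ID.
by_tapply <- df[unlist(tapply(rows, df$ID,
                              function(x) if (length(x) > 1) x)), ]

by_ave$ID     # "B" "A" "B" "A"
by_tapply$ID  # "A" "A" "B" "B"
```

Both results contain the same four rows; only the order differs, which matters if later steps rely on the original sequence.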

What follows is a worse approach, but it better parallels what you see with data.table and dplyr:

3) The data is split by df$ID, keeping each subset of data, SD, if it has more than one row. by returns a list, so we must rbind the pieces back together.

do.call( rbind, c(list(make.row.names = FALSE),
    by(df, df$ID, FUN=function(SD) if (nrow(SD) > 1) SD )))

data.table: .N corresponds to nrow within a by=ID group, and .SD is the subset of data.

library(data.table)
setDT(df)[, if (.N>1) .SD, by=ID]

#       ID Cat1    Cat2  Cat3 Cat4
# 1: A0001  358 11.2500 37428    0
# 2: A0001  279 14.6875 38605    0
# 3: A0020  367  8.8750 37797    0
# 4: A0020  339  9.6250 39324    0

dplyr: n() corresponds to nrow within a group_by(ID) group.

library(dplyr)
df %>% group_by(ID) %>% filter( n() > 1 )

# Source: local data frame [4 x 5]
# Groups: ID
# 
#      ID Cat1    Cat2  Cat3 Cat4
# 1 A0001  358 11.2500 37428    0
# 2 A0001  279 14.6875 38605    0
# 3 A0020  367  8.8750 37797    0
# 4 A0020  339  9.6250 39324    0
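
For the follow-up in the question (the difference in Cat2 and Cat3 between repeated measures of the same ID), one possible base R sketch is shown below. The data frame is a reconstruction of the question's table, the names dups and diffs are illustrative, and the sign convention (later row minus earlier row) is an assumption about what is wanted.

```r
# Reconstruction of the question's table (column types are assumed).
df <- data.frame(
  ID   = c("A0001", "A0001", "A0013", "A0014", "A0020", "A0020"),
  Cat1 = c(358, 279, 367, 337, 367, 339),
  Cat2 = c(11.25, 14.6875, 5.125, 16.3125, 8.875, 9.625),
  Cat3 = c(37428, 38605, 40152, 38624, 37797, 39324),
  Cat4 = c(0, 0, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Keep only IDs with more than one row (the ave approach from above).
dups <- df[ave(seq_len(nrow(df)), df$ID, FUN = length) > 1, ]

# diff() within each ID: later measurement minus earlier one, in row order.
diffs <- aggregate(cbind(Cat2, Cat3) ~ ID, data = dups, FUN = diff)
diffs
#      ID   Cat2 Cat3
# 1 A0001 3.4375 1177
# 2 A0020 0.7500 1527
```

With exactly two rows per ID this yields one difference per column. For triplicates diff() returns two values per ID, so the result columns become matrices and may need reshaping.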

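I know this is an old question, but I ran into the same problem and found this solution the simplest:

# duplicated() alone drops the first occurrence of each ID, and the trailing comma is needed to select rows
data <- data[duplicated(data$ID) | duplicated(data$ID, fromLast = TRUE), ]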
