Select rows such that multiple columns have no duplicated values

I have a data frame with these values (built as follows):

id1 = (c(1,1,2,2))
id2 = (c(10,11,10,11))
value =c(50,50,50,50)
df = data.frame(id1,id2,value)

df:
  id1 id2 value
1   1  10    50
2   1  11    50
3   2  10    50
4   2  11    50

I would like to keep only rows where both id1 and id2 are unique (each value of id1 and id2 must appear only once). There might also be more than one duplicate of each id:

df_unique:
  id1 id2 value
1   1  10    50
4   2  11    50

If I use the duplicated command on one of the columns and then on the other, I would discard wanted rows.
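A minimal sketch of that problem on the four-row example above; the names step1 and step2 are only illustrative:

```r
id1 <- c(1, 1, 2, 2)
id2 <- c(10, 11, 10, 11)
value <- c(50, 50, 50, 50)
df <- data.frame(id1, id2, value)

# Drop repeated id1 values first: keeps rows 1 and 3, i.e. (1,10) and (2,10)
step1 <- df[!duplicated(df$id1), ]

# Then drop repeated id2 values: both remaining rows have id2 = 10,
# so only row 1 survives and the valid pair (2,11) is lost
step2 <- step1[!duplicated(step1$id2), ]
step2
```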

A solution that returns (1,11) and (2,10) is also good, as long as each element in id1 and id2 is unique.

Another example with more rows:

id1 = (c(1,1,1,2,2,2,3,3,3))
id2 = (c(10,11,12,10,11,12,10,11,12))
value =rep(50,9)
df = data.frame(id1,id2,value)

df:
  id1 id2 value
1   1  10    50
2   1  11    50
3   1  12    50
4   2  10    50
5   2  11    50
6   2  12    50
7   3  10    50
8   3  11    50
9   3  12    50

One good answer is (1,10), (2,11), (3,12), but any other answer where each value of id1 and id2 appears only once is also good.

Thank you,

Jacob

If you know that the data are arranged as in your example, cycling through id2 for each value of id1 and in the same order, the solution is easy:

N <- 3 # Number of rows in the result
idx <- seq(1, N*N, by=N) + seq(0,to=N-1)
df[idx,]
##   id1 id2 value
## 1   1  10    50
## 5   2  11    50
## 9   3  12    50

I doubt that this is what you're asking. If the rows are in an unknown order, or not every value of one column occurs with every value of the other, you have to check each combination of N rows.

# Maximum number of result rows
N <- with(df, min(length(unique(id1)), length(unique(id2))))
N
## [1] 3

# Potential indices
index <- combn(seq(nrow(df)), N)

index is a matrix where each column holds the numbers of three rows in df. Now check for duplicated values:

good <- apply(index, 2, function(x) !any(duplicated(df[x,'id1']) | duplicated(df[x,'id2'])))

good has the value TRUE for each combination of rows that passes the test.

which(good)
## [1] 22 24 39 44 53 56
index[, good]
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    1    2    2    3    3
## [2,]    5    6    4    6    4    5
## [3,]    9    8    9    7    8    7

Each column of the above matrix represents a combination of rows that passes the test.
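To see the actual rows rather than their indices, each passing column can be used to subset df. A sketch that rebuilds the objects from above so it runs on its own; solutions is a name introduced here, not part of the original answer:

```r
id1 <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
id2 <- c(10, 11, 12, 10, 11, 12, 10, 11, 12)
df <- data.frame(id1, id2, value = rep(50, 9))

N <- with(df, min(length(unique(id1)), length(unique(id2))))
index <- combn(seq(nrow(df)), N)
good <- apply(index, 2, function(x)
  !any(duplicated(df[x, 'id1']) | duplicated(df[x, 'id2'])))

# One data frame per passing combination (solutions is an illustrative name)
solutions <- lapply(which(good), function(j) df[index[, j], ])
length(solutions)
solutions[[1]]
```

Note that combn enumerates choose(nrow(df), N) candidates, which is 84 here but grows quickly, so this exhaustive check is probably only practical for small data frames.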

This finds all the combinations. You might want to find just the first combination, so that you don't go on to test additional combinations after a hit is found. Then a for loop is appropriate:

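rows <- NULL  # stays NULL if no combination passes the test
for (i in seq_len(ncol(index))) {
  x <- index[,i]
  # Accept the first combination with no repeated id1 and no repeated id2
  if (!any(duplicated(df[x,'id1']) | duplicated(df[x,'id2']))) {
    rows <- x
    break
  }
}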

df[rows,]
##   id1 id2 value
## 1   1  10    50
## 5   2  11    50
## 9   3  12    50

Note: Depending on the data, it is possible that with N=3 you will get no combination of rows that passes the test. In that case, repeat the procedure with N=2, and so on. I leave that loop as an exercise for the reader.
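One possible way to write that outer loop is sketched below. The helper name find_unique_rows, its cols argument, and the tricky example data are all assumptions made for illustration. It uses the same exhaustive search as above, just retried with smaller N until something passes:

```r
# Hypothetical helper, not part of the original answer
find_unique_rows <- function(df, cols = c("id1", "id2")) {
  if (nrow(df) == 0) return(df)
  # Upper bound on the number of rows in the result
  n_max <- min(vapply(df[cols], function(v) length(unique(v)), integer(1)))
  for (n in seq(n_max, 1)) {
    # simplify = FALSE returns a list with one index vector per combination
    for (x in combn(nrow(df), n, simplify = FALSE)) {
      has_dup <- vapply(cols, function(col) anyDuplicated(df[x, col]) > 0,
                        logical(1))
      if (!any(has_dup)) return(df[x, , drop = FALSE])
    }
  }
  df[0, , drop = FALSE]  # not reached for non-empty data: any single row passes
}

# Made-up data where N = 3 cannot work: id1 values 2 and 3 both need id2 = 10
tricky <- data.frame(id1 = c(1, 1, 1, 2, 3),
                     id2 = c(10, 11, 12, 10, 10),
                     value = 50)
find_unique_rows(tricky)
```

On the tricky data no three-row combination passes, so the function falls back to a two-row result. Because it returns at the first hit for the largest workable N, it gives one valid answer rather than all of them.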
