Subsetting a dataset based on how often a value occurs in a column in R

Question

I have a dataset (ds) with two columns. Each value in "match" occurs either once or twice. "status" is a binary variable. Some values form pairs: for example, the value 12 in match appears twice, once with status 1 and once with status 0. However, some observations in match have no partner; in this dataset those are 3, 8, 33 and 17.

match     status
12          1 
3           1
5           0
8           1
33          0
5           1
12          0
17          0

What I want to do is create a new dataset that contains only the paired observations (that is, rows whose match value appears twice). In my example, it would be

match     status
12          1
12          0
5           0
5           1

The status variable in the final dataset would be split 50/50, because each value in match (for example 12) has one observation where status = 0 and one where status = 1. The actual dataset I'm working with has over 50k observations, so I cannot just search and filter by each number. What I tried is:

numbers <- table(ds$match)
numbers <- as.data.frame(numbers)
numbers <- numbers[numbers$Freq == 2,]
numbers <- numbers$Var1
ds$keep <- ifelse(numbers %in% ds$match, 1, 0)

Here I get the error "replacement has 23005 rows, data has 39021". If I could get around this error, I think I could just run

ds <- filter(ds, ds$keep == 1)

to get the dataset that I want. This was my most promising approach. I tried a few other things, but it always came down to the fact that the status variable wasn't 50/50, so I couldn't manage to exclude all observations without a pair. Does someone have an idea how I could fix my code, or is there a solution that would be quicker/smoother? Thanks in advance for any help!

Answer 1

library(dplyr)

ds %>% group_by(match) %>% filter(n()>1) %>% arrange(match,status)

  match status
  <dbl>  <dbl>
1     5      0
2     5      1
3    12      0
4    12      1
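
If dplyr is not available, the same group-size filter can probably be sketched in base R with ave(). This is only a sketch: it assumes ds is a plain data.frame rebuilt from the example in the question, and the names group_size and pairs are made up for illustration.

```r
# Example data reconstructed from the question (assumed to be a plain data.frame)
ds <- data.frame(
  match  = c(12, 3, 5, 8, 33, 5, 12, 17),
  status = c(1, 1, 0, 1, 0, 1, 0, 0)
)

# For every row, ave() returns the number of rows sharing that row's match value
group_size <- ave(seq_along(ds$match), ds$match, FUN = length)

# Keep rows whose match value occurs more than once
pairs <- ds[group_size > 1, ]

# Sort by match and status, like arrange(match, status) above
pairs <- pairs[order(pairs$match, pairs$status), ]
```

Running table(pairs$status) afterwards is a quick way to check that the result really is split 50/50 between status 0 and 1.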

You can also do something like this:

ds <- ds[order(ds$match),]
id = rle(ds$match)
ds[ds$match %in% id$values[id$lengths>1],]

  match status
  <dbl>  <dbl>
1     5      0
2     5      1
3    12      1
4    12      0
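
As for the code in the question: the error most likely comes from the operands of %in% being the wrong way round. numbers %in% ds$match returns one value per element of numbers, not one per row of ds, so it cannot be assigned as a column. A minimal sketch of the fix, assuming the same table() approach and the example data from the question (the name paired is made up for illustration):

```r
# Example data reconstructed from the question (assumed to be a plain data.frame)
ds <- data.frame(
  match  = c(12, 3, 5, 8, 33, 5, 12, 17),
  status = c(1, 1, 0, 1, 0, 1, 0, 0)
)

# Frequency table as a data frame with columns Var1 (factor) and Freq
numbers <- as.data.frame(table(ds$match))

# match values that occur exactly twice
numbers <- numbers$Var1[numbers$Freq == 2]

# Test every row of ds against those values, so the result has nrow(ds) elements
ds$keep <- ifelse(ds$match %in% numbers, 1, 0)

# Base R equivalent of filter(ds, keep == 1)
paired <- ds[ds$keep == 1, ]
```

Here %in% compares the factor levels and the numeric column as character strings, which works for whole-number IDs like the ones in the example.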

Answer 1
1 accepted 2022-03-11 19:08:11