Select observations by country for which two years are available

Question

I have a dataset as follows:

library(data.table)

DT <- fread(
"ID country year Event_A Event_B
4   BEL   2002  0   1
5   BEL   2002  0   1
6   NLD   2002  1   1
7   NLD   2006  1   0
8   NLD   2006  1   1
9   GBR   2001  0   1
10  GBR   2001  0   0
11  GBR   2001  0   1
12  GBR   2007  1   1
13  GBR   2007  1   1",
header = TRUE)

I would like to keep only the countries for which I have observations in at least two different years. So BEL will drop out, because it only has observations in 2002.

I would like to do something like DT[,if(unique(year)>1) .SD, by=country], but that does not do anything. I also tried DT[unique(year)>1, .SD, by=country], but this gives the error:

Error in `[.data.table`(DT, unique(year) > 1, .SD, by = country) : 
  i evaluates to a logical vector length 4 but there are 10 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.

Desired output:

DT <- fread(
"ID country year Event_A Event_B
6   NLD   2002  1   1
7   NLD   2006  1   0
8   NLD   2006  1   1
9   GBR   2001  0   1
10  GBR   2001  0   0
11  GBR   2001  0   1
12  GBR   2007  1   1
13  GBR   2007  1   1",
header = TRUE)

Answer 1

If it's not necessary to do it in data.table, you can count the number of distinct years per country in base R and keep the countries with more than one:

country_count <- aggregate(year ~ country, DT, FUN = function(x) NROW(unique(x)))
DT[DT$country %in% country_count$country[country_count$year > 1],]
# output
   ID country year Event_A Event_B
3   6     NLD 2002       1       1
4   7     NLD 2006       1       0
5   8     NLD 2006       1       1
6   9     GBR 2001       0       1
7  10     GBR 2001       0       0
8  11     GBR 2001       0       1
9  12     GBR 2007       1       1
10 13     GBR 2007       1       1
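
To make the intermediate step visible, here is a minimal base R sketch of the same idea. It assumes the example data is rebuilt as a plain data.frame with read.table(); the names df, country_count and result are only illustrative, and length(unique(x)) is swapped in for NROW(unique(x)), which gives the same count on a vector.

```r
# Rebuild the example data as a plain data.frame (base R only).
df <- read.table(header = TRUE, text = "
ID country year Event_A Event_B
4   BEL   2002  0   1
5   BEL   2002  0   1
6   NLD   2002  1   1
7   NLD   2006  1   0
8   NLD   2006  1   1
9   GBR   2001  0   1
10  GBR   2001  0   0
11  GBR   2001  0   1
12  GBR   2007  1   1
13  GBR   2007  1   1
")

# One row per country; the 'year' column now holds the number of distinct years.
country_count <- aggregate(year ~ country, df,
                           FUN = function(x) length(unique(x)))
print(country_count)

# Keep only rows whose country was observed in more than one year.
result <- df[df$country %in% country_count$country[country_count$year > 1], ]
print(result)
```

Printing country_count first is a quick way to check which countries will be dropped before filtering.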

Answer 2

In the same spirit as @user2474226, if you're open to other packages, here is a simple dplyr solution:

 library(data.table)
 library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#> 
#>     between, first, last
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

  DT <- fread(
    "ID country year Event_A Event_B
4   BEL   2002  0   1
5   BEL   2002  0   1
6   NLD   2002  1   1
7   NLD   2006  1   0
8   NLD   2006  1   1
9   GBR   2001  0   1
10  GBR   2001  0   0
11  GBR   2001  0   1
12  GBR   2007  1   1
13  GBR   2007  1   1",
header = TRUE)

  # Count distinct years per country and keep countries with more than one
  sel_cnt <-
    DT %>%
    count(country, year) %>%
    count(country) %>%
    filter(n > 1)


  DT %>%
    semi_join(sel_cnt, by = "country")
#>   ID country year Event_A Event_B
#> 1  6     NLD 2002       1       1
#> 2  7     NLD 2006       1       0
#> 3  8     NLD 2006       1       1
#> 4  9     GBR 2001       0       1
#> 5 10     GBR 2001       0       0
#> 6 11     GBR 2001       0       1
#> 7 12     GBR 2007       1       1
#> 8 13     GBR 2007       1       1

Answer 3

Here is a base R solution using ave() and subset():

DTout <- subset(DT, as.logical(ave(DT$year,DT$country, FUN = function(x) length(unique(x))>=2)))

which gives

> DTout
   ID country year Event_A Event_B
3   6     NLD 2002       1       1
4   7     NLD 2006       1       0
5   8     NLD 2006       1       1
6   9     GBR 2001       0       1
7  10     GBR 2001       0       0
8  11     GBR 2001       0       1
9  12     GBR 2007       1       1
10 13     GBR 2007       1       1
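
The one-liner above relies on ave() returning one value per row rather than one per group. A small sketch of that step on its own, assuming the example data as a plain data.frame (df, n_years and out are illustrative names):

```r
# Rebuild the example data as a plain data.frame (base R only).
df <- read.table(header = TRUE, text = "
ID country year Event_A Event_B
4   BEL   2002  0   1
5   BEL   2002  0   1
6   NLD   2002  1   1
7   NLD   2006  1   0
8   NLD   2006  1   1
9   GBR   2001  0   1
10  GBR   2001  0   0
11  GBR   2001  0   1
12  GBR   2007  1   1
13  GBR   2007  1   1
")

# ave() applies FUN within each country and returns a vector aligned
# with the rows of df, so every row carries its country's year count.
n_years <- ave(df$year, df$country, FUN = function(x) length(unique(x)))
print(n_years)

# The per-row vector can be used directly as a row filter.
out <- df[n_years >= 2, ]
print(out)
```

Because the result is row-aligned, no join or %in% lookup is needed to get back to the original rows.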

Answer 4

You can use uniqueN to count the unique values of year within each country and select the rows using .SD.

library(data.table)
DT[, .SD[uniqueN(year) > 1], country]

#   country ID year Event_A Event_B
#1:     NLD  6 2002       1       1
#2:     NLD  7 2006       1       0
#3:     NLD  8 2006       1       1
#4:     GBR  9 2001       0       1
#5:     GBR 10 2001       0       0
#6:     GBR 11 2001       0       1
#7:     GBR 12 2007       1       1
#8:     GBR 13 2007       1       1

Or in dplyr we can do the same with n_distinct and filter:

library(dplyr)
DT %>% group_by(country) %>% filter(n_distinct(year) > 1)
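
As a side note on the attempt in the question, my reading is that unique(year) > 1 compares the year values themselves with 1, which is TRUE for every year in this data, so no group is ever filtered out. For groups with several years the condition also has length greater than one, which if() rejects with an error from R 4.2.0 onwards. A sketch of the if (...) .SD form with uniqueN swapped in, assuming the question's data (res is an illustrative name):

```r
library(data.table)

DT <- fread("ID country year Event_A Event_B
4   BEL   2002  0   1
5   BEL   2002  0   1
6   NLD   2002  1   1
7   NLD   2006  1   0
8   NLD   2006  1   1
9   GBR   2001  0   1
10  GBR   2001  0   0
11  GBR   2001  0   1
12  GBR   2007  1   1
13  GBR   2007  1   1",
header = TRUE)

# uniqueN(year) is a single number per group: how many distinct years it has.
print(DT[, .(n_years = uniqueN(year)), by = country])

# A group that fails the test returns NULL and is dropped;
# a group that passes returns all of its rows via .SD.
res <- DT[, if (uniqueN(year) > 1) .SD, by = country]
print(res)
```

This keeps or drops each country as a whole, which is the shape the question was originally going for.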
