Selecting observations for which two years are available by country

I have a dataset as follows:

library(data.table)

DT <- fread(
"ID country year Event_A Event_B
4   BEL   2002  0   1
5   BEL   2002  0   1
6   NLD   2002  1   1
7   NLD   2006  1   0
8   NLD   2006  1   1
9   GBR   2001  0   1
10  GBR   2001  0   0
11  GBR   2001  0   1
12  GBR   2007  1   1
13  GBR   2007  1   1",
header = TRUE)

I would like to keep only the countries that have observations in at least two different years. So, BEL will drop out because it only has observations in 2002.

I would like to do something like DT[, if (unique(year) > 1) .SD, by = country], but that does not do anything. I also tried DT[unique(year) > 1, .SD, by = country], but this gives the error:

Error in `[.data.table`(DT, unique(year) > 1, .SD, by = country) : 
  i evaluates to a logical vector length 4 but there are 10 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.

Desired output:

DT <- fread(
"ID country year Event_A Event_B
6   NLD   2002  1   1
7   NLD   2006  1   0
8   NLD   2006  1   1
9   GBR   2001  0   1
10  GBR   2001  0   0
11  GBR   2001  0   1
12  GBR   2007  1   1
13  GBR   2007  1   1",
header = TRUE)

If it's not necessary to do it in data.table, you can count the number of distinct years per country in base R and keep the countries with more than one:

country_count <- aggregate(year ~ country, DT, FUN = function(x) NROW(unique(x)))
DT[DT$country %in% country_count$country[country_count$year > 1],]
# output
   ID country year Event_A Event_B
3   6     NLD 2002       1       1
4   7     NLD 2006       1       0
5   8     NLD 2006       1       1
6   9     GBR 2001       0       1
7  10     GBR 2001       0       0
8  11     GBR 2001       0       1
9  12     GBR 2007       1       1
10 13     GBR 2007       1       1

In the same spirit as @user2474226, if you're open to other packages, here is a simple dplyr solution:

 library(data.table)
 library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#> 
#>     between, first, last
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

  DT <- fread(
    "ID country year Event_A Event_B
4   BEL   2002  0   1
5   BEL   2002  0   1
6   NLD   2002  1   1
7   NLD   2006  1   0
8   NLD   2006  1   1
9   GBR   2001  0   1
10  GBR   2001  0   0
11  GBR   2001  0   1
12  GBR   2007  1   1
13  GBR   2007  1   1",
header = TRUE)

  # Count distinct years per country and keep countries with more than one
  sel_cnt <-
    DT %>%
    count(country, year) %>%
    count(country) %>%
    filter(n > 1)


  DT %>%
    semi_join(sel_cnt, by = "country")
#>   ID country year Event_A Event_B
#> 1  6     NLD 2002       1       1
#> 2  7     NLD 2006       1       0
#> 3  8     NLD 2006       1       1
#> 4  9     GBR 2001       0       1
#> 5 10     GBR 2001       0       0
#> 6 11     GBR 2001       0       1
#> 7 12     GBR 2007       1       1
#> 8 13     GBR 2007       1       1

Here is a base R solution using ave() and subset():

DTout <- subset(DT, as.logical(ave(DT$year,DT$country, FUN = function(x) length(unique(x))>=2)))

which gives

> DTout
   ID country year Event_A Event_B
3   6     NLD 2002       1       1
4   7     NLD 2006       1       0
5   8     NLD 2006       1       1
6   9     GBR 2001       0       1
7  10     GBR 2001       0       0
8  11     GBR 2001       0       1
9  12     GBR 2007       1       1
10 13     GBR 2007       1       1
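As a sketch of what happens inside that one-liner: ave() returns one value per row, with each group's result repeated across the rows of that group. Because the input is numeric, the logical result is stored as 0/1, which is why as.logical() is needed. The vectors below are a standalone copy of the question's year and country columns, and the name flag is illustrative.

```r
# Standalone copies of the two columns used by ave().
year    <- c(2002, 2002, 2002, 2006, 2006, 2001, 2001, 2001, 2007, 2007)
country <- c("BEL", "BEL", "NLD", "NLD", "NLD", "GBR", "GBR", "GBR", "GBR", "GBR")

# One value per row: 1 if the row's country has at least two distinct years, else 0.
flag <- ave(year, country, FUN = function(x) length(unique(x)) >= 2)

# Convert back to logical so it can be used as a row filter.
keep_row <- as.logical(flag)
```

Here keep_row is FALSE for the two BEL rows and TRUE for the remaining eight, so subset() drops only BEL.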

You can use uniqueN to get the count of unique values and select rows using .SD.

library(data.table)
DT[, .SD[uniqueN(year) > 1], country]

#   country ID year Event_A Event_B
#1:     NLD  6 2002       1       1
#2:     NLD  7 2006       1       0
#3:     NLD  8 2006       1       1
#4:     GBR  9 2001       0       1
#5:     GBR 10 2001       0       0
#6:     GBR 11 2001       0       1
#7:     GBR 12 2007       1       1
#8:     GBR 13 2007       1       1
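For context on why the question's attempts did not work, a minimal sketch follows. unique(year) returns the distinct year values themselves, so unique(year) > 1 compares years to 1 and is always TRUE; uniqueN(year) returns how many distinct years there are. In the second attempt the expression sits in i, which is evaluated on the whole table before grouping, so it yields 4 values for 10 rows, hence the error. The reduced table and the names counts and res are illustrative.

```r
library(data.table)

# Reduced copy of the question's data.
DT <- data.table(
  ID      = 4:13,
  country = c("BEL", "BEL", "NLD", "NLD", "NLD", "GBR", "GBR", "GBR", "GBR", "GBR"),
  year    = c(2002, 2002, 2002, 2006, 2006, 2001, 2001, 2001, 2007, 2007)
)

# uniqueN() counts distinct years per country.
counts <- DT[, .(n_years = uniqueN(year)), by = country]

# The `if` form from the question, with uniqueN() in place of unique():
# groups where the condition is FALSE return NULL and are dropped.
res <- DT[, if (uniqueN(year) > 1) .SD, by = country]
```

The `if (...) .SD` form and the `.SD[...]` form should select the same rows; which one reads better is largely a matter of taste.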

Or in dplyr, we can do the same with n_distinct and filter:

library(dplyr)
DT %>% group_by(country) %>% filter(n_distinct(year) > 1)
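One thing to be aware of: filter() on a grouped table returns a result that is still grouped by country. If later steps should not operate per country, a sketch like the following adds ungroup() at the end. The data.frame DF is a reduced copy of the question's data and the name res is illustrative.

```r
library(dplyr)

# Reduced copy of the question's data.
DF <- data.frame(
  ID      = 4:13,
  country = c("BEL", "BEL", "NLD", "NLD", "NLD", "GBR", "GBR", "GBR", "GBR", "GBR"),
  year    = c(2002, 2002, 2002, 2006, 2006, 2001, 2001, 2001, 2007, 2007)
)

res <- DF %>%
  group_by(country) %>%
  filter(n_distinct(year) > 1) %>%  # keep countries with more than one distinct year
  ungroup()                         # drop the grouping so later steps see one table
```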
