Selecting observations for which two years are available by country
I have a dataset as follows:
DT <- fread(
"ID country year Event_A Event_B
4 BEL 2002 0 1
5 BEL 2002 0 1
6 NLD 2002 1 1
7 NLD 2006 1 0
8 NLD 2006 1 1
9 GBR 2001 0 1
10 GBR 2001 0 0
11 GBR 2001 0 1
12 GBR 2007 1 1
13 GBR 2007 1 1",
header = TRUE)
I would like to keep only the countries for which I have observations in more than one year. So BEL will drop out, because it only has observations in 2002.
I would like to do something like DT[,if(unique(year)>1) .SD, by=country], but that does not do anything. I also tried DT[unique(year)>1, .SD, by=country], but this gives the error:
Error in `[.data.table`(DT, unique(year) > 1, .SD, by = country) :
i evaluates to a logical vector length 4 but there are 10 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
Desired output:
DT <- fread(
"ID country year Event_A Event_B
6 NLD 2002 1 1
7 NLD 2006 1 0
8 NLD 2006 1 1
9 GBR 2001 0 1
10 GBR 2001 0 0
11 GBR 2001 0 1
12 GBR 2007 1 1
13 GBR 2007 1 1",
header = TRUE)
If you don't need to do it in data.table, you can count the number of distinct years per country in base R and then keep the countries with more than one:
country_count <- aggregate(year ~ country, DT, FUN = function(x) NROW(unique(x)))
DT[DT$country %in% country_count$country[country_count$year > 1],]
# output
ID country year Event_A Event_B
3 6 NLD 2002 1 1
4 7 NLD 2006 1 0
5 8 NLD 2006 1 1
6 9 GBR 2001 0 1
7 10 GBR 2001 0 0
8 11 GBR 2001 0 1
9 12 GBR 2007 1 1
10 13 GBR 2007 1 1
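To make the two steps easier to follow, here is a small self-contained sketch of the same idea. It rebuilds the example data as a plain data.frame (an assumption, so that it runs without data.table), and the names DF, keep and result are only illustrative.

```r
# Example data rebuilt as a plain data.frame (same values as in the question)
DF <- data.frame(
  ID      = 4:13,
  country = c("BEL", "BEL", "NLD", "NLD", "NLD", "GBR", "GBR", "GBR", "GBR", "GBR"),
  year    = c(2002, 2002, 2002, 2006, 2006, 2001, 2001, 2001, 2007, 2007),
  Event_A = c(0, 0, 1, 1, 1, 0, 0, 0, 1, 1),
  Event_B = c(1, 1, 1, 0, 1, 1, 0, 1, 1, 1)
)

# Step 1: one row per country; the year column now holds the
# number of distinct years observed for that country
country_count <- aggregate(year ~ country, DF,
                           FUN = function(x) length(unique(x)))

# Step 2: countries observed in more than one distinct year
keep <- country_count$country[country_count$year > 1]

# Step 3: keep only the rows that belong to those countries
result <- DF[DF$country %in% keep, ]
```

Splitting it up this way lets you inspect country_count before filtering, which is handy if the threshold ever changes.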
In the same spirit as @user2474226, if you're open to other packages, here is a simple dplyr solution:
library(data.table)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
DT <- fread(
"ID country year Event_A Event_B
4 BEL 2002 0 1
5 BEL 2002 0 1
6 NLD 2002 1 1
7 NLD 2006 1 0
8 NLD 2006 1 1
9 GBR 2001 0 1
10 GBR 2001 0 0
11 GBR 2001 0 1
12 GBR 2007 1 1
13 GBR 2007 1 1",
header = TRUE)
# Countries that appear in more than one distinct year
sel_cnt <-
DT %>%
count(country, year) %>%
count(country) %>%
filter(n > 1)
DT %>%
semi_join(sel_cnt, by = "country")
#> ID country year Event_A Event_B
#> 1 6 NLD 2002 1 1
#> 2 7 NLD 2006 1 0
#> 3 8 NLD 2006 1 1
#> 4 9 GBR 2001 0 1
#> 5 10 GBR 2001 0 0
#> 6 11 GBR 2001 0 1
#> 7 12 GBR 2007 1 1
#> 8 13 GBR 2007 1 1
Here is a base R solution using ave() and subset():
DTout <- subset(DT, as.logical(ave(DT$year,DT$country, FUN = function(x) length(unique(x))>=2)))
such that
> DTout
ID country year Event_A Event_B
3 6 NLD 2002 1 1
4 7 NLD 2006 1 0
5 8 NLD 2006 1 1
6 9 GBR 2001 0 1
7 10 GBR 2001 0 0
8 11 GBR 2001 0 1
9 12 GBR 2007 1 1
10 13 GBR 2007 1 1
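If the ave() call looks opaque, the following sketch shows the intermediate vector it produces. It uses only the country and year values from the question; the names flag and keep_row are illustrative.

```r
country <- c("BEL", "BEL", "NLD", "NLD", "NLD", "GBR", "GBR", "GBR", "GBR", "GBR")
year    <- c(2002, 2002, 2002, 2006, 2006, 2001, 2001, 2001, 2007, 2007)

# ave() applies FUN within each country and returns a vector as long as
# the input, so every row carries its own group's result. Because year
# is numeric, the TRUE/FALSE results are stored as 1/0.
flag <- ave(year, country, FUN = function(x) length(unique(x)) >= 2)

# as.logical() turns the 1/0 values back into a row filter for subset()
keep_row <- as.logical(flag)
```

Here the two BEL rows get FALSE and the other eight rows get TRUE, which is exactly the condition that subset() receives.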
You can use uniqueN to get the count of unique values and select rows using .SD.
library(data.table)
DT[, .SD[uniqueN(year) > 1], country]
# country ID year Event_A Event_B
#1: NLD 6 2002 1 1
#2: NLD 7 2006 1 0
#3: NLD 8 2006 1 1
#4: GBR 9 2001 0 1
#5: GBR 10 2001 0 0
#6: GBR 11 2001 0 1
#7: GBR 12 2007 1 1
#8: GBR 13 2007 1 1
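For what it's worth, the first attempt in the question was close. unique(year) returns the distinct years themselves rather than how many there are, and every year is greater than 1, so that condition can never filter a group out (on R 4.2.0 and later, a condition of length two in if() is an error). Below is a sketch of the same if-pattern with uniqueN(), plus a variant that subsets by row index via .I, which keeps the original column order. The names res_if, idx and res_idx are only illustrative.

```r
library(data.table)

DT <- fread(
"ID country year Event_A Event_B
4 BEL 2002 0 1
5 BEL 2002 0 1
6 NLD 2002 1 1
7 NLD 2006 1 0
8 NLD 2006 1 1
9 GBR 2001 0 1
10 GBR 2001 0 0
11 GBR 2001 0 1
12 GBR 2007 1 1
13 GBR 2007 1 1",
header = TRUE)

# Return .SD only for countries with more than one distinct year;
# groups where the condition is FALSE yield NULL and are dropped
res_if <- DT[, if (uniqueN(year) > 1) .SD, by = country]

# Same filter via row indices: .I holds each group's row numbers in DT,
# and a FALSE condition selects none of them
idx <- DT[, .I[uniqueN(year) > 1], by = country]$V1
res_idx <- DT[idx]
```

Both variants keep the same eight rows; the .I version avoids materialising .SD for every group, which may matter on large tables.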
Or in dplyr we can do the same with n_distinct and filter:
library(dplyr)
DT %>% group_by(country) %>% filter(n_distinct(year) > 1)
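One thing to keep in mind is that the result of this pipeline is still grouped by country. A minimal sketch, assuming you want a plain ungrouped result for later steps; the data is rebuilt as a data.frame so the example is self-contained, and DF and res are illustrative names.

```r
library(dplyr)

# Example data rebuilt as a plain data.frame (same values as in the question)
DF <- data.frame(
  ID      = 4:13,
  country = c("BEL", "BEL", "NLD", "NLD", "NLD", "GBR", "GBR", "GBR", "GBR", "GBR"),
  year    = c(2002, 2002, 2002, 2006, 2006, 2001, 2001, 2001, 2007, 2007),
  Event_A = c(0, 0, 1, 1, 1, 0, 0, 0, 1, 1),
  Event_B = c(1, 1, 1, 0, 1, 1, 0, 1, 1, 1)
)

res <- DF %>%
  group_by(country) %>%
  filter(n_distinct(year) > 1) %>%  # keep countries seen in more than one year
  ungroup()                         # drop the grouping so later verbs act on all rows
```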