I have a fairly basic problem with subsetting in R, but as a newbie I don't know how to solve it properly. Here is an example of the panel data I have:
   idnr year  sales space municipality   pop
1     1 2004 110000  1095          136 71377
2     1 2005 110000  1095          136 71355
3     1 2006 110000  1095          136 71837
4     1 2007 120000  1095          136 72956
5     2 2004  35000   800          136 71377
6     3 2004  45000  1000          136 71377
7     3 2005  45000  1000         2584 23135
8     3 2006  45000  1000         2584 23258
9     3 2007  45000  1000         2584 23407
10    4 2005 180000  5000         2584 23254
11    4 2006 220000  5000         2584 23135
12    4 2007 250000  5000         2584 23258
My problem is that I want to subset the data using conditions for both year = 2004 AND (not OR) year = 2005. However, it doesn't seem to work. Code:
tab3 <- stores[stores$year==2004 & stores$year==2005, c("idnr","year")]
What I mean is that I need to select the entries that existed in both 2004 and 2005. Some entries existed in either 2004 or 2005 but not in both, and hence should be excluded. Using the data above as an example, this should be the output:
idnr year
1 2004
1 2005
3 2004
3 2005
Update:
I was hoping that akrun's method would also work for selecting entries that appeared ONLY in 2005, such that:
idnr year
4 2005
Unfortunately, it doesn't. Instead, it groups the idnr's that appeared in both 2004 and 2005 together with those that appeared only in 2005. Any ideas?
If you want to subset with either year == 2004 or year == 2005, you need to use the | operator instead of & in your current approach:
tab3 <- stores[stores$year == 2004 | stores$year == 2005, c("idnr", "year")]
Which results in:
#> tab3
# idnr year
#1 1 2004
#2 1 2005
#5 2 2004
#6 3 2004
#7 3 2005
#10 4 2005
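To see why the original & version returns nothing, it may help to compare the two conditions side by side. This is only a minimal sketch: the data frame below rebuilds just the idnr and year columns from the question, and the names both_rowwise and either are mine, not from the original post.

```r
# Only the two relevant columns of the question's data are rebuilt here
stores <- data.frame(
  idnr = c(1, 1, 1, 1, 2, 3, 3, 3, 3, 4, 4, 4),
  year = c(2004, 2005, 2006, 2007, 2004, 2004, 2005, 2006, 2007,
           2005, 2006, 2007)
)

# `&` is evaluated row by row: no single row can have year equal to
# 2004 and 2005 at the same time, so the condition is FALSE everywhere
both_rowwise <- stores[stores$year == 2004 & stores$year == 2005, ]
nrow(both_rowwise)  # 0

# `|` keeps every row whose year is 2004 or 2005
either <- stores[stores$year == 2004 | stores$year == 2005, ]
nrow(either)        # 6
```

Note that the | version keeps idnr 2 and idnr 4 as well, so it does not by itself restrict the result to stores present in both years.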
Or using dplyr:
library(dplyr)
tab3 <- stores %>% select(idnr, year) %>% filter(year == 2004 | year == 2005)
More concisely:
tab3 <- stores %>% select(idnr, year) %>% filter(year %in% c(2004, 2005))
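If the goal is the output shown in the question (only idnr's present in both years), one possible dplyr sketch is to add a per-group check after the year filter. This is my own assumption about how to extend the pipeline, not part of the answer above; the data frame again rebuilds only the idnr and year columns, and both_years is a name I made up.

```r
library(dplyr)

# Only the two relevant columns of the question's data are rebuilt here
stores <- data.frame(
  idnr = c(1, 1, 1, 1, 2, 3, 3, 3, 3, 4, 4, 4),
  year = c(2004, 2005, 2006, 2007, 2004, 2004, 2005, 2006, 2007,
           2005, 2006, 2007)
)

both_years <- stores %>%
  filter(year %in% c(2004, 2005)) %>%      # keep only the two years of interest
  group_by(idnr) %>%                       # evaluate the next condition per store
  filter(all(c(2004, 2005) %in% year)) %>% # keep stores that have both years
  ungroup()

both_years
```

The second filter() returns a single TRUE or FALSE per idnr group, so a group is either kept whole or dropped whole.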
Here is an option using "data.table". Convert the dataset ("df") to a "data.table" using setDT. Set the "year" column as the key (setkey(..)). Subset the rows that have 2004 or 2005 in the "year" column (J(c(2004,..))), and select the first two columns (1:2).
library(data.table) # data.table_1.9.5
DT1 <- setkey(setDT(df),year)[J(c(2004,2005)), 1:2, with=FALSE]
DT1
# idnr year
#1: 1 2004
#2: 2 2004
#3: 3 2004
#4: 1 2005
#5: 3 2005
#6: 4 2005
Based on the updated expected result, we can check whether there is more than one unique "year" entry (uniqueN(year)>1) per "idnr" group, get the row index (.I) as a column ("V1"), and use it to subset the data.table "DT1".
DT1[DT1[, .I[uniqueN(year)>1], idnr]$V1,]
# idnr year
#1: 1 2004
#2: 1 2005
#3: 3 2004
#4: 3 2005
Or everything in a one-liner:
setDT(df)[year %in% 2004:2005, if(uniqueN(year) > 1L) year, idnr]
# idnr V1
# 1: 1 2004
# 2: 1 2005
# 3: 3 2004
# 4: 3 2005
Or a base R option would be:
indx <- with(df, ave(year==2004, idnr, FUN=any)& ave(year==2005,
idnr, FUN=any) & year %in% 2004:2005)
df[indx,1:2]
# idnr year
#1 1 2004
#2 1 2005
#6 3 2004
#7 3 2005
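The same idea can also be spelled out with set operations, which some may find easier to read than the ave() version. This is just a sketch under the assumption that only idnr and year matter; ids_2004, ids_2005 and ids_both are illustrative names of mine.

```r
# Only the two relevant columns of the question's data are rebuilt here
stores <- data.frame(
  idnr = c(1, 1, 1, 1, 2, 3, 3, 3, 3, 4, 4, 4),
  year = c(2004, 2005, 2006, 2007, 2004, 2004, 2005, 2006, 2007,
           2005, 2006, 2007)
)

# Stores observed in each year
ids_2004 <- stores$idnr[stores$year == 2004]
ids_2005 <- stores$idnr[stores$year == 2005]

# Stores observed in both years
ids_both <- intersect(ids_2004, ids_2005)

# Keep the 2004/2005 rows of those stores only
tab3 <- stores[stores$idnr %in% ids_both & stores$year %in% c(2004, 2005),
               c("idnr", "year")]
tab3
```

With the sample data this should keep rows 1, 2, 6 and 7, matching the ave() result above.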
Based on the dataset and the expected result shown, we can check whether the first value of "year" is 2005 for each "idnr" group (this relies on the rows being ordered by year within each group, as they are in the sample data). If it is TRUE, subset the first observation (.SD[1L,..]) and select the columns that are needed.
setDT(df)[,if(year[1L]==2005) .SD[1L,1,with=FALSE], by = idnr]
# idnr year
#1: 4 2005
Or
setDT(df)[df[,.I[year[1L]==2005] , by = idnr]$V1[1L], 1:2, with=FALSE]
# idnr year
#1: 4 2005
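For the "only in 2005" case, a base R sketch with setdiff() might be an alternative that does not depend on the row order within each idnr. As before, this rebuilds only the idnr and year columns, and the names ids_only_2005 and only_2005 are hypothetical. I read "only in 2005" as "present in 2005 but not in 2004", which is what the expected output in the update suggests.

```r
# Only the two relevant columns of the question's data are rebuilt here
stores <- data.frame(
  idnr = c(1, 1, 1, 1, 2, 3, 3, 3, 3, 4, 4, 4),
  year = c(2004, 2005, 2006, 2007, 2004, 2004, 2005, 2006, 2007,
           2005, 2006, 2007)
)

# Stores observed in each year
ids_2004 <- stores$idnr[stores$year == 2004]
ids_2005 <- stores$idnr[stores$year == 2005]

# Stores observed in 2005 but not in 2004
ids_only_2005 <- setdiff(ids_2005, ids_2004)

# Keep the 2005 row of those stores
only_2005 <- stores[stores$idnr %in% ids_only_2005 & stores$year == 2005,
                    c("idnr", "year")]
only_2005
```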