
Conditional subsetting gone wrong in R

I have a fairly basic problem with subsetting in R, but because I'm a newbie I don't know how to solve it properly. Here is an example of the panel data I have:

   idnr year  sales space municipality   pop
 1    1 2004 110000  1095          136 71377
 2    1 2005 110000  1095          136 71355
 3    1 2006 110000  1095          136 71837
 4    1 2007 120000  1095          136 72956
 5    2 2004  35000   800          136 71377
 6    3 2004  45000  1000          136 71377
 7    3 2005  45000  1000         2584 23135
 8    3 2006  45000  1000         2584 23258
 9    3 2007  45000  1000         2584 23407
 10   4 2005 180000  5000         2584 23254
 11   4 2006 220000  5000         2584 23135
 12   4 2007 250000  5000         2584 23258

My problem is that I want to subset the data using conditions for both year == 2004 AND (not OR) year == 2005. However, it doesn't seem to work. Code:

 tab3 <- stores[stores$year==2004 & stores$year==2005, c("idnr","year")]

What I am trying to say is that I need to select the entries that existed in both 2004 and 2005. Some entries existed in only 2004 or only 2005, and those should be excluded. Using the data above as an example, this should be the output:

 idnr year
 1    2004
 1    2005
 3    2004
 3    2005

Update:

I was hoping that akrun's method would also work for selecting entries that appeared ONLY in 2005, such that:

 idnr year
 4    2005

Unfortunately, it doesn't. Instead, it groups the idnr's that appeared in both 2004 and 2005 together with those that appeared only in 2005. Any ideas?

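If you want to subset with either year == 2004 or year == 2005, you need to use the | operator instead of & in your current approach. The condition is evaluated row by row, and a single row can never have both years, so the & version is FALSE for every row: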

tab3 <- stores[stores$year == 2004 | stores$year == 2005, c("idnr", "year")]

Which results in:

#> tab3
#   idnr year
#1     1 2004
#2     1 2005
#5     2 2004
#6     3 2004
#7     3 2005
#10    4 2005
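
To see the difference between the two operators, here is a minimal sketch. It assumes the question's data is rebuilt by hand with only the idnr and year columns; the names both_and and either are just for illustration.

```r
# Rebuild the two relevant columns of the question's example data.
stores <- data.frame(
  idnr = c(1, 1, 1, 1, 2, 3, 3, 3, 3, 4, 4, 4),
  year = c(2004:2007, 2004, 2004:2007, 2005:2007)
)

# `&` is applied row by row: no single row has year 2004 and 2005 at once,
# so nothing is selected.
both_and <- stores[stores$year == 2004 & stores$year == 2005, c("idnr", "year")]
nrow(both_and)
# 0

# `|` keeps every row from either year, including idnr 2 (2004 only)
# and idnr 4 (2005 only).
either <- stores[stores$year == 2004 | stores$year == 2005, c("idnr", "year")]
nrow(either)
# 6
```

Note that | still returns idnr 2 and idnr 4, so on its own it does not give the output asked for in the question.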

Or using dplyr:

library(dplyr)
tab3 <- stores %>% select(idnr, year) %>% filter(year == 2004 | year == 2005)

More concisely:

tab3 <- stores %>% select(idnr, year) %>% filter(year %in% c(2004, 2005)) 
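
The filters above keep rows from either year. To keep only the idnr's present in both years, as the question asks, one possible extension is a grouped filter. This is a sketch rather than part of the original answer, and it assumes the example data is rebuilt by hand with just the two relevant columns:

```r
library(dplyr)

# Rebuild the two relevant columns of the question's example data.
stores <- data.frame(
  idnr = c(1, 1, 1, 1, 2, 3, 3, 3, 3, 4, 4, 4),
  year = c(2004:2007, 2004, 2004:2007, 2005:2007)
)

tab3 <- stores %>%
  select(idnr, year) %>%
  filter(year %in% c(2004, 2005)) %>%  # rows from either year
  group_by(idnr) %>%
  filter(n_distinct(year) == 2) %>%    # keep idnr's that have both years
  ungroup()
```

After the first filter, each idnr can have at most two distinct years, so n_distinct(year) == 2 means both 2004 and 2005 are present.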

Here is an option using "data.table". Convert the dataset ("df") to a "data.table" using setDT. Set the "year" column as the key ( setkey(..) ). Subset the rows that have 2004 or 2005 in the "year" column ( J(c(2004,..) ), and select the first two columns ( 1:2 ).

library(data.table) # data.table_1.9.5 
DT1 <- setkey(setDT(df),year)[J(c(2004,2005)), 1:2, with=FALSE]
DT1
#    idnr year
#1:    1 2004
#2:    2 2004
#3:    3 2004
#4:    1 2005
#5:    3 2005
#6:    4 2005

Update

Based on the updated expected result, we can check whether there is more than one unique "year" entry ( uniqueN(year)>1 ) per "idnr" group, get the row index ( .I ) as a column ("V1"), and use it to subset the data.table "DT1".

 DT1[DT1[, .I[uniqueN(year)>1], idnr]$V1,]
 #     idnr year
 #1:    1 2004
 #2:    1 2005
 #3:    3 2004
 #4:    3 2005

Or everything in a one-liner:

setDT(df)[year %in% 2004:2005, if(uniqueN(year) > 1L) year, idnr]
#    idnr   V1
# 1:    1 2004
# 2:    1 2005
# 3:    3 2004
# 4:    3 2005

Or a base R option would be

 indx <- with(df, ave(year==2004, idnr, FUN=any)& ave(year==2005, 
                     idnr, FUN=any) & year %in% 2004:2005)
 df[indx,1:2]
 #  idnr year
 #1    1 2004
 #2    1 2005
 #6    3 2004
 #7    3 2005
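
The same idea can be spelled out step by step with intersect(), which may be easier to read than the ave() version. This is a sketch that assumes "df" is rebuilt by hand with only the two relevant columns; ids_2004, ids_2005 and in_both are illustrative names.

```r
# Rebuild the two relevant columns of the question's example data.
df <- data.frame(
  idnr = c(1, 1, 1, 1, 2, 3, 3, 3, 3, 4, 4, 4),
  year = c(2004:2007, 2004, 2004:2007, 2005:2007)
)

ids_2004 <- df$idnr[df$year == 2004]     # idnr's seen in 2004
ids_2005 <- df$idnr[df$year == 2005]     # idnr's seen in 2005
in_both  <- intersect(ids_2004, ids_2005)  # idnr's seen in both years

# Keep the 2004/2005 rows of those idnr's only.
tab3 <- df[df$idnr %in% in_both & df$year %in% 2004:2005, c("idnr", "year")]
```

For the example data this keeps rows 1, 2, 6 and 7, the same rows as the ave() approach.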

Update2

Based on the dataset and the expected result shown, we can check whether the first value of "year" is 2005 for each "idnr" group (this relies on the rows being sorted by year within each "idnr"). If it is TRUE, then subset the first observation ( .SD[1L,..] ) and select the columns that are needed.

   setDT(df)[,if(year[1L]==2005) .SD[1L,1,with=FALSE], by = idnr]
   #   idnr year
   #1:    4 2005

Or

   setDT(df)[df[, if(year[1L]==2005) .I[1L], by = idnr]$V1, 1:2, with=FALSE]
   #   idnr year
   #1:    4 2005
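
Checking year[1L] depends on the row order. If the order is not guaranteed, a base R sketch with setdiff() expresses "in 2005 but not in 2004" directly. It assumes "df" is rebuilt by hand with only the two relevant columns; only_2005 and res are illustrative names.

```r
# Rebuild the two relevant columns of the question's example data.
df <- data.frame(
  idnr = c(1, 1, 1, 1, 2, 3, 3, 3, 3, 4, 4, 4),
  year = c(2004:2007, 2004, 2004:2007, 2005:2007)
)

ids_2004  <- unique(df$idnr[df$year == 2004])  # idnr's seen in 2004
ids_2005  <- unique(df$idnr[df$year == 2005])  # idnr's seen in 2005
only_2005 <- setdiff(ids_2005, ids_2004)       # in 2005 but not in 2004

# Keep just the 2005 rows of those idnr's.
res <- df[df$idnr %in% only_2005 & df$year == 2005, c("idnr", "year")]
```

For the example data, only idnr 4 qualifies, so res contains its single 2005 row.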
