
How can I subset a dataframe based on two simultaneously fulfilled conditions in another dataframe in R?

This is my first question on Stack Overflow, so please let me know if any further information is needed to answer it. I started learning R very recently, so I kindly ask for your patience.

I have a data frame Df1 which I want to subset/filter based on two simultaneous conditions:

  • Is the company code also present in Df2?
  • If so, does the year of the date in Df1 also match the year in Df2 for that company?

I have tried the following code:

library(lubridate)  # provides year()
Sub <- subset(Df1, Df1$CompanyCode %in% Df2$CompanyCode & year(Df1$Date) %in% Df2$Year)

I think I know where the problem is, but I don't know how to fix it. I think the expression above checks the two "%in%" conditions independently of each other and therefore returns too many cases.

To give a concrete example (see below; EDIT: now provided as dput output, as requested): I would expect row #4 of Df1 not to be in my result, because there is no matching case in Df2. However, it is part of the resulting subset. I guess this is because a match can be found for the company code and for the year individually, i.e. company "B" occurs in Df2 and the year 2016 occurs in Df2. This is not what I want, because there is no row in Df2 where both conditions are fulfilled at the same time.
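The suspicion described above can be checked directly. The following is only a small base R sketch, using the same data as the dput output in this question; the names yr, code_found, year_found and check are made up for illustration, and format() is used in place of lubridate's year() so that no package is needed.

```r
# Data copied from the dput output in the question.
Df1 <- data.frame(
  CompanyCode = c("A", "A", "B", "B", "C", "D"),
  Date = as.Date(c("2015-12-31", "2016-12-31", "2015-12-31",
                   "2016-12-31", "2015-12-31", "2016-12-31")),
  stringsAsFactors = FALSE
)
Df2 <- data.frame(
  CompanyCode = c("A", "A", "B", "C", "D"),
  Year = c(2015L, 2016L, 2015L, 2015L, 2016L),
  stringsAsFactors = FALSE
)

# Year of each Df1 date (base R stand-in for lubridate::year()).
yr <- as.integer(format(Df1$Date, "%Y"))

# Each %in% looks at one column only, and compares against ALL rows of Df2.
code_found <- Df1$CompanyCode %in% Df2$CompanyCode
year_found <- yr %in% Df2$Year

# Diagnostic table: one row per Df1 row, with the result of each test.
check <- data.frame(Df1, code_found, year_found,
                    both = code_found & year_found)
print(check)
```

With this data, both is TRUE for all six rows, including row #4 (B, 2016): "B" occurs somewhere in Df2$CompanyCode and 2016 occurs somewhere in Df2$Year, even though they never occur in the same row of Df2.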

Df1 (Input 1):

structure(list(CompanyCode = c("A", "A", "B", "B", "C", "D"), 
    Date = structure(c(16800, 17166, 16800, 17166, 16800, 17166
    ), class = "Date")), row.names = c(NA, -6L), class = "data.frame")

Df2 (Input 2):

structure(list(CompanyCode = c("A", "A", "B", "C", "D"), Year = c(2015L, 
2016L, 2015L, 2015L, 2016L)), class = "data.frame", row.names = c(NA, 
-5L))

Sub (Actual Output):

structure(list(CompanyCode = c("A", "A", "B", "B", "C", "D"), 
    Date = structure(c(16800, 17166, 16800, 17166, 16800, 17166
    ), class = "Date")), row.names = c(NA, 6L), class = "data.frame")

ExpectedSub (Expected Output):

structure(list(CompanyCode = c("A", "A", "B", "C", "D"), Date = structure(c(16800, 
17166, 16800, 16800, 17166), class = "Date")), row.names = c(NA, 
-5L), class = "data.frame")

I would greatly appreciate it if you could help me out here. Hopefully this example makes my problem clear.

Many thanks in advance!

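One more way, using dplyr and lubridate:

library(dplyr)
library(lubridate)
# df1 and df2 are the question's Df1 and Df2.
# right_join() keeps every row of df2 and attaches the matching rows of df1.
df1 %>% mutate(Year = year(Date)) %>%
  right_join(df2, by = c("CompanyCode", "Year"))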

  CompanyCode       Date Year
1           A 2015-12-31 2015
2           A 2016-12-31 2016
3           B 2015-12-31 2015
4           C 2015-12-31 2015
5           D 2016-12-31 2016
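If the goal is only to filter the first data frame, a filtering join may be a closer fit than right_join(). right_join() keeps every row of df2, so a company/year pair in df2 with no match in df1 would show up as an extra row with an NA date. Below is a minimal sketch, assuming dplyr is installed and using the data from the question; the name result is just for illustration, and the year is extracted with base format() so that lubridate is not needed here.

```r
library(dplyr)

# Data copied from the dput output in the question.
Df1 <- data.frame(
  CompanyCode = c("A", "A", "B", "B", "C", "D"),
  Date = as.Date(c("2015-12-31", "2016-12-31", "2015-12-31",
                   "2016-12-31", "2015-12-31", "2016-12-31")),
  stringsAsFactors = FALSE
)
Df2 <- data.frame(
  CompanyCode = c("A", "A", "B", "C", "D"),
  Year = c(2015L, 2016L, 2015L, 2015L, 2016L),
  stringsAsFactors = FALSE
)

result <- Df1 %>%
  # Helper column so that both data frames share the key columns.
  mutate(Year = as.integer(format(Date, "%Y"))) %>%
  # Keep only rows of Df1 whose (CompanyCode, Year) pair occurs in Df2.
  semi_join(Df2, by = c("CompanyCode", "Year")) %>%
  # Drop the helper column again so the result has Df1's original columns.
  select(-Year)

print(result)
```

semi_join() never adds columns or rows from Df2; it only decides which rows of Df1 survive. For this data it keeps the five rows of the expected output and drops the (B, 2016) row.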

You can paste the CompanyCode and year values together to create a unique key for each row, and then use %in% to keep only the rows of df1 whose key is also present in df2.

result <- subset(df1, paste(CompanyCode, format(Date, '%Y')) %in% 
                      paste(df2$CompanyCode, df2$Year))
result

#  CompanyCode       Date
#1           A 2015-12-31
#2           A 2016-12-31
#3           B 2015-12-31
#5           C 2015-12-31
#6           D 2016-12-31
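To see why the pasted key works, it can help to build the keys step by step. This is only a sketch in base R using the data from the question; key1, key2 and keep are names made up for illustration, and the explicit "_" separator is an assumption (any string that does not occur in the company codes should do).

```r
# Data copied from the dput output in the question.
Df1 <- data.frame(
  CompanyCode = c("A", "A", "B", "B", "C", "D"),
  Date = as.Date(c("2015-12-31", "2016-12-31", "2015-12-31",
                   "2016-12-31", "2015-12-31", "2016-12-31")),
  stringsAsFactors = FALSE
)
Df2 <- data.frame(
  CompanyCode = c("A", "A", "B", "C", "D"),
  Year = c(2015L, 2016L, 2015L, 2015L, 2016L),
  stringsAsFactors = FALSE
)

# One key per row: company code and year glued together.
key1 <- paste(Df1$CompanyCode, format(Df1$Date, "%Y"), sep = "_")
key2 <- paste(Df2$CompanyCode, Df2$Year, sep = "_")

# A Df1 row is kept only if its combined key occurs in Df2,
# i.e. code and year have to match within the same Df2 row.
keep <- key1 %in% key2

result <- Df1[keep, ]
print(data.frame(key1, keep))
print(result)
```

Here the key "B_2016" from row #4 of Df1 has no counterpart among the Df2 keys, so keep is FALSE for that row only, and the original row numbers 1, 2, 3, 5 and 6 remain.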
