I am trying to compare multiple columns in two different dataframes in R. This has been addressed previously on the forum ( Compare group of two columns and return index matches R ) but this is a different scenario: I am trying to compare if a column in dataframe 1
is between the range of 2 columns in dataframe 2
. Functions like match, merge, join, intersect
won't work here. I have been trying to use purr::pluck
but didn't get far. The dataframes are of different sizes.
Below is an example:
temp1.df <- mtcars
temp2.df <- data.frame(
Cyl = sample (4:8, 100, replace = TRUE),
Start = sample (1:22, 100, replace = TRUE),
End = sample (1:22, 100, replace = TRUE)
)
temp1.df$cyl <- as.character(temp1.df$cyl)
temp2.df$Cyl <- as.character(temp2.df$Cyl)
My attempt:
temp1.df <- temp1.df %>% mutate (new_mpg = case_when (
temp1.df$cyl %in% temp2.df$Cyl & temp2.df$Start <= temp1.df$mpg & temp2.df$End >= temp1.df$mpg ~ 1
))
Error:
Error in mutate_impl(.data, dots) :
Column `new_mpg` must be length 32 (the number of rows) or one, not 100
Expected Result:
- Compare temp1.df$cyl and temp2.df$Cyl. If they are match then -->
- Check if temp1.df$mpg is between temp2.df$Start and temp2.df$End -->
- if it is, then create a new variable new_mpg with value of 1.
It's hard to show the exact expected output here.
I realize I could loop this so for each row of temp1.df
but the original temp2.df
has over 250,000 rows. An efficient solution would be much appreciated.
Thanks
temp1.df$new_mpg<-apply(temp1.df, 1, function(x) {
temp<-temp2.df[temp2.df$Cyl==x[2],]
ifelse(any(apply(temp, 1, function(y) {
dplyr::between(as.numeric(x[1]),as.numeric(y[2]),as.numeric(y[3]))
})),1,0)
})
Note that this makes some assumptions about the organization of your actual data (in particular, I can't call on the column names within apply
, so I'm using indexes - which may very well change, so you might want to rearrange your data between receiving it and calling apply
, or maybe changing the organization of it within apply
, eg, by apply(temp1.df[,c("mpg","cyl")]...
.
At any rate, this breaks your data set into lines, and each line is compared to the a subset of the second dataset with the same Cyl count. Within this subset, it checks if any
of the mpg for this line falls between
(from dplyr
) Start
and End
, and returns 1 if yes (or 0 if no). All these ones and zeros are then returned as a (named) vector, which can be placed into temp1.df$new_mpg
.
I'm guessing there's a way to do this with rowwise
, but I could never get it to work properly...
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.