Compare multiple columns in 2 different dataframes in R
I am trying to compare multiple columns in two different dataframes in R. This has been addressed previously on the forum (Compare group of two columns and return index matches R), but this is a different scenario: I am trying to check whether a column in dataframe 1 falls within the range defined by 2 columns in dataframe 2. Functions like match, merge, join, intersect won't work here. I have been trying to use purrr::pluck but didn't get far. The dataframes are of different sizes.
Below is an example:
temp1.df <- mtcars
temp2.df <- data.frame(
  Cyl   = sample(4:8, 100, replace = TRUE),
  Start = sample(1:22, 100, replace = TRUE),
  End   = sample(1:22, 100, replace = TRUE)
)
temp1.df$cyl <- as.character(temp1.df$cyl)
temp2.df$Cyl <- as.character(temp2.df$Cyl)
My attempt:
library(dplyr)

temp1.df <- temp1.df %>% mutate(new_mpg = case_when(
  temp1.df$cyl %in% temp2.df$Cyl & temp2.df$Start <= temp1.df$mpg & temp2.df$End >= temp1.df$mpg ~ 1
))
Error:
Error in mutate_impl(.data, dots) :
Column `new_mpg` must be length 32 (the number of rows) or one, not 100
Expected Result:
- Compare temp1.df$cyl and temp2.df$Cyl. If they match, then -->
- Check if temp1.df$mpg is between temp2.df$Start and temp2.df$End -->
- If it is, then create a new variable new_mpg with a value of 1.
It's hard to show the exact expected output here.
I realize I could loop over each row of temp1.df, but the original temp2.df has over 250,000 rows. An efficient solution would be much appreciated.
Thanks
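temp1.df$new_mpg <- apply(temp1.df, 1, function(x) {
  # x[1] is mpg and x[2] is cyl (positional, based on the column order of mtcars)
  temp <- temp2.df[temp2.df$Cyl == x[2], ]
  ifelse(any(apply(temp, 1, function(y) {
    # y[2] is Start and y[3] is End
    dplyr::between(as.numeric(x[1]), as.numeric(y[2]), as.numeric(y[3]))
  })), 1, 0)
})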
Note that this makes some assumptions about the organization of your actual data. In particular, I can't call on the column names within apply, so I'm using indexes, which may very well change. You might want to rearrange your data between receiving it and calling apply, or change its organization within the call itself, e.g., by apply(temp1.df[,c("mpg","cyl")]...
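One way to make the positional indexes less fragile, sketched here under a few assumptions: when apply walks over rows, each row arrives as a named vector, so name-based indexing such as x[["mpg"]] should work even though x$mpg does not. The set.seed call is not part of the original example and is added only to make the sketch reproducible. The sketch also swaps dplyr::between for a plain vectorized comparison so that it runs with base R alone.

```r
# Sketch only: same idea as the answer above, but columns are picked by name.
set.seed(1)  # assumption: seed added purely for reproducibility

temp1.df <- mtcars
temp2.df <- data.frame(
  Cyl   = sample(4:8, 100, replace = TRUE),
  Start = sample(1:22, 100, replace = TRUE),
  End   = sample(1:22, 100, replace = TRUE)
)
temp1.df$cyl <- as.character(temp1.df$cyl)
temp2.df$Cyl <- as.character(temp2.df$Cyl)

# Pass only the two columns that are needed; apply() coerces them to a
# character matrix, so mpg has to be converted back to numeric inside.
temp1.df$new_mpg <- apply(temp1.df[, c("mpg", "cyl")], 1, function(x) {
  mpg  <- as.numeric(x[["mpg"]])
  # Rows of temp2.df with the same cylinder count as the current row
  rows <- temp2.df[temp2.df$Cyl == x[["cyl"]], ]
  # 1 if any Start..End range contains mpg, otherwise 0
  as.integer(any(rows$Start <= mpg & rows$End >= mpg))
})
```

Because the columns are looked up by name, reordering the columns of temp1.df should not change the result.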
At any rate, this breaks your data set into rows, and each row is compared to the subset of the second dataset with the same Cyl value. Within this subset, it checks whether the mpg for this row falls between (from dplyr) Start and End for any of the subset's rows, and returns 1 if yes (or 0 if no). All these ones and zeros are then returned as a (named) vector, which can be placed into temp1.df$new_mpg.
I'm guessing there's a way to do this with rowwise, but I could never get it to work properly...
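Since the question mentions that the real temp2.df has over 250,000 rows, a join-then-filter approach may scale better than nested apply calls. The sketch below is an alternative, not the method described above: merge alone cannot express the range condition, but merging on the cylinder key and filtering on Start and End afterwards can. The row_id column and the pairs and hits objects are hypothetical helpers introduced for illustration, and set.seed is added only for reproducibility.

```r
# Sketch only: vectorized alternative using base R merge() plus a filter.
set.seed(1)  # assumption: seed added purely for reproducibility

temp1.df <- mtcars
temp2.df <- data.frame(
  Cyl   = sample(4:8, 100, replace = TRUE),
  Start = sample(1:22, 100, replace = TRUE),
  End   = sample(1:22, 100, replace = TRUE)
)
temp1.df$cyl <- as.character(temp1.df$cyl)
temp2.df$Cyl <- as.character(temp2.df$Cyl)

# Hypothetical helper column: merge() reorders rows, so keep an id to map back
temp1.df$row_id <- seq_len(nrow(temp1.df))

# Pair every temp1.df row with all temp2.df rows sharing its cylinder count
pairs <- merge(temp1.df[, c("row_id", "cyl", "mpg")], temp2.df,
               by.x = "cyl", by.y = "Cyl")

# Ids of the rows whose mpg falls inside at least one Start..End range
hits <- pairs$row_id[pairs$Start <= pairs$mpg & pairs$End >= pairs$mpg]

# 1 where a matching range exists, 0 otherwise
temp1.df$new_mpg <- as.integer(temp1.df$row_id %in% hits)
temp1.df$row_id <- NULL
```

The intermediate pairs table holds one row per matching combination of a temp1.df row and a temp2.df row, so memory use grows with the number of rows per cylinder group. For much larger inputs, a package that supports range (non-equi) joins might be worth considering instead.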