Comparing multiple columns in 2 different data frames in R

Question

I am trying to compare multiple columns in two different data frames in R. This has been addressed previously on the forum (Compare group of two columns and return index matches R), but this is a different scenario: I am trying to check whether a column in data frame 1 falls within the range defined by 2 columns in data frame 2. Functions like match, merge, join, and intersect won't work here. I have been trying to use purrr::pluck but didn't get far. The data frames are of different sizes.

Below is an example:

temp1.df <- mtcars

temp2.df <- data.frame(
  Cyl = sample (4:8, 100, replace = TRUE),
  Start = sample (1:22, 100, replace = TRUE),
  End = sample (1:22, 100, replace = TRUE)
)

temp1.df$cyl <- as.character(temp1.df$cyl)
temp2.df$Cyl <- as.character(temp2.df$Cyl)

My attempt:

temp1.df <- temp1.df %>% mutate (new_mpg = case_when (
  temp1.df$cyl %in% temp2.df$Cyl & temp2.df$Start <= temp1.df$mpg & temp2.df$End >= temp1.df$mpg ~ 1
))

Error:

Error in mutate_impl(.data, dots) : 
  Column `new_mpg` must be length 32 (the number of rows) or one, not 100
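For what it's worth, the length mismatch can be reproduced without dplyr. The sketch below uses small made-up data frames (df1 and df2 are stand-ins, not the real data) to show that the condition mixes a vector of length nrow(df1) with vectors of length nrow(df2), so R recycles and the result ends up as long as the second data frame.

```r
# Toy stand-ins for temp1.df and temp2.df (hypothetical values, different sizes on purpose)
df1 <- data.frame(cyl = c("4", "6", "8"),
                  mpg = c(22.8, 21.0, 18.7),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Cyl   = c("4", "6", "6", "8", "8"),
                  Start = c(20, 1, 22, 1, 19),
                  End   = c(25, 10, 30, 5, 30),
                  stringsAsFactors = FALSE)

# This part has one element per row of df1 ...
in_len <- length(df1$cyl %in% df2$Cyl)

# ... but the range checks have one element per row of df2, and df1$mpg is
# silently recycled against them (R only warns about the uneven lengths).
cond <- suppressWarnings(
  df1$cyl %in% df2$Cyl & df2$Start <= df1$mpg & df2$End >= df1$mpg
)

in_len        # same as nrow(df1)
length(cond)  # same as nrow(df2), which is what mutate() complains about
```

So the condition never compares each row of the first data frame against all matching rows of the second; it lines the two up element by element.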

Expected Result:

Compare temp1.df$cyl and temp2.df$Cyl. If they match, then -->

Check if temp1.df$mpg is between temp2.df$Start and temp2.df$End -->

If it is, then create a new variable new_mpg with a value of 1.

It's hard to show the exact expected output here.

I realize I could loop over each row of temp1.df, but the original temp2.df has over 250,000 rows. An efficient solution would be much appreciated.

Thanks

Answer 1

temp1.df$new_mpg<-apply(temp1.df, 1, function(x) {
  temp<-temp2.df[temp2.df$Cyl==x[2],] 
  ifelse(any(apply(temp, 1, function(y) {
    dplyr::between(as.numeric(x[1]),as.numeric(y[2]),as.numeric(y[3]))
  })),1,0)
})

Note that this makes some assumptions about the organization of your actual data. In particular, I can't call on the column names within apply, so I'm using indexes, which may very well change. You might want to rearrange your data between receiving it and calling apply, or change its organization within the call, e.g. by apply(temp1.df[,c("mpg","cyl")]... .
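As a hedged sketch of that last idea: selecting the two columns by name before calling apply fixes their order, and apply passes each row as a named vector when the input has column names, so the values can also be looked up by name. The data below is made up for illustration, and the inner apply is swapped for a vectorised comparison, which is my substitution rather than the code above.

```r
# Hypothetical toy data; the extra hp column shows that column order no longer matters
temp1 <- data.frame(mpg = c(21.0, 22.8, 18.7),
                    cyl = c("6", "4", "8"),
                    hp  = c(110, 93, 175),
                    stringsAsFactors = FALSE)
temp2 <- data.frame(Cyl   = c("4", "6", "8"),
                    Start = c(20, 1, 10),
                    End   = c(25, 10, 19),
                    stringsAsFactors = FALSE)

flag <- apply(temp1[, c("mpg", "cyl")], 1, function(x) {
  # apply() hands over a character vector here, so mpg must be converted back
  mpg  <- as.numeric(x[["mpg"]])
  rows <- temp2[temp2$Cyl == x[["cyl"]], ]
  # 1 if mpg falls inside at least one Start..End range for this cyl, else 0
  as.integer(any(rows$Start <= mpg & mpg <= rows$End))
})

temp1$new_mpg <- flag
```

This still visits every row of temp1, so it is only a tidier variant of the same approach, not a faster one.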

At any rate, this breaks your data set into rows, and each row is compared to the subset of the second data set with the same Cyl value. Within this subset, it checks whether the mpg for this row falls between (from dplyr) Start and End for any row of the subset, and returns 1 if yes (or 0 if no). All these ones and zeros are then returned as a (named) vector, which can be placed into temp1.df$new_mpg.
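If the row-by-row version turns out to be too slow, one possible alternative (a sketch under assumptions, not something I have benchmarked on 250,000 rows) is to pair the rows up with base merge on the cylinder value and then do the range check on whole columns. merge alone does not test the range, it only produces the candidate pairs. The row_id column and the toy data are hypothetical, and the merged table can get large when many ranges share the same Cyl.

```r
# Hypothetical toy data, including a cyl value with no counterpart in temp2
temp1 <- data.frame(mpg = c(21.0, 22.8, 18.7, 15.0),
                    cyl = c("6", "4", "8", "5"),
                    stringsAsFactors = FALSE)
temp2 <- data.frame(Cyl   = c("4", "4", "6", "8"),
                    Start = c(1, 20, 1, 10),
                    End   = c(5, 25, 10, 19),
                    stringsAsFactors = FALSE)

# Remember which original row each candidate pair came from
temp1$row_id <- seq_len(nrow(temp1))

# One row per (temp1 row, temp2 range) combination with the same cylinder value
pairs <- merge(temp1[, c("row_id", "cyl", "mpg")], temp2,
               by.x = "cyl", by.y = "Cyl")

# Vectorised range check over all candidate pairs at once
hit <- pairs$mpg >= pairs$Start & pairs$mpg <= pairs$End

# A temp1 row gets 1 if at least one of its pairs is a hit, otherwise 0
temp1$new_mpg <- as.integer(temp1$row_id %in% pairs$row_id[hit])
```

Rows of temp1 whose cyl never appears in temp2 drop out of the merge and therefore end up with 0.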

I'm guessing there's a way to do this with rowwise, but I could never get it to work properly...
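For reference, here is a minimal sketch of how a rowwise version might look, assuming dplyr is installed and using made-up data. Inside rowwise(), cyl and mpg refer to the single values of the current row, so they can be compared against the full columns of the second data frame.

```r
library(dplyr)

# Hypothetical toy data
temp1 <- data.frame(mpg = c(21.0, 22.8, 18.7),
                    cyl = c("6", "4", "8"),
                    stringsAsFactors = FALSE)
temp2 <- data.frame(Cyl   = c("4", "4", "6", "8"),
                    Start = c(1, 20, 1, 10),
                    End   = c(5, 25, 10, 19),
                    stringsAsFactors = FALSE)

result <- temp1 %>%
  rowwise() %>%
  # cyl and mpg are length-one here; temp2's columns are full length
  mutate(new_mpg = as.integer(any(temp2$Cyl == cyl &
                                  temp2$Start <= mpg &
                                  temp2$End >= mpg))) %>%
  ungroup()
```

Like the apply version, this evaluates one row at a time, so I would not expect it to be faster on large inputs.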

Solution 1
1 accepted 2018-11-14 20:01:48