
Match observations in two dataframes in R

I have two dataframes. I want to use the elements of one dataframe to search through a column of the other, narrow that second dataframe down to the matching sets, and then keep narrowing it down element by element. The sample data below explains this better.

df1    col1   

1      apples      
2      oranges     
3      apples    
4      banana  
5      grapes
6      mangoes
7      oranges
8      banana

df1 has only one column. df2 has two columns, setID and col1.

df2 setID   col1

1   1   apples      
2   1   oranges     
3   1   oranges
4   1   mangoes
5   1   grapes
6   1   banana  
7   1   banana
8   1   apples    
10  2   apples      
11  2   oranges     
12  2   apples    
13  2   banana  
14  2   grapes
15  2   mangoes
16  2   banana
17  2   oranges
18  3   apples      
19  3   banana  
20  3   oranges     
21  3   apples    
22  3   grapes
23  3   mangoes
24  3   oranges
25  3   banana
26  4   apples      
27  4   oranges     
28  4   apples    
29  4   grapes
30  4   grapes
31  4   oranges     
32  4   banana  
33  4   banana

As you can see, the setID values repeat: each setID marks one set, and the order of the elements within a set is important. Note that df1$col1 does not have to be the same length as a set from df2, nor does it have to be an exact match; it just has to be a close enough match. In this case df1$col1 is closest to df2$setID = 2, with only the last two elements out of order. The match does not have to be exact because I want to use a "search as you type" approach: I do not want to match df1$col1 as a whole against a setID in df2, but to narrow down the possible sets element by element. Assume that the elements of df1 arrive one by one and not as a complete dataframe. For example:

Find a match for df1$col1[1] in df2 and save every set that contains the match to a tempdf. It doesn't matter if df1$col1[1] is found more than once in the same set; if it is found at least once, that set is added to tempdf.

What needs to be retrieved at the end is the setID of the set that most closely matches df1. In this case tempdf will be the same as df2, as all the sets include "apples". The next step matches df1$col1[2] against tempdf, given that the first element already matched; in other words, I guess, it matches df1$col1[1:2] against tempdf. This results in:

tempdf  setID   col1

1   1   apples      
2   1   oranges     
3   1   oranges
4   1   mangoes
5   1   grapes
6   1   banana  
7   1   banana
8   1   apples    
10  2   apples      
11  2   oranges     
12  2   apples    
13  2   banana  
14  2   grapes
15  2   mangoes
16  2   banana
17  2   oranges
26  4   apples      
27  4   oranges     
28  4   apples    
29  4   grapes
30  4   grapes
31  4   oranges     
32  4   banana  
33  4   banana

Basically, setID = 3 is omitted. As this continues with the 3rd element of df1, the new tempdf will contain only setID 2 and 4. The loop (my idea for solving this) would end once only one setID remains, in this case setID = 2. Therefore setID = 2 would be considered the closest match for df1.

Of course, feel free to advise on a better approach than this one.
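The narrowing loop described in the question can be sketched in base R as follows. This is only one reading of the question: it assumes that "matching" at step k means the k-th element of a set equals the k-th element received so far (a prefix match), which is what the tempdf example above implies when it drops setID = 3. The helper name narrow_sets is hypothetical, and the data is retyped from the tables above.

```r
# Data retyped from the question's tables
df1 <- data.frame(col1 = c("apples", "oranges", "apples", "banana",
                           "grapes", "mangoes", "oranges", "banana"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  setID = rep(1:4, each = 8),
  col1 = c("apples", "oranges", "oranges", "mangoes",
           "grapes", "banana", "banana", "apples",
           "apples", "oranges", "apples", "banana",
           "grapes", "mangoes", "banana", "oranges",
           "apples", "banana", "oranges", "apples",
           "grapes", "mangoes", "oranges", "banana",
           "apples", "oranges", "apples", "grapes",
           "grapes", "oranges", "banana", "banana"),
  stringsAsFactors = FALSE)

# Hypothetical helper: keep only the sets whose k-th element equals the
# k-th query element, for k = 1, 2, ..., and stop as soon as a single
# candidate remains (or as soon as no candidate would survive a step).
narrow_sets <- function(query, df2) {
  sets <- split(df2$col1, df2$setID)   # one character vector per setID
  candidates <- names(sets)
  for (k in seq_along(query)) {
    keep <- vapply(sets[candidates],
                   function(s) length(s) >= k && s[k] == query[k],
                   logical(1))
    if (!any(keep)) break              # keep the previous candidates
    candidates <- candidates[keep]
    if (length(candidates) == 1) break
  }
  candidates
}

closest <- narrow_sets(df1$col1, df2)
closest
```

With this data the candidates shrink from all four sets to 1, 2, 4, then to 2, 4, and finally to setID 2 at the fourth element, in line with the walk-through above.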

You might want to look at the "compare" package, which lets you compare objects while allowing for different transformations.

Here are a couple of examples to consider.

Starting sample data. Note setID == 4, which has all the values, but in the wrong order.

df1 <- data.frame(col1 = c("apples", "oranges", "apples", "banana"),
                  stringsAsFactors = FALSE)
df1
##      col1
## 1  apples
## 2 oranges
## 3  apples
## 4  banana

df2 <- structure(list(setID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 
    4, 4, 4, 4), col1 = c("apples", "oranges", "apples", "banana", 
    "apples", "grapes", "oranges", "apples", "oranges", "grapes", 
    "banana", "banana", "apples", "apples", "banana", "oranges")), 
    .Names = c("setID", "col1"), 
    row.names = c("1", "2", "3", "4", "5", "6", "7", "8", 
    "9", "10", "11", "12", "13", "21", "31", "41"), class = "data.frame")
df2
##    setID    col1
## 1      1  apples
## 2      1 oranges
## 3      1  apples
## 4      1  banana
## 5      2  apples
## 6      2  grapes
## 7      2 oranges
## 8      2  apples
## 9      3 oranges
## 10     3  grapes
## 11     3  banana
## 12     3  banana
## 13     4  apples
## 21     4  apples
## 31     4  banana
## 41     4 oranges

Load "compare" and do some comparisons:

library(compare)
lapply(split(df2[, "col1", drop = FALSE], df2$setID), 
       function(x) compare(df1, x))
## $`1`
## TRUE
## 
## $`2`
## FALSE [FALSE]
## 
## $`3`
## FALSE [FALSE]
## 
## $`4`
## FALSE [FALSE]
## 

Allow all transformations before comparison (see ?compare for details if you want to allow only certain transformations).

lapply(split(df2[, "col1", drop = FALSE], df2$setID), 
       function(x) compare(df1, x, allowAll = TRUE))
## $`1`
## TRUE
## 
## $`2`
## FALSE [FALSE]
##   sorted
##   [col1] ignored case
##   renamed rows
##   [col1] ignored case
##   dropped row names
##   [col1] ignored case
## 
## $`3`
## FALSE [FALSE]
##   sorted
##   [col1] ignored case
##   renamed rows
##   [col1] ignored case
##   dropped row names
##   [col1] ignored case
## 
## $`4`
## TRUE
##   sorted
##   renamed rows
##   dropped row names
## 
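To see what the "sorted" transformation amounts to for this data, here is a rough base R sketch that compares each set with df1 after sorting both sides. It does not reproduce how compare works internally; it only illustrates why setID 1 and setID 4 both come out as matches once order is ignored. The name same_values is an illustrative choice, and the data is rebuilt compactly from the sample above.

```r
# Sample data from this answer, rebuilt compactly
df1 <- data.frame(col1 = c("apples", "oranges", "apples", "banana"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  setID = rep(1:4, each = 4),
  col1 = c("apples", "oranges", "apples", "banana",
           "apples", "grapes", "oranges", "apples",
           "oranges", "grapes", "banana", "banana",
           "apples", "apples", "banana", "oranges"),
  stringsAsFactors = FALSE)

# Order-insensitive comparison: a set matches if it holds the same values
# as df1$col1 the same number of times, regardless of position.
same_values <- vapply(split(df2$col1, df2$setID),
                      function(x) identical(sort(x), sort(df1$col1)),
                      logical(1))

matching_ids <- names(same_values)[same_values]
matching_ids
```

Here matching_ids contains "1" and "4", the same two sets for which compare(..., allowAll = TRUE) returns TRUE above.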

Using base R:

split(df2, df2[, 1])[by(df2[2], df2[1], function(x) all(x == df1))]
 $`1`
   setID    col1
 1     1  apples
 2     1 oranges
 3     1  apples
 4     1  banana

The OP has requested to find the setID groups in df2 where the values in col1 are exactly the same as in df1.

For the sake of completeness, here is also a data.table approach:

library(data.table)
tmp <- setDT(df2)[, all(col1 == df1$col1), by = setID][(V1)]
tmp
   setID   V1
1:     1 TRUE

Now, the OP has requested that the matching rows be returned. This can be accomplished either by looking for matching values of setID

df2[setID %in% tmp$setID]
   setID    col1
1:     1  apples
2:     1 oranges
3:     1  apples
4:     1  banana

or by joining (which might be faster on large tables)

df2[tmp, on = "setID", .SD]

returning the same result.

Caveat

The sample datasets provided by the OP suggest that the number of rows in df1 is the same as in each setID group in df2. The OP has not specified the expected result if the numbers of rows differ.
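If the lengths do differ, one possible way to handle it (an assumption, not something the OP specified) is to compare only the positions both vectors share and count the positional matches per set. The sketch below uses a three-element query against the four-element sets from the sample data; the names query, scores and best are illustrative.

```r
# Sample sets from this answer, with a shorter query (3 elements vs. 4)
query <- c("apples", "oranges", "apples")
df2 <- data.frame(
  setID = rep(1:4, each = 4),
  col1 = c("apples", "oranges", "apples", "banana",
           "apples", "grapes", "oranges", "apples",
           "oranges", "grapes", "banana", "banana",
           "apples", "apples", "banana", "oranges"),
  stringsAsFactors = FALSE)

# Assumed scoring rule: count the positions where set and query agree,
# looking only at the first n positions that both of them have.
scores <- vapply(split(df2$col1, df2$setID), function(s) {
  n <- min(length(s), length(query))
  sum(s[seq_len(n)] == query[seq_len(n)])
}, integer(1))

scores
best <- names(scores)[which.max(scores)]  # first setID with the top score
best
```

For this data the scores are 3, 1, 0 and 1 for setID 1 to 4, so setID 1 is picked. which.max returns the first maximum, so ties would need an explicit rule of their own.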
