Complex join between two dataframes

I am working on a join between two dataframes that is more advanced than what I can manage on my own, and I would appreciate some help. I have two dataframes, df1 and df2, which I include at the end as dput() output. My first dataframe, df1, looks like this:

df1
                             name Var
1   RODRIGUEZ PAREDES MARIA BELEN   1
2  VALLEJO JANETA PAOLO ALEXANDER   1
3             MORALES JADAN DIANA   1
4            FREIRE PASPUEL BYRON   1
5             ORTIZ PRADO ESTEBAN   1
6      HENRIQUEZ TRUJILLO AQUILES   1
7            RIVERA OLIVERO ISMAR   1
8               JARAMILLO TATIANA   1
9                   LOZADA TANNYA   1
10 GARCIA BEREGUIAIN MIGUEL ANGEL   1

It consists of Latin names and one variable.

The second dataframe, df2, looks like this:

df2
                             name Val1 Val2
1   RODRIGUEZ PAREDES MARIA BELEN    a    b
2           RODRIGUEZ MARIA BELEN    c    b
3  VALLEJO JANETA PAOLO ALEXANDER    a    a
4               VALLEJO ALEXANDER    b    b
5             MORALES JADAN DIANA    a    a
6            FREIRE PASPUEL BYRON    d    c
7                    FREIRE BYRON    a    c
8             ORTIZ PRADO ESTEBAN    a    a
9             ORTIZ-PRADO ESTEBAN    a    a
10     HENRIQUEZ TRUJILLO AQUILES    b    b
11              HENRIQUEZ AQUILES    a    b
12                   HENRIQUEZ A.    c    c
13      JARAMILLO VIVANCO TATIANA    a    a
14                   JARAMILLO T.    a    b
15                  LOZADA TANNYA    a    a
16 GARCIA BEREGUIAIN MIGUEL ANGEL    b    b
17            GARCIA MIGUEL ANGEL    a    a

This dataframe is essential because it contains additional information. Now to my main issue. I need to join these two dataframes and compute a variable that counts the number of similar observations. Both dataframes share the key name, which should be used for the merge, but the join is troublesome. An example explains it better. Take the name RODRIGUEZ PAREDES MARIA BELEN from df1. I need to merge with df2 and compute a variable Number that tells how many similar names exist. In this case RODRIGUEZ PAREDES MARIA BELEN is similar or identical to RODRIGUEZ PAREDES MARIA BELEN and RODRIGUEZ MARIA BELEN from df2, so Number should be equal to 2. In addition, after the comparison I need to bring in the variables of the row that matches the name. So for RODRIGUEZ PAREDES MARIA BELEN we would have a and b in Val1 and Val2.

Comparing the name variable this way is complex, and I also do not know which kind of join I should use to bring in the other variables.

There is one more consideration. Take the name JARAMILLO TATIANA: if we compare it with df2, there is no exact match. The variable Number should still be 2 because there are two similar names, and Val1 and Val2 must contain the values of the first closest or identical string found in df2. So for this name we would have Val1=a and Val2=a, because the closest match in df2 is JARAMILLO VIVANCO TATIANA.

In the end I would like to have a new dataframe like this:

                             name Var Number Val1 Val2
1   RODRIGUEZ PAREDES MARIA BELEN   1      2    a    b
2  VALLEJO JANETA PAOLO ALEXANDER   1      2    a    a
3             MORALES JADAN DIANA   1      1    a    a
4            FREIRE PASPUEL BYRON   1      2    d    c
5             ORTIZ PRADO ESTEBAN   1      2    a    a
6      HENRIQUEZ TRUJILLO AQUILES   1      3    b    b
7            RIVERA OLIVERO ISMAR   1      0 <NA> <NA>
8               JARAMILLO TATIANA   1      2    a    a
9                   LOZADA TANNYA   1      1    a    a
10 GARCIA BEREGUIAIN MIGUEL ANGEL   1      2    b    b

I have tried left_join() and merge(), but neither lets me fill in all the variables, especially Number. I also looked at the fuzzyjoin package, but it is not clear to me how to use it. If possible I would prefer a dplyr solution, but a quick base solution or a fuzzyjoin solution would be great too.

Many thanks. The data follows:

#Data 1
df1 <- structure(list(name = c("RODRIGUEZ PAREDES MARIA BELEN", "VALLEJO JANETA PAOLO ALEXANDER", 
"MORALES JADAN DIANA", "FREIRE PASPUEL BYRON", "ORTIZ PRADO ESTEBAN", 
"HENRIQUEZ TRUJILLO AQUILES", "RIVERA OLIVERO ISMAR", "JARAMILLO TATIANA", 
"LOZADA TANNYA", "GARCIA BEREGUIAIN MIGUEL ANGEL"), Var = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA, 
-10L))

#Data 2
df2 <- structure(list(name = c("RODRIGUEZ PAREDES MARIA BELEN", "RODRIGUEZ MARIA BELEN", 
"VALLEJO JANETA PAOLO ALEXANDER", "VALLEJO ALEXANDER", "MORALES JADAN DIANA", 
"FREIRE PASPUEL BYRON", "FREIRE BYRON", "ORTIZ PRADO ESTEBAN", 
"ORTIZ-PRADO ESTEBAN", "HENRIQUEZ TRUJILLO AQUILES", "HENRIQUEZ AQUILES", 
"HENRIQUEZ A.", "JARAMILLO VIVANCO TATIANA", "JARAMILLO T.", 
"LOZADA TANNYA", "GARCIA BEREGUIAIN MIGUEL ANGEL", "GARCIA MIGUEL ANGEL"
), Val1 = c("a", "c", "a", "b", "a", "d", "a", "a", "a", "b", 
"a", "c", "a", "a", "a", "b", "a"), Val2 = c("b", "b", "a", "b", 
"a", "c", "c", "a", "a", "b", "b", "c", "a", "b", "a", "b", "a"
)), class = "data.frame", row.names = c(NA, -17L))

Here is a suggestion. It is not the exact solution, but I think it will get you further:

You will notice that the result differs from your expected output. But you can go through the code line by line and experiment with max_dist (for example values other than .2) or try another method.
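To see why some short forms fall outside max_dist = .2, it can help to compute the distance by hand. The sketch below is a base R approximation of what I understand the jaccard method to do with its default single-character q-grams: it compares the sets of distinct characters of the two strings. The helper name jaccard_chars is made up for this illustration and is not part of fuzzyjoin or stringdist.

```r
# Hypothetical helper (not a fuzzyjoin/stringdist function):
# Jaccard distance between the sets of distinct characters of two strings.
jaccard_chars <- function(a, b) {
  sa <- unique(strsplit(a, "")[[1]])   # distinct characters of a
  sb <- unique(strsplit(b, "")[[1]])   # distinct characters of b
  1 - length(intersect(sa, sb)) / length(union(sa, sb))
}

# 9 shared characters out of 14 distinct ones: 1 - 9/14, about 0.36
jaccard_chars("FREIRE PASPUEL BYRON", "FREIRE BYRON")
```

With a distance of about 0.36, FREIRE BYRON is above the .2 cutoff, which is consistent with Number being 1 instead of 2 for that name in the result.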

By stepping through the lines you will see which name in df2 is matched to each name in df1:

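library(dplyr)
library(fuzzyjoin)

# Fuzzy left join on name; keep the string distance of each pair in dist
fuzzyjoin::stringdist_left_join(x = df1, y = df2, max_dist = .2,
                                by = "name",
                                method = "jaccard",
                                distance_col = "dist") %>%
  mutate(id = row_number()) %>%                         # remember the joined row order
  group_by(name.x) %>%
  add_count(name = "Number") %>%                        # joined rows per df1 name
  mutate(Number = ifelse(is.na(dist), 0, Number)) %>%   # no match at all: 0
  arrange(id) %>%
  filter(dist == min(dist) | is.na(dist)) %>%           # keep the closest match
  ungroup() %>%
  select(name = name.x, Var, Number, Val1, Val2)        # columns shown below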
  name                             Var Number Val1  Val2 
   <chr>                          <int>  <dbl> <chr> <chr>
 1 RODRIGUEZ PAREDES MARIA BELEN      1      2 a     b    
 2 VALLEJO JANETA PAOLO ALEXANDER     1      2 a     a    
 3 MORALES JADAN DIANA                1      1 a     a    
 4 FREIRE PASPUEL BYRON               1      1 d     c    
 5 ORTIZ PRADO ESTEBAN                1      2 a     a    
 6 HENRIQUEZ TRUJILLO AQUILES         1      2 b     b    
 7 RIVERA OLIVERO ISMAR               1      0 NA    NA   
 8 JARAMILLO TATIANA                  1      2 a     a    
 9 LOZADA TANNYA                      1      1 a     a    
10 GARCIA BEREGUIAIN MIGUEL ANGEL     1      2 b     b   
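If a string distance keeps missing the short forms, a cruder base R alternative is to treat two names as similar when they share the first surname, meaning everything before the first space or hyphen. This is only a sketch under that assumption, which is mine and not something stated in the question, and the helper names first_surname and match_by_surname are made up for the example.

```r
# Hypothetical helpers, base R only.
# Key = first surname: drop everything from the first space or hyphen on.
first_surname <- function(x) sub("[ -].*$", "", x)

match_by_surname <- function(df1, df2) {
  key1 <- first_surname(df1$name)
  key2 <- first_surname(df2$name)

  # How many df2 rows share the key of each df1 row (0 when none do)
  number <- vapply(key1, function(k) sum(key2 == k), integer(1),
                   USE.NAMES = FALSE)

  # match() gives the position of the first df2 row with the same key,
  # or NA when there is none, which yields NA in Val1 and Val2
  first_hit <- match(key1, key2)

  data.frame(df1,
             Number = number,
             Val1 = df2$Val1[first_hit],
             Val2 = df2$Val2[first_hit],
             stringsAsFactors = FALSE)
}
```

Calling match_by_surname(df1, df2) on the sample data should give the Number, Val1 and Val2 values of the desired table, including 3 for HENRIQUEZ and 0 with NA for RIVERA. The obvious limitation is that two different people with the same first surname would be counted as the same person, so on real data you may want to combine this key with a distance check.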
