Complex join between two dataframes
I am working on a join of two dataframes that is complex for me, and I would like to ask for some help if possible. I have two dataframes, df1 and df2, which I include at the end as dput() output. My first dataframe, df1, looks like this:
df1
   name                           Var
1  RODRIGUEZ PAREDES MARIA BELEN  1
2  VALLEJO JANETA PAOLO ALEXANDER 1
3  MORALES JADAN DIANA            1
4  FREIRE PASPUEL BYRON           1
5  ORTIZ PRADO ESTEBAN            1
6  HENRIQUEZ TRUJILLO AQUILES     1
7  RIVERA OLIVERO ISMAR           1
8  JARAMILLO TATIANA              1
9  LOZADA TANNYA                  1
10 GARCIA BEREGUIAIN MIGUEL ANGEL 1
It consists mainly of Latin names and one variable.
The second dataframe, df2, looks like this:
df2
   name                           Val1 Val2
1  RODRIGUEZ PAREDES MARIA BELEN  a    b
2  RODRIGUEZ MARIA BELEN          c    b
3  VALLEJO JANETA PAOLO ALEXANDER a    a
4  VALLEJO ALEXANDER              b    b
5  MORALES JADAN DIANA            a    a
6  FREIRE PASPUEL BYRON           d    c
7  FREIRE BYRON                   a    c
8  ORTIZ PRADO ESTEBAN            a    a
9  ORTIZ-PRADO ESTEBAN            a    a
10 HENRIQUEZ TRUJILLO AQUILES     b    b
11 HENRIQUEZ AQUILES              a    b
12 HENRIQUEZ A.                   c    c
13 JARAMILLO VIVANCO TATIANA      a    a
14 JARAMILLO T.                   a    b
15 LOZADA TANNYA                  a    a
16 GARCIA BEREGUIAIN MIGUEL ANGEL b    b
17 GARCIA MIGUEL ANGEL            a    a
This dataframe is essential because it contains additional information. Now I will describe my main issue. I need to join these two dataframes and compute a variable that counts the number of similar observations. Both dataframes have the key name, which will be used for the merge, but the join is very troublesome. I will explain with an example. Take the name RODRIGUEZ PAREDES MARIA BELEN from df1. I need to merge with df2 and compute a variable Number that tells how many similar names exist. In this case RODRIGUEZ PAREDES MARIA BELEN is similar or identical to RODRIGUEZ PAREDES MARIA BELEN and RODRIGUEZ MARIA BELEN from df2, so Number should be equal to 2. In addition, after the comparison I need to bring over the variables that belong to the matched name. So for RODRIGUEZ PAREDES MARIA BELEN we would have a and b in Val1 and Val2.
Computing this on the variable name is complex for me, and I also do not know which kind of join I should use to bring in the other variables.
There is one more consideration. For example, the name JARAMILLO TATIANA has no exact match in df2. The variable Number should be 2 because there are two similar names, but Val1 and Val2 must contain the values of the first closest or identical string found in df2. So for this name we would have Val1=a and Val2=a, because the closest match is JARAMILLO VIVANCO TATIANA in df2.
In the end I would like to have a new dataframe like this:
   name                           Var Number Val1 Val2
1  RODRIGUEZ PAREDES MARIA BELEN  1   2      a    b
2  VALLEJO JANETA PAOLO ALEXANDER 1   2      a    a
3  MORALES JADAN DIANA            1   1      a    a
4  FREIRE PASPUEL BYRON           1   2      d    c
5  ORTIZ PRADO ESTEBAN            1   2      a    a
6  HENRIQUEZ TRUJILLO AQUILES     1   3      b    b
7  RIVERA OLIVERO ISMAR           1   0      <NA> <NA>
8  JARAMILLO TATIANA              1   2      a    a
9  LOZADA TANNYA                  1   1      a    a
10 GARCIA BEREGUIAIN MIGUEL ANGEL 1   2      b    b
I have tried left_join() and merge(), but they cannot fill in all the variables, above all Number. I also checked the fuzzyjoin package, but it is not clear to me how to use it. If possible I would prefer a dplyr solution; a quick base solution or a fuzzyjoin solution would also be great.
Many thanks. The data follows:
#Data 1
df1 <- structure(list(name = c("RODRIGUEZ PAREDES MARIA BELEN", "VALLEJO JANETA PAOLO ALEXANDER",
"MORALES JADAN DIANA", "FREIRE PASPUEL BYRON", "ORTIZ PRADO ESTEBAN",
"HENRIQUEZ TRUJILLO AQUILES", "RIVERA OLIVERO ISMAR", "JARAMILLO TATIANA",
"LOZADA TANNYA", "GARCIA BEREGUIAIN MIGUEL ANGEL"), Var = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
#Data 2
df2 <- structure(list(name = c("RODRIGUEZ PAREDES MARIA BELEN", "RODRIGUEZ MARIA BELEN",
"VALLEJO JANETA PAOLO ALEXANDER", "VALLEJO ALEXANDER", "MORALES JADAN DIANA",
"FREIRE PASPUEL BYRON", "FREIRE BYRON", "ORTIZ PRADO ESTEBAN",
"ORTIZ-PRADO ESTEBAN", "HENRIQUEZ TRUJILLO AQUILES", "HENRIQUEZ AQUILES",
"HENRIQUEZ A.", "JARAMILLO VIVANCO TATIANA", "JARAMILLO T.",
"LOZADA TANNYA", "GARCIA BEREGUIAIN MIGUEL ANGEL", "GARCIA MIGUEL ANGEL"
), Val1 = c("a", "c", "a", "b", "a", "d", "a", "a", "a", "b",
"a", "c", "a", "a", "a", "b", "a"), Val2 = c("b", "b", "a", "b",
"a", "c", "c", "a", "a", "b", "b", "c", "a", "b", "a", "b", "a"
)), class = "data.frame", row.names = c(NA, -17L))
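The rule described in the question (count the df2 names similar to each df1 name, then take Val1 and Val2 from the first of them) can be made concrete with a small base R sketch. It rests on an assumption that is not stated in the question: that two names count as "similar" when they share the first surname, taken here as the leading run of letters. The helper first_surname() is hypothetical, and only a subset of the data is repeated so that the block runs on its own.

```r
# Subset of the question's data, repeated so the sketch is self-contained.
df1 <- data.frame(
  name = c("FREIRE PASPUEL BYRON", "ORTIZ PRADO ESTEBAN",
           "RIVERA OLIVERO ISMAR", "JARAMILLO TATIANA"),
  Var = 1L, stringsAsFactors = FALSE
)
df2 <- data.frame(
  name = c("FREIRE PASPUEL BYRON", "FREIRE BYRON", "ORTIZ PRADO ESTEBAN",
           "ORTIZ-PRADO ESTEBAN", "JARAMILLO VIVANCO TATIANA", "JARAMILLO T."),
  Val1 = c("d", "a", "a", "a", "a", "a"),
  Val2 = c("c", "c", "a", "a", "a", "b"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: the first surname (leading run of letters) is the matching key.
# Everything from the first non-letter onwards is dropped.
first_surname <- function(x) sub("[^A-Za-z].*$", "", x)

key1 <- first_surname(df1$name)
key2 <- first_surname(df2$name)

# Number: how many df2 names share the surname of each df1 name.
Number <- vapply(key1, function(k) sum(key2 == k), integer(1), USE.NAMES = FALSE)

# match() gives the position of the first df2 row with that surname, or NA.
first_hit <- match(key1, key2)

result <- data.frame(df1, Number = Number,
                     Val1 = df2$Val1[first_hit], Val2 = df2$Val2[first_hit],
                     stringsAsFactors = FALSE)
print(result)
```

On this subset the sketch gives Number 2, 2, 0, 2 and the Val1/Val2 pairs from the expected table. In real data a shared first surname would often link unrelated people, so the key would probably need to be stricter.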
Here is a suggestion. It is not the exact solution, but I think it will take you further.
You will notice that the result differs from your expected output. You can go through the code line by line, play with max_dist (for example max_dist = .2), or try another method.
By going through the lines you will see which name in df2 is matched to each name in df1:
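library(dplyr)
library(fuzzyjoin)

# Fuzzy left join on name: keep df2 rows whose Jaccard distance is <= max_dist
fuzzyjoin::stringdist_left_join(x = df1, y = df2, max_dist = .2,
                                by = "name",
                                method = "jaccard",
                                distance_col = "dist") %>%
  mutate(id = row_number()) %>%                        # remember the join order
  group_by(name.x) %>%
  add_count(name = "Number") %>%                       # matches per df1 name
  mutate(Number = ifelse(is.na(dist), 0, Number)) %>%  # no match at all: 0
  arrange(id) %>%
  filter(dist == min(dist) | is.na(dist)) %>%          # keep the closest match
  ungroup() %>%
  select(name = name.x, Var, Number, Val1, Val2)       # columns shown below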
   name                           Var   Number Val1  Val2
   <chr>                          <int> <dbl>  <chr> <chr>
1  RODRIGUEZ PAREDES MARIA BELEN  1     2      a     b
2  VALLEJO JANETA PAOLO ALEXANDER 1     2      a     a
3  MORALES JADAN DIANA            1     1      a     a
4  FREIRE PASPUEL BYRON           1     1      d     c
5  ORTIZ PRADO ESTEBAN            1     2      a     a
6  HENRIQUEZ TRUJILLO AQUILES     1     2      b     b
7  RIVERA OLIVERO ISMAR           1     0      NA    NA
8  JARAMILLO TATIANA              1     2      a     a
9  LOZADA TANNYA                  1     1      a     a
10 GARCIA BEREGUIAIN MIGUEL ANGEL 1     2      b     b
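To see why some rows differ from the expected table (FREIRE PASPUEL BYRON gets Number 1 rather than 2), it can help to compute the distances by hand. The sketch below assumes that method = 'jaccard' with the default q-gram size of 1 compares the sets of unique characters in the two strings. jaccard_chars() is a hand-rolled helper for illustration, not a function from fuzzyjoin or stringdist.

```r
# Jaccard distance on sets of unique characters:
# 1 - (size of intersection) / (size of union).
# ASSUMPTION: this mirrors method = "jaccard" with q-grams of length 1.
jaccard_chars <- function(a, b) {
  A <- unique(strsplit(a, "")[[1]])
  B <- unique(strsplit(b, "")[[1]])
  1 - length(intersect(A, B)) / length(union(A, B))
}

d_freire <- jaccard_chars("FREIRE PASPUEL BYRON", "FREIRE BYRON")
d_jaram  <- jaccard_chars("JARAMILLO TATIANA", "JARAMILLO VIVANCO TATIANA")

# FREIRE:    1 - 9/14  (about 0.357), above max_dist = .2, so the pair is dropped
# JARAMILLO: 1 - 10/12 (about 0.167), below .2, so the pair is kept
print(round(c(FREIRE = d_freire, JARAMILLO = d_jaram), 3))
```

Under this assumption, a larger max_dist would admit FREIRE BYRON, but it would also make unrelated names more likely to match, so the threshold is a trade-off to tune against the data.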