Combine fuzzy and exact merge in R
Here is my sample data:
a <- data.frame(name = c('Ace CO', 'Bayes', 'aasd', 'Apple', 'Orange', 'Banana',
'Ace CO', 'Bayes', 'aasd', 'Apple', 'Orange', 'Banana'),
date=c(1991,1991,1991,1991,1991,1991,
1992,1992,1992,1992,1992,1992),
price = c(10, 13, 2, 1, 15, 1,
11,15,3,2,14,4))
b <- data.frame(name = c('Ace Co.', 'Bayes INC.', 'asd',
'Ace Co.', 'Bayes INC.', 'asd'),
date=c(1991,1991,1991,1992,1992,1992),
qty = c(9, 99, 10,10,105,15))
I am left-joining a to b by date and name; date must match exactly while name is matched fuzzily. I have tried stringdist_join, but it only accommodates fuzzy merging.
The expected output is as follows:
c<- data.frame(name = c('Ace Co.', 'Bayes INC.', 'asd',
'Ace Co.', 'Bayes INC.', 'asd'),
date=c(1991,1991,1991,1992,1992,1992),
qty = c(9, 99, 10,10,105,15),
price = c(10, 13, 2,11,15,3))
I'd like to do this with dplyr.
agrep solution
The following function is almost surely not as general as it should be. But here it goes.
funMerge <- function(X, Y, col, col_approx, sep = "."){
  other_cols.x <- setdiff(names(X), c(col, col_approx))
  other_cols.y <- setdiff(names(Y), c(col, col_approx))
  # split both data sets by the exact-match column
  sp.x <- split(X, X[[col]])
  sp.y <- split(Y, Y[[col]])
  common_names <- intersect(names(sp.x), names(sp.y))
  res <- sapply(common_names, function(sp.name){
    x <- sp.x[[sp.name]]
    y <- sp.y[[sp.name]]
    # k: for each x value, the positions of its approximate matches in y
    k <- sapply(x[[col_approx]], agrep, y[[col_approx]])
    k <- k[sapply(k, length) > 0]
    k <- unlist(k)
    # i: the corresponding row positions in x (x must be indexed by i, not k)
    i <- match(names(k), x[[col_approx]])
    df_other.x <- x[i, other_cols.x, drop = FALSE]
    df_other.y <- y[k, other_cols.y, drop = FALSE]
    df_tmp <- data.frame(
      x[i, col],
      names(k),
      y[k, col_approx]
    )
    names(df_tmp) <- c(col, col_approx, paste(col_approx, "y", sep = sep))
    cbind(df_tmp, df_other.x, df_other.y)
  }, simplify = FALSE)
  res <- do.call(rbind, res)
  row.names(res) <- NULL
  res
}
funMerge(a, b, col = "date", col_approx = "name")
# date name name.y price qty
#1 1991 Ace Co Ace Co. 10 9
#2 1991 Bayes Bayes Inc. 13 99
#3 1991 asd asdf 2 10
#4 1992 Ace Co Ace CO. 11 10
#5 1992 Bayes Bayes INC. 15 105
#6 1992 asd aasdf 3 15
stringdist solution
The following function uses package stringdist to compute the pairwise Jaro-Winkler distances between the columns that need to be matched approximately.
From help('stringdist-metrics'), my emphasis:
The metric you need to choose for an application strongly depends on both the nature of the string (what does the string represent?) and the cause of dissimilarities between the strings you are measuring. For example, if you are comparing human-typed names that may contain typo's, the Jaro-Winkler distance may be of use. If you are comparing names that were written down after hearing them, a phonetic distance may be a better choice.
A more efficient algorithm would be to first split the data sets by the exact-match column and then apply the method of funMerge2 to each group.
library(stringdist)
funMerge2 <- function(X, Y, col, col_approx, method = "jw", threshold = 0.2){
  x <- X[[col_approx]]
  y <- Y[[col_approx]]
  # pairwise distances between all values of the fuzzy column
  d <- stringdistmatrix(x, y, method = method, useBytes = FALSE)
  w <- which(d < threshold, arr.ind = TRUE)
  Z1 <- X[w[, "row"], ]
  Z2 <- Y[w[, "col"], ]
  res <- cbind(Z1, Z2)
  # keep only the pairs whose exact-match columns agree
  common_cols <- grep(col, names(res))
  res <- res[apply(res[, common_cols], 1, function(x) x[1] == x[2]), ]
  row.names(res) <- NULL
  res
}
funMerge2(a, b, col = "date", col_approx = "name")
# name date price name date qty
#1 Ace Co 1991 10 Ace Co. 1991 9
#2 Bayes 1991 13 Bayes Inc. 1991 99
#3 asd 1991 2 asdf 1991 10
#4 Ace Co 1992 11 Ace CO. 1992 10
#5 Bayes 1992 15 Bayes INC. 1992 105
#6 asd 1992 3 aasdf 1992 15
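The split-first idea mentioned above can be sketched roughly as follows. This is only an illustration, not part of the original answer: the helper name funMergeSplit is hypothetical, the data are the question's sample data, and the 0.2 threshold is simply carried over from funMerge2. Distances are computed only between rows that already share the same date, so no cross-date pairs are ever built.

```r
library(stringdist)

# Sample data from the question
a <- data.frame(name = c('Ace CO', 'Bayes', 'aasd', 'Apple', 'Orange', 'Banana',
                         'Ace CO', 'Bayes', 'aasd', 'Apple', 'Orange', 'Banana'),
                date = c(1991, 1991, 1991, 1991, 1991, 1991,
                         1992, 1992, 1992, 1992, 1992, 1992),
                price = c(10, 13, 2, 1, 15, 1,
                          11, 15, 3, 2, 14, 4))
b <- data.frame(name = c('Ace Co.', 'Bayes INC.', 'asd',
                         'Ace Co.', 'Bayes INC.', 'asd'),
                date = c(1991, 1991, 1991, 1992, 1992, 1992),
                qty = c(9, 99, 10, 10, 105, 15))

# Hypothetical helper: fuzzy-match `col_approx` only within groups
# that share the same value of the exact-match column `col`.
funMergeSplit <- function(X, Y, col, col_approx, method = "jw", threshold = 0.2) {
  sp.x <- split(X, X[[col]])
  sp.y <- split(Y, Y[[col]])
  common <- intersect(names(sp.x), names(sp.y))
  res <- lapply(common, function(key) {
    x <- sp.x[[key]]
    y <- sp.y[[key]]
    d <- stringdistmatrix(as.character(x[[col_approx]]),
                          as.character(y[[col_approx]]),
                          method = method)
    w <- which(d < threshold, arr.ind = TRUE)
    # drop the duplicated exact column from y and rename its fuzzy column
    y_part <- y[w[, "col"], setdiff(names(y), col), drop = FALSE]
    names(y_part)[names(y_part) == col_approx] <- paste0(col_approx, ".y")
    cbind(x[w[, "row"], , drop = FALSE], y_part)
  })
  res <- do.call(rbind, res)
  row.names(res) <- NULL
  res
}

res_split <- funMergeSplit(a, b, col = "date", col_approx = "name")
res_split
```

The result has the columns name, date, price, name.y and qty. Whether 0.2 is a sensible cut-off for other data still has to be checked by hand.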
Main principle
Get the distance matrix between the unique terms of your vectors. Then check which threshold leads to the best results (I think this has to be human-supervised). Then use this new correspondence table to merge your data frames. Finally, you can change names (e.g. adding "inc.") more easily because you have "standardized" names.
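To make the threshold check concrete, here is a small base R sketch, assuming the unique names from the question's sample data. It lists every candidate pair with its distance, closest first, so the cut-off can be chosen by eye. The object names (a_names, b_names, pairs) are illustrative only.

```r
# Unique names taken from the question's sample data
a_names <- c("Ace CO", "Bayes", "aasd", "Apple", "Orange", "Banana")
b_names <- c("Ace Co.", "Bayes INC.", "asd")

# Approximate (substring) Levenshtein distances, ignoring case
d <- adist(a_names, b_names, partial = TRUE, ignore.case = TRUE)
dimnames(d) <- list(a_names, b_names)

# Long table of all candidate pairs (column-major order of d)
pairs <- data.frame(
  a_name = rownames(d)[row(d)],
  b_name = colnames(d)[col(d)],
  dist   = as.vector(d)
)

# Closest pairs first: inspect these to decide where to cut
pairs <- pairs[order(pairs$dist), ]
head(pairs)
```

Pairs at the top of the table are the likely matches; the point where unrelated names start to appear suggests the threshold.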
With utils::adist()
I think stringdist is better because you can choose the method, but here is a base example as a suggestion on how to use this concept of distance to get the expected output.
# First, create a matrix of generalized Levenshtein distances between the name fields of both sources; partial = TRUE allows substring matches (or use another method from stringdist)
dist_name_matrix <- adist(unique(a$name), unique(b$name), partial = TRUE, ignore.case = TRUE)
colnames(dist_name_matrix) <- unique(b$name)
rownames(dist_name_matrix) <- unique(a$name)
# let's convert this matrix to a data frame so it is easier to inspect; you will need to check it yourself
library(dplyr)
library(tidyr)
dist_df <- dist_name_matrix %>%
as.data.frame() %>%
tibble::rownames_to_column(., "a_name") %>%
pivot_longer(cols = 2:last_col(), names_to = "b_name", values_to = "dist") %>%
filter(dist < 2) # you might need to adapt this to your needs
# Now this can be used to merge your data, e.g.
a %>%
left_join(., dist_df, by = c("name" = "a_name")) %>%
right_join(., b, by = c("b_name" = "name", "date" = "date")) %>%
# added just to match your expected output
filter(!is.na(name)) %>%
select(b_name, date, qty, price)
Output:
b_name date qty price
1 Ace Co. 1991 9 10
2 Bayes INC. 1991 99 13
3 asd 1991 10 2
4 Ace Co. 1992 10 11
5 Bayes INC. 1992 105 15
6 asd 1992 15 3
The same process can be used with stringdist:
library(stringdist)
dist_name_matrix <- stringdistmatrix(unique(a$name), unique(b$name), method = "jw", useBytes = FALSE)
colnames(dist_name_matrix) <- unique(b$name)
rownames(dist_name_matrix) <- unique(a$name)
Then just adapt the threshold after a human check, e.g. filter(dist < 0.2).
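For completeness, a sketch of the remaining steps with stringdist, assuming the question's sample data and a threshold of 0.2 (an assumption that should be verified by inspection). It builds the correspondence table from the distance matrix and then uses base merge() for the exact join on name and date; the object names lookup, a_std and out are illustrative.

```r
library(stringdist)

# Sample data from the question
a <- data.frame(name = c('Ace CO', 'Bayes', 'aasd', 'Apple', 'Orange', 'Banana',
                         'Ace CO', 'Bayes', 'aasd', 'Apple', 'Orange', 'Banana'),
                date = c(1991, 1991, 1991, 1991, 1991, 1991,
                         1992, 1992, 1992, 1992, 1992, 1992),
                price = c(10, 13, 2, 1, 15, 1,
                          11, 15, 3, 2, 14, 4))
b <- data.frame(name = c('Ace Co.', 'Bayes INC.', 'asd',
                         'Ace Co.', 'Bayes INC.', 'asd'),
                date = c(1991, 1991, 1991, 1992, 1992, 1992),
                qty = c(9, 99, 10, 10, 105, 15))

a_names <- unique(as.character(a$name))
b_names <- unique(as.character(b$name))

# Jaro-Winkler distances between the unique names of both sources
d <- stringdistmatrix(a_names, b_names, method = "jw", useBytes = FALSE)

# Correspondence table: pairs below the assumed threshold of 0.2
w <- which(d < 0.2, arr.ind = TRUE)
lookup <- data.frame(a_name = a_names[w[, "row"]],
                     b_name = b_names[w[, "col"]])

# Attach the standardized name to a (unmatched names are dropped),
# then join to b exactly on name and date
a_std <- merge(a, lookup, by.x = "name", by.y = "a_name")
out <- merge(b, a_std[, c("b_name", "date", "price")],
             by.x = c("name", "date"), by.y = c("b_name", "date"),
             all.x = TRUE)
out
```

Note that merge() sorts the result by the key columns, so the row order may differ from the expected output even though the content is the same.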