Combine fuzzy and exact merge in R

Here is my sample data:

a <- data.frame(name = c('Ace CO', 'Bayes', 'aasd', 'Apple', 'Orange', 'Banana',
                         'Ace CO', 'Bayes', 'aasd', 'Apple', 'Orange', 'Banana'),
                date=c(1991,1991,1991,1991,1991,1991,
                       1992,1992,1992,1992,1992,1992),
                price = c(10, 13, 2, 1, 15, 1,
                          11,15,3,2,14,4))
b <- data.frame(name = c('Ace Co.', 'Bayes INC.', 'asd',
                         'Ace Co.', 'Bayes INC.', 'asd'),
                date=c(1991,1991,1991,1992,1992,1992),
                qty = c(9, 99, 10,10,105,15))

I am left-joining a to b by date and name: date must match exactly, while name should match fuzzily. I have tried stringdist_join, but it only accommodates fuzzy matching.

The expected output is as follows:

c <- data.frame(name = c('Ace Co.', 'Bayes INC.', 'asd',
                         'Ace Co.', 'Bayes INC.', 'asd'),
                date=c(1991,1991,1991,1992,1992,1992),
                qty = c(9, 99, 10,10,105,15),
                price = c(10, 13, 2,11,15,3))

I'd like to do this within a dplyr workflow.

agrep solution

The following function is almost certainly not as general as it should be, but here it goes.

funMerge <- function(X, Y, col, col_approx, sep = "."){
  other_cols.x <- setdiff(names(X), c(col, col_approx))
  other_cols.y <- setdiff(names(Y), c(col, col_approx))
  sp.x <- split(X, X[[col]])
  sp.y <- split(Y, Y[[col]])
  common_names <- intersect(names(sp.x), names(sp.y))


  res <- sapply(common_names, function(sp.name){
    x <- sp.x[[sp.name]]
    y <- sp.y[[sp.name]]
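    # for each name in x, the indices of the approximately matching names in y
    k <- sapply(x[[col_approx]], agrep, y[[col_approx]])
    k <- k[sapply(k, length) > 0]
    k <- unlist(k)
    # row indices in x of the names that found a match
    i <- match(names(k), x[[col_approx]])
    # x is indexed by i (its own rows), y by k (the matched rows)
    df_other.x <- x[i, other_cols.x, drop = FALSE]
    df_other.y <- y[k, other_cols.y, drop = FALSE]
    df_tmp <- data.frame(
      x[i, col], 
      names(k), 
      y[k, col_approx]
    )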
    names(df_tmp) <- c(col, col_approx, paste(col_approx, "y", sep = sep))
    cbind(df_tmp, df_other.x, df_other.y)
  }, simplify = FALSE)
  res <- do.call(rbind, res)
  row.names(res) <- NULL
  res
}

funMerge(a, b, col = "date", col_approx = "name")
#  date   name     name.y price qty
#1 1991 Ace CO    Ace Co.    10   9
#2 1991  Bayes Bayes INC.    13  99
#3 1991   aasd        asd     2  10
#4 1992 Ace CO    Ace Co.    11  10
#5 1992  Bayes Bayes INC.    15 105
#6 1992   aasd        asd     3  15
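
To see why agrep pairs these names, here is a small check on the question's names. It is only an illustration and assumes agrep's default max.distance = 0.1, which for a 6-character pattern allows ceiling(0.1 * 6) = 1 edit. The vector b_names is introduced here just for the example.

```r
# Names from the question's data frame b
b_names <- c("Ace Co.", "Bayes INC.", "asd")

# One substitution ("O" vs "o") is within the allowed single edit
hit_ace <- agrep("Ace CO", b_names)
hit_ace                                   # 1, i.e. "Ace Co."

# agrep matches the pattern against substrings of the target,
# so a short name is found inside a longer one ...
hit_bayes <- agrep("Bayes", b_names, value = TRUE)
hit_bayes                                 # "Bayes INC."

# ... but not the other way round: pattern and target are not interchangeable
hit_reverse <- agrep("Bayes INC.", "Bayes")
hit_reverse                               # integer(0)

# A name with no close counterpart gives an empty result
hit_apple <- agrep("Apple", b_names)
hit_apple                                 # integer(0)
```

Because of this asymmetry, it matters which data frame supplies the pattern: funMerge uses the names in X as patterns and searches for them in Y.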

stringdist solution

The following function uses the stringdist package to compute the pairwise Jaro-Winkler distances between the columns that need to be matched approximately. (With method = "jw" and stringdist's default p = 0, this is the plain Jaro distance; pass p > 0 to get the Winkler prefix adjustment.)

From help('stringdist-metrics') (my emphasis):

The metric you need to choose for an application strongly depends on both the nature of the string (what does the string represent?) and the cause of dissimilarities between the strings you are measuring. For example, if you are comparing human-typed names that may contain typo's, the Jaro-Winkler distance may be of use. If you are comparing names that were written down after hearing them, a phonetic distance may be a better choice.

A more efficient algorithm would be to first split the data sets by the exact-match column and then apply the method of funMerge2 to each piece.

library(stringdist)

funMerge2 <- function(X, Y, col, col_approx, method = "jw", threshold = 0.2){
  x <- X[[col_approx]]
  y <- Y[[col_approx]]
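  # pairwise string distances: rows index X, columns index Y
  d <- stringdistmatrix(x, y, method = method, useBytes = FALSE)
  # all (row, col) pairs closer than the threshold
  w <- which(d < threshold, arr.ind = TRUE)
  Z1 <- X[w[, "row"], ]
  Z2 <- Y[w[, "col"], ]
  res <- cbind(Z1, Z2)
  # keep only the pairs whose exact-match columns agree
  common_cols <- grep(col, names(res))
  res <- res[apply(res[, common_cols], 1, function(x) x[1] == x[2]), ]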
  row.names(res) <- NULL
  res
}

funMerge2(a, b, col = "date", col_approx = "name")
#    name date price       name date qty
#1 Ace CO 1991    10    Ace Co. 1991   9
#2  Bayes 1991    13 Bayes INC. 1991  99
#3   aasd 1991     2        asd 1991  10
#4 Ace CO 1992    11    Ace Co. 1992  10
#5  Bayes 1992    15 Bayes INC. 1992 105
#6   aasd 1992     3        asd 1992  15
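
The split-first idea mentioned above could look roughly like the sketch below. It is not part of the original answer: funMergeSplit and its dist_fun argument are hypothetical names, and base R's adist() (partial, case-insensitive Levenshtein) is used as a stand-in distance so the example runs without extra packages. The threshold of 2 is an assumption tied to that distance.

```r
# Sample data from the question
a <- data.frame(name = rep(c("Ace CO", "Bayes", "aasd", "Apple", "Orange", "Banana"), 2),
                date = rep(c(1991, 1992), each = 6),
                price = c(10, 13, 2, 1, 15, 1, 11, 15, 3, 2, 14, 4),
                stringsAsFactors = FALSE)
b <- data.frame(name = rep(c("Ace Co.", "Bayes INC.", "asd"), 2),
                date = rep(c(1991, 1992), each = 3),
                qty = c(9, 99, 10, 10, 105, 15),
                stringsAsFactors = FALSE)

# Hypothetical helper: match exactly on `col`, then fuzzily on `col_approx`
# inside each group of rows that share the same `col` value.
funMergeSplit <- function(X, Y, col, col_approx,
                          dist_fun = function(x, y) adist(x, y, partial = TRUE, ignore.case = TRUE),
                          threshold = 2) {
  keys <- intersect(unique(X[[col]]), unique(Y[[col]]))
  pieces <- lapply(keys, function(key) {
    x <- X[X[[col]] == key, , drop = FALSE]
    y <- Y[Y[[col]] == key, , drop = FALSE]
    # distance matrix only between rows that already agree on the exact key
    d <- dist_fun(x[[col_approx]], y[[col_approx]])
    w <- which(d < threshold, arr.ind = TRUE)
    # drop the key column from y and rename its fuzzy column to avoid duplicates
    y_out <- y[w[, 2], setdiff(names(y), col), drop = FALSE]
    names(y_out)[names(y_out) == col_approx] <- paste0(col_approx, ".y")
    cbind(x[w[, 1], , drop = FALSE], y_out)
  })
  res <- do.call(rbind, pieces)
  row.names(res) <- NULL
  res
}

res_split <- funMergeSplit(a, b, col = "date", col_approx = "name")
res_split
```

Here two 6 x 3 distance matrices are computed instead of one 12 x 6 matrix, and no cross-year pairs need to be filtered out afterwards. To stay with the Jaro-Winkler setup of funMerge2, one could presumably pass dist_fun = function(x, y) stringdist::stringdistmatrix(x, y, method = "jw") together with threshold = 0.2.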

Using a distance matrix to merge fuzzy strings

Main principle

Get the distance matrix between the unique terms of your vectors. Then check which threshold gives the best results (I think this has to be supervised by a human). Then use this new correspondence table to merge your data frames. Finally, you can change names (e.g. adding "inc.") more easily because you now have "standardized" names.

With utils::adist()

I think stringdist is better because you can choose the method, but here is a base R example showing how to use this notion of distance to get the expected output.

# First, create a matrix of Levenshtein distances between the name fields of both sources (or use another method from stringdist);
# partial = TRUE scores each name in a against substrings of the names in b
dist_name_matrix <- adist(unique(a$name), unique(b$name), partial = TRUE, ignore.case = TRUE)
colnames(dist_name_matrix) <- unique(b$name)
rownames(dist_name_matrix) <- unique(a$name)

# Convert the matrix to a long data frame so it is easier to inspect; you will need to check the pairs yourself
library(dplyr)
library(tidyr)

dist_df <- dist_name_matrix %>% 
  as.data.frame() %>% 
  tibble::rownames_to_column(., "a_name") %>% 
  pivot_longer(cols = 2:last_col(), names_to = "b_name", values_to = "dist") %>% 
  filter(dist < 2) # you might need to adapt this to your needs

# This can now be used to merge your data, e.g.:

a %>% 
  left_join(., dist_df, by = c("name" = "a_name")) %>% 
  right_join(., b, by = c("b_name" = "name", "date" = "date")) %>% 
  # added just to match your expected output
  filter(!is.na(name)) %>% 
  select(b_name, date, qty, price)

Output:

      b_name date qty price
1    Ace Co. 1991   9    10
2 Bayes INC. 1991  99    13
3        asd 1991  10     2
4    Ace Co. 1992  10    11
5 Bayes INC. 1992 105    15
6        asd 1992  15     3

The same process can be used with stringdist:

library(stringdist)
dist_name_matrix <- stringdistmatrix(unique(a$name), unique(b$name), method = "jw", useBytes = FALSE)
colnames(dist_name_matrix) <- unique(b$name)
rownames(dist_name_matrix) <- unique(a$name)

Then just adapt the threshold after a manual check, e.g. filter(dist < 0.2).
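
One possible way to support that manual check is to list, for each name in a, only its closest candidate in b, sorted by distance. The sketch below is an illustration in base R, using the same adist() settings as above; the object names (a_names, b_names, nearest, accepted) are made up for the example.

```r
# Unique names from the question's sample data
a_names <- c("Ace CO", "Bayes", "aasd", "Apple", "Orange", "Banana")
b_names <- c("Ace Co.", "Bayes INC.", "asd")

# Partial (substring), case-insensitive Levenshtein distance
d <- adist(a_names, b_names, partial = TRUE, ignore.case = TRUE)
dimnames(d) <- list(a_names, b_names)

# For each name in a, keep only its closest candidate in b
best <- apply(d, 1, which.min)
nearest <- data.frame(
  a_name = a_names,
  b_name = b_names[best],
  dist   = d[cbind(seq_along(a_names), best)],
  stringsAsFactors = FALSE
)

# Sort by distance so borderline pairs near a candidate threshold sit together
nearest <- nearest[order(nearest$dist), ]
print(nearest)

# Pairs that would be accepted with the threshold used above (dist < 2)
accepted <- nearest[nearest$dist < 2, ]
```

Reading the sorted table from the top makes it easier to spot where true matches stop and false ones begin, which is where the threshold should go.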
