R function to repeat a function based on column values

Question

I have a large data set on which I need to do string matching. Drawing on some very useful posts from this site, I have written a function that does the string matching for my data set. My sample data and code are below.

SAMPLE DATA

Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
AREACODE <- c('10','10','14','20','30')
Year1 <- c(2001:2005)

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)
AREA_CODE <- c('10','10','10','20','30','40','50','61','64', '99')

library(data.table)  # provides data.table() and :=

data1 <- data.table(Address1, Year1, AREACODE)
data2 <- data.table(Address2, Year2, AREA_CODE)
data2[, unique_id := sprintf("%06d", 1:nrow(data2))]

CODE

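library(stringdist)  # provides stringdistmatrix()

fn.fuzzymatch <- function(dat1, dat2, string1, string2, meth) {

  # Pairwise distances: one row per dat1 string, one column per dat2 string
  dist.name <- stringdistmatrix(dat1[[string1]], dat2[[string2]], method = meth)

  # Smallest distance for each dat1 string
  min.name <- apply(dist.name, 1, min)

  match.s1.s2 <- NULL
  for (i in seq_len(nrow(dist.name))) {
    # First dat2 row that attains the minimum distance
    s2.i <- match(min.name[i], dist.name[i, ])
    s1.i <- i
    match.s1.s2 <- rbind(data.frame(s1_row = s1.i, s2_row = s2.i, s1name = dat1[s1.i, ][[string1]], s2name = dat2[s2.i, ][[string2]], dist = min.name[i]), match.s1.s2)
  }
  # Rows were prepended in the loop, so restore the dat1 order
  output <- match.s1.s2[order(match.s1.s2$s1_row), ]
  return(output)
}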


match_50 <- fn.fuzzymatch(data1,data2,"Address1","Address2","dl")

This works fine for data at the country level, but I have multiple data files at the region level, and each region has multiple areas. The area code is available as the AREACODE variable in data1 and the AREA_CODE variable in data2. I want to update my function so that:

string matching is done within each area, and the output includes that area code
one consolidated output is returned per region, covering all area codes in that region.

I tried using split to convert the data files into lists and then rbindlist to combine the results, but I could not get it to work and kept getting different kinds of errors. I am sure there is a way to do this, but I have not found it. Any suggestions would be appreciated.

Answer 1

While you could probably use an apply function to repeat the matching over the separate data files for each region, here is a fuzzyjoin solution based on my answer to your previous question.
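For comparison, the apply-style route mentioned above could look roughly like the sketch below. It is not the method used in this answer: match_by_area is a hypothetical helper, the toy tables are trimmed from the sample data, and it assumes that only area codes present in both tables should be matched (rows from any other area are dropped).

```r
library(data.table)
library(stringdist)

# Toy tables trimmed from the sample data (illustration only)
data1 <- data.table(
  Address1 = c("786, GALI NO 5, XYZ", "45-B, GALI NO5, XYZ", "HECTIC, 99 STREET, PQR"),
  AREACODE = c("10", "20", "30")
)
data2 <- data.table(
  Address2 = c("abc, pqr, xyz", "786, GALI NO 4 XYZ",
               "del, 546, strret2, towards east, pqr", "23/4, STREET 2, PQR"),
  AREA_CODE = c("10", "10", "20", "30")
)

# Hypothetical helper: best match per dat1 row, searched only within the same area
match_by_area <- function(dat1, dat2, string1, string2, area1, area2, meth = "dl") {
  # Assumption: areas missing from either table are skipped
  areas <- intersect(unique(dat1[[area1]]), unique(dat2[[area2]]))
  res <- lapply(areas, function(a) {
    d1 <- dat1[dat1[[area1]] == a, ]
    d2 <- dat2[dat2[[area2]] == a, ]
    # Rows index d1, columns index d2
    d <- stringdistmatrix(d1[[string1]], d2[[string2]], method = meth)
    best <- apply(d, 1, which.min)  # first minimum in each row
    data.table(area = a,
               s1name = d1[[string1]],
               s2name = d2[[string2]][best],
               dist = d[cbind(seq_len(nrow(d)), best)])
  })
  rbindlist(res)  # one consolidated table for all areas
}

out <- match_by_area(data1, data2, "Address1", "Address2", "AREACODE", "AREA_CODE")
```

Calling the helper once per region file and combining the results with rbindlist would give one table per region, with the area code carried in the area column.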

It looks for the best stringdist match for the address, and the area code must match exactly (==). I also required Year2 to be >= Year1, just for demonstration.
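To make the three conditions concrete, here is a small sketch that applies them to every pair of rows by hand, without fuzzyjoin. The toy rows are hypothetical (trimmed from the sample data), and the cross join via merge is only an illustration, not how fuzzy_join works internally.

```r
library(stringdist)

# Toy rows trimmed from the sample data (illustration only)
x <- data.frame(Address1 = c("786, GALI NO 5, XYZ", "HECTIC, 99 STREET, PQR"),
                AREACODE = c("10", "30"),
                Year1 = c(2001L, 2005L),
                stringsAsFactors = FALSE)
y <- data.frame(Address2 = c("786, GALI NO 4 XYZ", "45B, GALI NO 5, XYZ", "23/4, STREET 2, PQR"),
                AREA_CODE = c("10", "10", "30"),
                Year2 = c(2002L, 2003L, 2005L),
                stringsAsFactors = FALSE)

# Every x row paired with every y row (Cartesian product)
pairs <- merge(x, y, by = NULL)

# Condition 1: string distance between the two addresses (smaller is better)
pairs$dist <- stringdist(pairs$Address1, pairs$Address2, method = "dl")

# Conditions 2 and 3: exact area code match, and Year2 not earlier than Year1
keep <- pairs$AREACODE == pairs$AREA_CODE & pairs$Year1 <= pairs$Year2
candidates <- pairs[keep, ]
```

The rows left in candidates are the pairs that a per-address minimum over dist would then choose from.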

Finally, I used dplyr::group_by and dplyr::top_n to keep the minimum-distance matches. I had to assume how to break ties (I picked the match with the largest Year2). You can also use slice_min, which replaces the older top_n. If the original order matters rather than alphabetical order, use mutate(rank = row_number(dist)) %>% filter(rank == 1).
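The two alternatives can be compared on a small example. This is a sketch on a hypothetical matches table standing in for the join output (the column names are made up for illustration), and slice_min assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical join output: address "A" has two candidates tied on dist
matches <- tibble(
  Address1 = c("A", "A", "B"),
  Year2 = c(2003L, 2007L, 2005L),
  dist = c(2, 2, 4)
)

# slice_min keeps the smallest dist per group;
# with_ties = FALSE limits the result to one row per group
best_slice <- matches %>%
  group_by(Address1) %>%
  slice_min(dist, n = 1, with_ties = FALSE) %>%
  ungroup()

# row_number() breaks ties by order of appearance,
# so the first tied row in each group gets rank 1
best_rank <- matches %>%
  group_by(Address1) %>%
  mutate(rank = row_number(dist)) %>%
  filter(rank == 1) %>%
  ungroup()
```

Both variants return one row per address. To prefer the largest Year2 among ties, as in the answer, one option is to arrange by desc(Year2) before ranking.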

Data:

Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
AREACODE <- c('10','10','14','20','30')
Year1 <- c(2001:2005)

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)
AREA_CODE <- c('10','10','10','20','30','40','50','61','64', '99')

library(data.table)  # provides data.table() and :=

data1 <- data.table(Address1, Year1, AREACODE)
data2 <- data.table(Address2, Year2, AREA_CODE)
data2[, unique_id := sprintf("%06d", 1:nrow(data2))]

Solution:

library(fuzzyjoin, quietly = TRUE); library(dplyr, quietly = TRUE)

# First, need to define match_fun_stringdist 
# Code from stringdist_join from https://github.com/dgrtwo/fuzzyjoin
match_fun_stringdist <- function(v1, v2) {
  
  # Can't pass these parameters in from fuzzy_join because of multiple incompatible match_funs, so I set them here.
  ignore_case = FALSE
  method = "dl"
  max_dist = 99
  distance_col = "dist"
  
  if (ignore_case) {
    v1 <- stringr::str_to_lower(v1)
    v2 <- stringr::str_to_lower(v2)
  }
  
  # shortcut for Levenshtein-like methods: if the difference in
  # string length is greater than the maximum string distance, the
  # edit distance must be at least that large
  
  # length is much faster to compute than string distance
  if (method %in% c("osa", "lv", "dl")) {
    length_diff <- abs(stringr::str_length(v1) - stringr::str_length(v2))
    include <- length_diff <= max_dist
    
    dists <- rep(NA, length(v1))
    
    dists[include] <- stringdist::stringdist(v1[include], v2[include], method = method)
  } else {
    # have to compute them all
    dists <- stringdist::stringdist(v1, v2, method = method)
  }
  # tibble() replaces the deprecated data_frame()
  ret <- dplyr::tibble(include = (dists <= max_dist))
  if (!is.null(distance_col)) {
    ret[[distance_col]] <- dists
  }
  ret
}

# Finally, call fuzzy_join
fuzzy_join(data1, data2, 
           by = list(x = c("Address1", "AREACODE", "Year1"), y = c("Address2", "AREA_CODE", "Year2")), 
           match_fun = list(match_fun_stringdist, `==`, `<=`),
           mode = "left"
           ) %>%
  group_by(Address1, Year1, AREACODE) %>%
  top_n(1, -Address1.dist) %>%
  top_n(1, Year2) %>%
  select(unique_id, Address1.dist, everything())

Answer 1
0 2017-06-06 06:47:01
