R function for a function to be repeated based on column values

I have a large data set on which I need to do string matching. Several posts on this site were very useful, and with their help I wrote a function that does the string matching for my data set. My sample data and code are below.

SAMPLE DATA

library(data.table)

Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
AREACODE <- c('10','10','14','20','30')
Year1 <- c(2001:2005)

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)
AREA_CODE <- c('10','10','10','20','30','40','50','61','64', '99')

data1 <- data.table(Address1, Year1, AREACODE)
data2 <- data.table(Address2, Year2, AREA_CODE)
data2[, unique_id := sprintf("%06d", 1:nrow(data2))]

CODE

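library(stringdist)

# For each row of dat1, find the closest string in dat2 (first match on ties)
fn.fuzzymatch <- function(dat1, dat2, string1, string2, meth) {

  # distance matrix: rows are dat1 strings, columns are dat2 strings
  dist.name <- stringdistmatrix(dat1[[string1]], dat2[[string2]], method = meth)

  # smallest distance in each row
  min.name <- apply(dist.name, 1, min)

  match.s1.s2 <- NULL
  for (i in seq_len(nrow(dist.name))) {
    # column index of the first dat2 string at the minimum distance
    s2.i <- match(min.name[i], dist.name[i, ])
    s1.i <- i
    match.s1.s2 <- rbind(data.frame(s1_row = s1.i, s2_row = s2.i, s1name = dat1[s1.i, ][[string1]], s2name = dat2[s2.i, ][[string2]], dist = min.name[i]), match.s1.s2)
  }
  # rows were prepended inside the loop, so restore the dat1 order
  output <- match.s1.s2[order(match.s1.s2$s1_row), ]
  return(output)
}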


match_50 <- fn.fuzzymatch(data1,data2,"Address1","Address2","dl")

This works fine for country-level data, but I also have multiple data files at region level, and each region has multiple areas. The area code within each region is given by the AREACODE variable in data1 and the AREA_CODE variable in data2. I want to update my function so that:

  1. string matching is done separately for each area, and the output includes that area code
  2. for each region, a single output is returned that consolidates the results for all area codes in that region.

I tried using split to convert the data files into lists and then rbindlist to combine the results, but I could not get it to work and kept getting different kinds of errors. I am sure there is a way to do this, but I have not been able to find it. Any suggestions would be welcome.

While you can probably use an apply function to repeat the matching over the separate data files of the different regions, here is a fuzzyjoin solution based on my answer to your previous question.
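For comparison, the apply-style route mentioned above might look roughly like the sketch below. It is only an illustration under stated assumptions: match_one_area and match_by_area are hypothetical helper names that do not appear in the question, and area codes with no candidates in data2 (here '14') are simply dropped rather than returned with NA.

library(data.table)
library(stringdist)

Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
AREACODE <- c('10','10','14','20','30')
Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
AREA_CODE <- c('10','10','10','20','30','40','50','61','64', '99')
data1 <- data.table(Address1, AREACODE)
data2 <- data.table(Address2, AREA_CODE)

# Hypothetical helper: best match in dat2 for every row of dat1 (one area only)
match_one_area <- function(dat1, dat2, method = "dl") {
  d <- stringdistmatrix(dat1$Address1, dat2$Address2, method = method)
  best <- apply(d, 1, which.min)  # column of the smallest distance; first one on ties
  data.table(
    s1name = dat1$Address1,
    s2name = dat2$Address2[best],
    dist   = d[cbind(seq_len(nrow(d)), best)]
  )
}

# Hypothetical wrapper: run the helper per area code and stack the results
match_by_area <- function(data1, data2, method = "dl") {
  # only area codes present in both tables can produce a match
  areas <- intersect(unique(data1$AREACODE), unique(data2$AREA_CODE))
  res <- lapply(areas, function(a) {
    match_one_area(data1[AREACODE == a], data2[AREA_CODE == a], method)
  })
  names(res) <- areas
  rbindlist(res, idcol = "area_code")  # list names become the area_code column
}

res <- match_by_area(data1, data2)

Calling the wrapper once per region file and binding those results with rbindlist again would give one consolidated table per region or overall, depending on where you stop.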

It looks for the best stringdist match for the address, and the area codes must match exactly ( == ). I also required Year2 to be >= Year1, just for demonstration.
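As a quick illustration of what the string distance measures, here is a small sketch using stringdist::stringdist directly. The "dl" method (full Damerau-Levenshtein) counts a swap of two adjacent characters as one edit, while "lv" (plain Levenshtein) needs two; the short strings are made up for the demonstration, and the last pair is taken from the sample addresses.

library(stringdist)

# adjacent transposition: one edit under "dl", two substitutions under "lv"
d_dl <- stringdist("ab", "ba", method = "dl")
d_lv <- stringdist("ab", "ba", method = "lv")

# a single wrong letter is one edit under either method
d_typo <- stringdist("strret", "street", method = "dl")

# two addresses from the sample data: "5" -> "4" plus one dropped comma
d_addr <- stringdist("786, GALI NO 5, XYZ", "786, GALI NO 4 XYZ", method = "dl")

The exact-match and year conditions are ordinary comparisons, so only the address column needs a distance function.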

Finally, I used dplyr::group_by and dplyr::top_n to keep the minimum-distance matches. I had to assume how ties should be broken (I picked the match with the largest Year2). You can also use slice_min, which supersedes the older top_n; if the original order, rather than alphabetical order, is important, use mutate(rank = row_number(dist)) %>% filter(rank == 1).
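The tie-handling options can be compared on a toy table. This is a sketch with made-up values (the cand table and its columns are assumptions for illustration, not part of the question's data), and it assumes dplyr 1.0.0 or later for slice_min and slice_head.

library(dplyr)

# toy candidate matches: id "a" has two candidates tied at dist == 2
cand <- tibble(
  id    = c("a", "a", "a", "b", "b"),
  match = c("x", "y", "z", "x", "y"),
  dist  = c(2, 2, 5, 1, 3),
  year  = c(2001, 2004, 2003, 2002, 2005)
)

# slice_min keeps ties by default, so "a" keeps both rows with dist == 2
with_ties <- cand %>%
  group_by(id) %>%
  slice_min(dist, n = 1) %>%
  ungroup()

# explicit tie-break: smallest dist first, then largest year
one_each <- cand %>%
  group_by(id) %>%
  arrange(dist, desc(year), .by_group = TRUE) %>%
  slice_head(n = 1) %>%
  ungroup()

# row_number() ranks ties by order of appearance, so the first row seen wins
first_seen <- cand %>%
  group_by(id) %>%
  mutate(rank = row_number(dist)) %>%
  filter(rank == 1) %>%
  ungroup()

Which variant is appropriate depends on whether a tie should yield several rows, a rule-based winner, or simply the first candidate encountered.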

Data:

library(data.table)

Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
AREACODE <- c('10','10','14','20','30')
Year1 <- c(2001:2005)

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)
AREA_CODE <- c('10','10','10','20','30','40','50','61','64', '99')

data1 <- data.table(Address1, Year1, AREACODE)
data2 <- data.table(Address2, Year2, AREA_CODE)
data2[, unique_id := sprintf("%06d", 1:nrow(data2))]

Solution:

library(fuzzyjoin, quietly = TRUE); library(dplyr, quietly = TRUE)

# First, need to define match_fun_stringdist 
# Code from stringdist_join from https://github.com/dgrtwo/fuzzyjoin
match_fun_stringdist <- function(v1, v2) {
  
  # Can't pass these parameters in from fuzzy_join because of multiple incompatible match_funs, so I set them here.
  ignore_case = FALSE
  method = "dl"
  max_dist = 99
  distance_col = "dist"
  
  if (ignore_case) {
    v1 <- stringr::str_to_lower(v1)
    v2 <- stringr::str_to_lower(v2)
  }
  
  # shortcut for Levenshtein-like methods: if the difference in
  # string length is greater than the maximum string distance, the
  # edit distance must be at least that large
  
  # length is much faster to compute than string distance
  if (method %in% c("osa", "lv", "dl")) {
    length_diff <- abs(stringr::str_length(v1) - stringr::str_length(v2))
    include <- length_diff <= max_dist
    
    dists <- rep(NA, length(v1))
    
    dists[include] <- stringdist::stringdist(v1[include], v2[include], method = method)
  } else {
    # have to compute them all
    dists <- stringdist::stringdist(v1, v2, method = method)
  }
  # tibble::tibble replaces the deprecated dplyr::data_frame
  ret <- tibble::tibble(include = (dists <= max_dist))
  if (!is.null(distance_col)) {
    ret[[distance_col]] <- dists
  }
  ret
}

# Finally, call fuzzy_join
fuzzy_join(data1, data2, 
           by = list(x = c("Address1", "AREACODE", "Year1"), y = c("Address2", "AREA_CODE", "Year2")), 
           match_fun = list(match_fun_stringdist, `==`, `<=`),
           mode = "left"
           ) %>%
  group_by(Address1, Year1, AREACODE) %>%
  top_n(1, -Address1.dist) %>%
  top_n(1, Year2) %>%
  select(unique_id, Address1.dist, everything())
