R function for a function to be repeated based on column values
I have a large data set for which I need to do string matching. I found some very useful posts on this site and, referring to them, created a function to do the string matching for my dataset. My sample data and code are below.
SAMPLE DATA
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
AREACODE <- c('10','10','14','20','30')
Year1 <- c(2001:2005)
Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)
AREA_CODE <- c('10','10','10','20','30','40','50','61','64', '99')
library(data.table)  # provides data.table() and :=
data1 <- data.table(Address1, Year1, AREACODE)
data2 <- data.table(Address2, Year2, AREA_CODE)
data2[, unique_id := sprintf("%06d", 1:nrow(data2))]
CODE
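library(stringdist)  # provides stringdistmatrix()

fn.fuzzymatch <- function(dat1, dat2, string1, string2, meth) {
  # Pairwise distances: rows index dat1, columns index dat2
  dist.name <- stringdistmatrix(dat1[[string1]], dat2[[string2]], method = meth)
  # Smallest distance in each row
  min.name <- apply(dist.name, 1, min)
  match.s1.s2 <- NULL
  for (i in seq_len(nrow(dist.name))) {
    # Column of the first candidate that attains the minimum for row i
    s2.i <- match(min.name[i], dist.name[i, ])
    s1.i <- i
    match.s1.s2 <- rbind(data.frame(s1_row = s1.i, s2_row = s2.i, s1name = dat1[s1.i, ][[string1]], s2name = dat2[s2.i, ][[string2]], dist = min.name[i]), match.s1.s2)
  }
  # Rows were prepended in the loop, so restore the order of dat1
  output <- match.s1.s2[order(match.s1.s2$s1_row), ]
  return(output)
}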
match_50 <- fn.fuzzymatch(data1,data2,"Address1","Address2","dl")
This works fine for the country-level data, but I also have multiple data files at region level, and each region has multiple areas. The area code for each region is available in the AREACODE variable in data1 and the AREA_CODE variable in data2. I want to update my function so that the matching is repeated for each area code.
I tried using split to convert the data files into lists and then rbindlist to combine the results, but I could not get it to work and kept getting different kinds of errors. I am sure there is a way to do this, but I have not found it. Any suggestions would be appreciated.
While you can probably use an apply function to repeat over the separate data files of the different regions, here is a fuzzyjoin solution based on my answer to your previous question.
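For comparison, the apply-style route could look roughly like the sketch below. It is only an illustration under assumptions: the data are a shortened version of the question's sample, `match_one_area` is a hypothetical helper name, and ties are resolved by taking the first closest candidate.

```r
library(data.table)
library(stringdist)

# Toy data, shortened from the question's sample
data1 <- data.table(
  Address1 = c("786, GALI NO 5, XYZ", "45-B, GALI NO5, XYZ", "HECTIC, 99 STREET, PQR"),
  AREACODE = c("10", "20", "14")
)
data2 <- data.table(
  Address2  = c("abc, pqr, xyz", "786, GALI NO 4 XYZ", "45B, GALI NO 5, XYZ", "23/4, STREET 2, PQR"),
  AREA_CODE = c("10", "10", "20", "30")
)

# Hypothetical helper: best Address2 match for every Address1 within one area code
match_one_area <- function(code, meth = "dl") {
  d1 <- data1[AREACODE == code]
  d2 <- data2[AREA_CODE == code]
  if (nrow(d2) == 0L) {
    # No candidates in this area: keep the rows, leave the match empty
    return(data.table(AREACODE = code, Address1 = d1$Address1,
                      Address2 = NA_character_, dist = NA_real_))
  }
  # Rows index d1, columns index d2
  d <- stringdistmatrix(d1$Address1, d2$Address2, method = meth)
  # Column of the first closest candidate for each row
  best <- apply(d, 1, which.min)
  data.table(AREACODE = code, Address1 = d1$Address1,
             Address2 = d2$Address2[best],
             dist = d[cbind(seq_len(nrow(d)), best)])
}

# Repeat the matching per area code and stack the pieces
matches <- rbindlist(lapply(unique(data1$AREACODE), match_one_area))
print(matches)
```

Because each area is filtered before the distance matrix is built, the matrices stay small, which may matter for large files. Areas present in data1 but absent from data2 come back with NA matches rather than being dropped.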
It looks for the best stringdist match for the address, and the area code must match exactly (==). I also specified that Year2 had to be >= Year1, just for demonstration.
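To see how one match function per column pair behaves in isolation, here is a minimal sketch on invented toy tables (the column names `code`, `year`, `code2`, `year2` and all values are assumptions for illustration, not part of the question's data). It uses only the exact and ordering conditions, leaving out the string distance.

```r
library(fuzzyjoin)

# Invented toy tables
x <- data.frame(code = c("10", "20"), year = c(2001, 2005))
y <- data.frame(code2 = c("10", "10", "20"), year2 = c(2000, 2003, 2006))

# One match function per column pair, in the same order as `by`:
# code must equal code2, and year must be <= year2
res <- fuzzy_join(x, y,
  by = c("code" = "code2", "year" = "year2"),
  match_fun = list(`==`, `<=`),
  mode = "inner"
)
print(res)
```

A pair of rows is kept only when every column pair passes its own function, so the row with year2 = 2000 is rejected even though its code matches.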
Finally, I used dplyr::group_by and dplyr::top_n to get the minimum-distance matches, and I had to assume what to do with ties (I picked the match with the largest Year2). You can also use slice_min, which replaces the older top_n, and if the original order is important and not alphabetical, use mutate(rank = row_number(dist)) %>% filter(rank == 1).
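As a rough sketch of the slice_min variant, the snippet below works on a small invented table that stands in for the joined candidates (the `candidates` name and all values are hypothetical). It assumes dplyr 1.0.0 or later, where slice_min and slice_max are available.

```r
library(dplyr)

# Invented stand-in for the joined candidates
candidates <- tibble(
  Address1      = c("A", "A", "A", "B", "B"),
  Address1.dist = c(2, 2, 7, 5, 3),
  Year2         = c(2002, 2007, 2003, 2004, 2009)
)

best <- candidates %>%
  group_by(Address1) %>%
  # keep every candidate at the minimum distance
  slice_min(Address1.dist, n = 1, with_ties = TRUE) %>%
  # break remaining ties by the largest Year2
  slice_max(Year2, n = 1, with_ties = FALSE) %>%
  ungroup()
print(best)
```

Setting with_ties = FALSE in the second step guarantees one row per group, which top_n does not do on its own when values are tied.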
Data:
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR")
AREACODE <- c('10','10','14','20','30')
Year1 <- c(2001:2005)
Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)
AREA_CODE <- c('10','10','10','20','30','40','50','61','64', '99')
library(data.table)  # provides data.table() and :=
data1 <- data.table(Address1, Year1, AREACODE)
data2 <- data.table(Address2, Year2, AREA_CODE)
data2[, unique_id := sprintf("%06d", 1:nrow(data2))]
Solution:
library(fuzzyjoin, quietly = TRUE); library(dplyr, quietly = TRUE)

# First, need to define match_fun_stringdist
# Code from stringdist_join from https://github.com/dgrtwo/fuzzyjoin
match_fun_stringdist <- function(v1, v2) {
  # Can't pass these parameters in from fuzzy_join because of multiple incompatible match_funs, so I set them here.
  ignore_case <- FALSE
  method <- "dl"
  max_dist <- 99
  distance_col <- "dist"

  if (ignore_case) {
    v1 <- stringr::str_to_lower(v1)
    v2 <- stringr::str_to_lower(v2)
  }

  # shortcut for Levenshtein-like methods: if the difference in
  # string length is greater than the maximum string distance, the
  # edit distance must be at least that large
  # length is much faster to compute than string distance
  if (method %in% c("osa", "lv", "dl")) {
    length_diff <- abs(stringr::str_length(v1) - stringr::str_length(v2))
    include <- length_diff <= max_dist
    dists <- rep(NA, length(v1))
    dists[include] <- stringdist::stringdist(v1[include], v2[include], method = method)
  } else {
    # have to compute them all
    dists <- stringdist::stringdist(v1, v2, method = method)
  }

  # tibble() replaces the deprecated data_frame()
  ret <- dplyr::tibble(include = (dists <= max_dist))
  if (!is.null(distance_col)) {
    ret[[distance_col]] <- dists
  }
  ret
}

# Finally, call fuzzy_join
fuzzy_join(data1, data2,
  by = list(x = c("Address1", "AREACODE", "Year1"), y = c("Address2", "AREA_CODE", "Year2")),
  match_fun = list(match_fun_stringdist, `==`, `<=`),
  mode = "left"
) %>%
  group_by(Address1, Year1, AREACODE) %>%
  top_n(1, -Address1.dist) %>%  # smallest string distance
  top_n(1, Year2) %>%           # ties broken by largest Year2
  select(unique_id, Address1.dist, everything())