R fuzzy string match to return specific column based on matched string

I have two large datasets, one with around half a million records and the other with around 70K. Both contain addresses. I want to check whether any address in the smaller dataset is present in the larger one. As you would imagine, an address can be written in different ways, with different cases, spellings and so on. In addition, addresses can be duplicated when they are written only down to the building level, so different flats share the same address. I did some research and found the stringdist package, which can be used for this.

I did some work and managed to get the closest match based on distance. However, I am not able to return the corresponding columns for the matched address.

Below is some dummy sample data, along with the code I have written, to explain the situation:

library(stringdist)
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr")
Year1 <- c(2001:2007)

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)

df1 <- data.table(Address1,Year1)
df2 <- data.table(Address2,Year2)
df2[,unique_id := sprintf("%06d", 1:nrow(df2))]

fn_match = function(str, strVec, n){
  strVec[amatch(str, strVec, method = "dl", maxDist=n,useBytes = T)]
}

df1[!is.na(Address1)
    , address_match := 
      fn_match(Address1, df2$Address2,3)
    ]

This returns the closest string match within a distance of 3. However, I also want the "Year2" and "unique_id" columns from df2 in df1. This would tell me which row of df2 each string was matched to. So, for each row in df1, I want to know the closest match from df2 within the specified distance and, for the matching rows, the corresponding "Year2" and "unique_id" from df2.

I guess this has something to do with a merge (left join), but I am not sure how to merge while keeping the duplicates and ensuring that I end up with the same number of rows as in df1 (the small dataset).

Any kind of solution would help!

You are 90% of the way there...

You say you want to

know with which row of data the string was matched from df2

You just need to understand the code you already have. See ?amatch:

amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned.

In other words, amatch gives you the index of the row in df2 (which is your table) that is the closest match for each address in df1 (which is your x). You are throwing this index away too early by immediately using it to return the matched address.

Instead, keep either the index itself for a lookup, or the unique_id (if you are confident that it is truly a unique id) for a left join.
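To make the index behaviour concrete, here is a minimal sketch using a few of the addresses from the question. The names tbl, x, pos and matched are purely illustrative and do not appear in the original code.

```r
library(stringdist)

# Lookup table: a subset of the Address2 values from the question
tbl <- c("abc, pqr, xyz", "786, GALI NO 4 XYZ", "45B, GALI NO 5, XYZ")
# Strings to look up: two of the Address1 values from the question
x <- c("786, GALI NO 5, XYZ", "HECTIC, 99 STREET, PQR")

# amatch() returns positions into tbl; NA means nothing was within maxDist
pos <- amatch(x, tbl, method = "dl", maxDist = 3)

# The position can be used to pull the matched string (or any other
# column of the same row) afterwards
matched <- tbl[pos]
print(data.frame(x, pos, matched))
```

Because pos is a plain row index, the same vector can be used to look up any other column that sits alongside the table values.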

Illustration of both approaches:

library(data.table) # you forgot this in your example
library(stringdist)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
                  Year1 = 2001:2007) # already a vector, no need to combine
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
                  Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)] # use .I, it's neater

# Return the position in strVec of the closest match to each element of str
match_pos = function(str, strVec, n){
  amatch(str, strVec, method = "dl", maxDist=n, useBytes = TRUE) # are you sure you want useBytes = TRUE?
}

# Option 1: use unique_id as a key for left join
df1[!is.na(Address1) & nchar(Address1) > 0, # I would exclude not only NA_character_ but also the empty string, perhaps any string of length < 3
    unique_id := df2$unique_id[match_pos(Address1, df2$Address2,3)] ]
merge(df1, df2, by='unique_id', all.x=TRUE) # see ?merge for more options

# Option 2: use the row index
df1[!is.na(Address1) & nchar(Address1) > 0,
    df2_pos := match_pos(Address1, df2$Address2,3) ]
df1[!is.na(df2_pos), (c('Address2','Year2','UniqueID')):=df2[df2_pos,.(Address2,Year2,unique_id)] ][]
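The left join in Option 1 keeps the row count of df1 only if the key is unique in df2. Below is a small sketch of how one might check that. The tables small and big are hypothetical stand-ins rather than the question's data, and sort = FALSE is used on the assumption that you want the original row order preserved.

```r
library(data.table)

# Hypothetical stand-ins for df1 (after matching) and df2
small <- data.table(Address1  = c("a", "b", "a"),
                    unique_id = c("000002", NA, "000002"))
big   <- data.table(unique_id = c("000001", "000002"),
                    Year2     = 2001:2002)

# A left join can only multiply rows if the key is duplicated on the right
stopifnot(!anyDuplicated(big$unique_id))

# all.x = TRUE keeps unmatched rows of `small` (filled with NA);
# sort = FALSE leaves the rows in the order of `small`
out <- merge(small, big, by = "unique_id", all.x = TRUE, sort = FALSE)

print(out)
print(nrow(out) == nrow(small))
```

Duplicated addresses in df1 are not a problem here: each row is looked up independently, so repeated rows simply receive the same match.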

Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and offers stringdist as one of the possible types of fuzzy matching.

You can use the stringdist method = "dl" (or another method that might work better for your data).
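As a rough way to judge which method suits your data, you can compare a few of them on a known pair. This is only a sketch: the pair below is taken from the question's sample data, while the choice of methods to compare is my own assumption. Note that "jw" is on a different scale from the edit-based distances, so max_dist would need adjusting accordingly.

```r
library(stringdist)

a <- "45-B, GALI NO5, XYZ"
b <- "45B, GALI NO 5, XYZ"

# Edit-based distances count edits; "jw" (Jaro-Winkler) is scaled to [0, 1]
methods <- c("lv", "osa", "dl", "jw")
d <- sapply(methods, function(m) stringdist(a, b, method = m))
print(d)

# All of these are case-sensitive, so normalising case first may help
raw_case   <- stringdist("pqr", "PQR", method = "dl")
upper_case <- stringdist(toupper("pqr"), "PQR", method = "dl")
print(c(raw = raw_case, upper = upper_case))
```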

To meet your requirement of "ensuring that I have same number of rows as in df1", I used a large max_dist and then used dplyr::group_by and dplyr::top_n to keep only the best match, i.e. the one with the minimum distance. This was suggested by dgrtwo, the developer of fuzzyjoin. (Hopefully it'll be part of the package itself in the future.)

(I also had to make an assumption: in the event of distance ties, take the maximum Year2.)

Code:

library(data.table, quietly = TRUE)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
                  Year1 = 2001:2007) 
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
                  Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)]

library(fuzzyjoin, quietly = TRUE); library(dplyr, quietly = TRUE)
stringdist_join(df1, df2, 
                by = c("Address1" = "Address2"), 
                mode = "left", 
                method = "dl", 
                max_dist = 99, 
                distance_col = "dist") %>%
  group_by(Address1, Year1) %>%
  top_n(1, -dist) %>%
  top_n(1, Year2)

Result:

# A tibble: 7 x 6
# Groups:   Address1, Year1 [7]
                                Address1 Year1                             Address2 Year2 unique_id  dist
                                   <chr> <int>                                <chr> <int>     <chr> <dbl>
1                    786, GALI NO 5, XYZ  2001                   786, GALI NO 4 XYZ  2007    000007     2
2       rambo, 45, strret 4, atlast, pqr  2002 del, 546, strret2, towards east, pqr  2009    000009    17
3 23/4, 23RD FLOOR, STREET 2, ABC-E, PQR  2003                  23/4, STREET 2, PQR  2010    000010    19
4                    45-B, GALI NO5, XYZ  2004                  45B, GALI NO 5, XYZ  2008    000008     2
5                 HECTIC, 99 STREET, PQR  2005                  23/4, STREET 2, PQR  2010    000010    11
6                    786, GALI NO 5, XYZ  2006                   786, GALI NO 4 XYZ  2007    000007     2
7       rambo, 45, strret 4, atlast, pqr  2007 del, 546, strret2, towards east, pqr  2009    000009    17
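Note that top_n keeps ties, so the pipeline above could in principle return more than one row per df1 row if both dist and Year2 tie. One possible way to guarantee exactly one row per group is sketched below, assuming dplyr 1.0.0 or later. The cand table is hypothetical and only mimics the shape of the join output.

```r
library(dplyr)

# Hypothetical join output: two candidate matches for each of two df1 rows
cand <- tibble(
  Address1 = c("A", "A", "B", "B"),
  Year1    = c(2001L, 2001L, 2002L, 2002L),
  Year2    = c(2002L, 2007L, 2003L, 2008L),
  dist     = c(2, 2, 5, 1)
)

best <- cand %>%
  group_by(Address1, Year1) %>%
  # smallest distance first; among equal distances, the latest Year2 first
  arrange(dist, desc(Year2), .by_group = TRUE) %>%
  # keep exactly one row per group, even if every column ties
  slice_head(n = 1) %>%
  ungroup()

print(best)
```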
