
R: Relevant matching between two huge data sets, even with spelling mistakes

I have the following input:

"I am travelling on my own, I have just brought a world ticket to go to singapore, darwin, perth, adelaide, melbourne, brisbane, gold cost, sydney Opra, christchurch,gold coast Richland, Aukland,Austrlia, and fji. It is a 10 month journey. I will be going on my own, I am not scared but my friends and family seem to be against the idea. I have explained that it is safe and that I will probably meet people along the way and hostels are not as bad as theya re made out to be. for at least a 1/3 of my trip i will be staying with friends and family. I am excited, but people pesimistic views are making me doubt the safety. I am from the UK so will be a long way from home, and they are scared incase I get into trouble. I have never been to US"

I have a places list with as many as 5000 rows, for example: London, Singapore, Sydney, Aukland, Fiji, Gold Coast, Sydney Opera, Australia, UK, USA, ...

Problem: Extract the places from the input by matching against the places list. The matching has to tolerate spelling mistakes and return the closest match, and it needs to be optimized.

Output: Singapore|Darwin|perth|adelaide|melbourne|brisbane|gold coast|sydney Opera|christchurch|Aukland|Austrlia|fiji|UK|USA

Tried Methods

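library(RecordLinkage)  # provides levenshteinSim
library(stringdist)
library(cba)            # provides sdists (not loaded in the original snippet)
# Lower-case the input and replace punctuation with spaces
input <- tolower(gsub("[[:punct:]]", " ", input))
Places <- read.delim("\\Data\\Places_List.csv", row.names = NULL, header = TRUE, sep = ",")
Places <- as.matrix(Places)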
################## Different methods tried ##########################
ClosestMatch2 <- function(string, stringVector) {
  # Levenshtein similarity (0 to 1) between `string` and every candidate
  distance <- levenshteinSim(string, stringVector)
  # Return the candidate(s) with the highest similarity
  stringVector[distance == max(distance)]
}
ClosestMatch2(input, Places)
############### The above one doesn't work ##################
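ClosestMatch <- function(string, StringVector) {
  # Approximate matches of `string` among the candidates
  matches <- agrep(string, StringVector, value = TRUE)
  # Weighted edit distance to each match; "ow" is assumed here because
  # the method argument was empty in the original snippet
  distance <- sdists(string, matches, method = "ow", weight = c(1, 0, 2))
  matches <- data.frame(matches, distance = as.numeric(distance))
  # Keep only the match(es) at minimum distance
  matches <- subset(matches, distance == min(distance))
  as.character(matches$matches)
}
ClosestMatch(input, Places)
######## This works, but does not give proper results ###########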
k <- as.matrix(sapply(input, agrep, Places))
###### This didn't work either
agrep, pmatch and str_detect (which won't handle spelling mistakes) don't work for bigger data sets.
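One likely reason the attempts above struggle is that they compare the whole input text against each place name, so the distance is dominated by the length of the text. A minimal sketch of an alternative, assuming it is acceptable to split the input into word n-grams and compare each n-gram with each place using base R's adist: the helper names make_ngrams and match_places, and the max_rel tolerance (allowed edits as a fraction of the place name's length), are hypothetical and not from the original post.

```r
# Sketch only: make_ngrams(), match_places() and max_rel are hypothetical
# names introduced for illustration; they are not from the original post.

# Split text into unique lower-case n-grams of 1..n_max words.
make_ngrams <- function(text, n_max = 3) {
  words <- strsplit(tolower(gsub("[[:punct:]]", " ", text)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  grams <- character(0)
  for (n in seq_len(n_max)) {
    if (length(words) < n) break
    starts <- seq_len(length(words) - n + 1)
    grams <- c(grams, vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    ))
  }
  unique(grams)
}

# Return every place that some n-gram matches within the allowed edits.
match_places <- function(text, places, max_rel = 0.25) {
  grams <- make_ngrams(text)
  ref <- tolower(places)
  d <- utils::adist(grams, ref)            # rows = n-grams, columns = places
  limit <- floor(nchar(ref) * max_rel)     # allowed edits per place name
  hit <- sweep(d, 2, limit, "<=")          # TRUE where an n-gram is close enough
  places[colSums(hit) > 0]
}

places <- c("London", "Singapore", "Fiji", "Gold Coast", "Australia", "UK")
match_places("I am going to singapore, gold cost, Austrlia and fji. I am from the UK", places)
```

Scaling the tolerance by name length means that very short names such as UK only match exactly, while longer names tolerate one or two typos. With a 5000-row list the distance matrix stays small, because the number of n-grams grows with the length of the text rather than with the list.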

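ClosestMatch2 comes closest to working, except that it still needs to take the difference in character count and partial substring matches into account in order to match spelling mistakes.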
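The difference in character count can also serve as a cheap filter, since the edit distance between two strings is never smaller than the difference of their lengths. The following is a rough sketch under that assumption; close_candidates and max_rel are hypothetical names, and both arguments are expected to be lower-case already.

```r
# Sketch only: close_candidates() and max_rel are hypothetical names.
# `gram` is a single lower-case string, `ref` a lower-case character vector.
close_candidates <- function(gram, ref, max_rel = 0.25) {
  limit <- floor(nchar(ref) * max_rel)                 # allowed edits per name
  # Cheap filter: a name whose length differs by more than its limit
  # cannot be within that limit in edit distance.
  cand <- which(abs(nchar(ref) - nchar(gram)) <= limit)
  if (length(cand) == 0) return(character(0))
  d <- as.vector(utils::adist(gram, ref[cand]))        # full distance on survivors only
  ref[cand][d <= limit[cand]]
}

close_candidates("austrlia", c("london", "australia", "uk"))
close_candidates("us", c("uk", "usa"))
```

Only the names that pass the length check go through the more expensive edit-distance computation, which is where most of the saving on a long places list would come from.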

