In R, how do I use fuzzy matching to search for multiple patterns?

I have a survey dataset in which respondents described the location of their activity, usually as a town or city name. I want to identify each mention of the named cities and count the number of times each city was mentioned. The final output should be a vector with the number of times each city was mentioned. One challenge is that city names may be misspelled, have odd capitalization, or be embedded within a longer string (which may also include more than one city). I have a master list of city names with proper capitalization and spelling, which I have been trying to use as the patterns for the agrep function.

A sample chunk of the dataset is structured as follows:

survey <- c("Salem", "salem, ma","Manchester","Manchester-By-The-Sea")
master <- c("Beverly","Gloucester","Manchester-by-the-Sea","Nahant","Salem")

In this sample, the final result would be a vector:

result
[1] 0 0 2 0 2

I have been trying to construct a function using agrep that loops through the master vector, searches the survey vector for matches to each item, counts the matches, and then outputs the number of matches for each item of the master vector. Here is what I have so far, but all I get is NULL. I am not sure what I am doing wrong, or whether there is a better way to approach this problem.

idx <- NULL
matches <- NULL
n.match <- function(pattern, x, ...) {
for (i in 1:length(pattern))
   idx <- vector()
   idx <- agrep(pattern[i],x,ignore.case=TRUE, value=FALSE, max.distance = 2)
   matches[i] <- length(idx)
}
n.match(master,survey)
matches

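The main problem is that you are missing a block {} around the body of your for loop. Without the braces, only the first statement, idx <- vector(), belongs to the loop, so you are really only initializing idx 5 times and leaving i set at 5; the agrep call and the assignment to matches then run just once. In addition, matches[i] <- ... inside the function modifies a local copy, so the global matches stays NULL. There is also no reason to define variables outside the function when they are only needed inside it. How about: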

survey <- c("Salem", "salem, ma","Manchester","Manchester-By-The-Sea")
master <- c("Beverly","Gloucester","Manchester-by-the-Sea","Nahant","Salem")

n.match <- function(pattern, x, ...) {
    matches <- numeric(length(pattern))
    for (i in 1:length(pattern)) {
       idx <- agrep(pattern[i],x,ignore.case=TRUE, max.distance = 2)
       matches[i] <- length(idx)
    }
    matches       
}
n.match(master,survey)
# [1] 0 0 1 0 2
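The same counting can also be written without an explicit loop. The following is only a sketch of that idea: count_mentions is a hypothetical name, the data are the sample vectors from the question, and max.distance = 2 is simply copied from the loop version rather than tuned.

```r
survey <- c("Salem", "salem, ma", "Manchester", "Manchester-By-The-Sea")
master <- c("Beverly", "Gloucester", "Manchester-by-the-Sea", "Nahant", "Salem")

# count_mentions is a hypothetical helper name, not part of base R.
# For each pattern, agrep() returns the indices of the responses that
# approximately match it; length() turns those indices into a count.
count_mentions <- function(patterns, responses, max.distance = 2) {
  vapply(
    patterns,
    function(p) {
      length(agrep(p, responses, ignore.case = TRUE,
                   max.distance = max.distance))
    },
    integer(1)
  )
}

# Named integer vector: one count per city in the master list.
mention_counts <- count_mentions(master, survey)
print(mention_counts)
```

Because vapply keeps the patterns as names, each count stays attached to its city, and unname(mention_counts) should give the same five numbers as the loop version.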

Here I also played with max.distance= to make it a proportion rather than an absolute number. However, it still looks like "Manchester" is too different from "Manchester-by-the-Sea" in terms of the number of deletions required to get them to match. You may consider down-weighting deletions.
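To see how large that gap is, the edit distance can be inspected directly. This sketch uses adist from base R (a different function from agrep, swapped in here only for illustration) to count the edits needed to turn the master entry into the short response, and then recomputes the cost with deletions weighted at 0.1. That weight is an arbitrary assumption, not a recommended value.

```r
# Lower-case both strings so capitalization does not count as an edit.
pattern  <- tolower("Manchester-by-the-Sea")
response <- tolower("Manchester")

# Unit costs: each insertion, deletion and substitution costs 1.
d <- adist(pattern, response, counts = TRUE)
unit_cost   <- d[1, 1]
edit_counts <- attr(d, "counts")[1, 1, ]  # named vector: ins, del, sub

# Hypothetical weighting (an assumption): deletions cost a tenth of other edits.
weighted_cost <- adist(
  pattern, response,
  costs = c(insertions = 1, deletions = 0.1, substitutions = 1)
)[1, 1]

print(unit_cost)
print(edit_counts)
print(weighted_cost)
```

With unit costs, every character of "-by-the-sea" is a separate deletion, which is far beyond a max.distance of 2. Cheap deletions bring the cost down, but they also make short patterns easier to match by accident, so any threshold would need checking against the real survey responses.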
