
Create a unique ID by fuzzy matching of names (via agrep using R)

Using R, I am trying to match people's names in a dataset structured by year and city. Due to some spelling mistakes, exact matching is not possible, so I am trying to use agrep() to fuzzy match the names.

A sample chunk of the dataset is structured as follows:

df <- data.frame(matrix(
  c("1200013","1200013","1200013","1200013","1200013","1200013","1200013","1200013",
    "1996","1996","1996","1996","2000","2000","2004","2004",
    "AGUSTINHO FORTUNATO FILHO","ANTONIO PEREIRA NETO","FERNANDO JOSE DA COSTA",
    "PAULO CEZAR FERREIRA DE ARAUJO","PAULO CESAR FERREIRA DE ARAUJO",
    "SEBASTIAO BOCALOM RODRIGUES","JOAO DE ALMEIDA","PAULO CESAR FERREIRA DE ARAUJO"),
  ncol = 3,
  dimnames = list(seq_len(8), c("citycode", "year", "candidate"))
))

The neat version:

  citycode year                      candidate
1  1200013 1996      AGUSTINHO FORTUNATO FILHO
2  1200013 1996           ANTONIO PEREIRA NETO
3  1200013 1996         FERNANDO JOSE DA COSTA
4  1200013 1996 PAULO CEZAR FERREIRA DE ARAUJO
5  1200013 2000 PAULO CESAR FERREIRA DE ARAUJO
6  1200013 2000    SEBASTIAO BOCALOM RODRIGUES
7  1200013 2004                JOAO DE ALMEIDA
8  1200013 2004 PAULO CESAR FERREIRA DE ARAUJO

I'd like to check, for each city separately, whether there are candidates appearing in several years. E.g. in the example,

PAULO CEZAR FERREIRA DE ARAUJO

PAULO CESAR FERREIRA DE ARAUJO

appears twice (with a spelling mistake). Each candidate across the entire dataset should be assigned a unique numeric candidate ID. The dataset is fairly large (5,500 cities, approx. 100K entries), so reasonably efficient code would be helpful. Any suggestions as to how to implement this?

EDIT: Here is my attempt (with help from the comments so far), which is very slow (inefficient) at achieving the task at hand. Any suggestions for improving it?

f <- function(x) {
  # for each level, find all levels that fuzzy-match it
  matches <- lapply(levels(x), agrep, x = levels(x), fixed = TRUE, value = FALSE)
  # relabel every level with the first level it matches, collapsing near-duplicates
  levels(x) <- levels(x)[unlist(lapply(matches, function(x) x[1]))]
  x
}

temp <- tapply(df$candidate, df$citycode, f, simplify=TRUE)
df$candidatenew <- unlist(temp)
df$spellerror <- ifelse(as.character(df$candidate)==as.character(df$candidatenew), 0, 1)

EDIT 2: Now running at a good speed. The problem was that every step compared against all factor levels in the whole dataset (thanks for pointing that out, Blue Magister). Restricting the comparison to the candidates within one group (i.e. one city) runs the command in 5 seconds for 80,000 lines - a speed I can live with.

df$candidate <- as.character(df$candidate)

f <- function(x) {
  # convert to a factor within the group so levels() holds only this group's names
  x <- as.factor(x)
  matches <- lapply(levels(x), agrep, x = levels(x), fixed = TRUE, value = FALSE)
  levels(x) <- levels(x)[unlist(lapply(matches, function(x) x[1]))]
  as.character(x)
}

temp <- tapply(df$candidate, df$citycode, f, simplify=TRUE)
df$candidatenew <- unlist(temp)
df$spellerror <- ifelse(as.character(df$candidate)==as.character(df$candidatenew), 0, 1)
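
For the unique numeric candidate ID asked for above, one simple follow-up (my own addition, not part of the original post; the column name candidateid is made up) would be to convert the corrected names into integer codes, so that identical corrected spellings share one ID across the whole dataset:

df$candidateid <- as.integer(factor(df$candidatenew))  # same corrected spelling -> same ID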

Here's my shot at it. It's probably not very efficient, but I think it will get the job done. I assume that df$candidate is of class factor.

#fuzzy matches candidate names to other candidate names
#compares each pair of names only once
##by looking at names that have a greater index
matches <- unlist(lapply(1:(length(levels(df[["candidate"]]))-1),
    function(x) {max(x,x + agrep(
        pattern=levels(df[["candidate"]])[x], 
        x=levels(df[["candidate"]])[-seq_len(x)]
    ))}
))
#assigns new levels (omits the last level because that doesn't change)
levels(df[["candidate"]])[-length(levels(df[["candidate"]]))] <- 
    levels(df[["candidate"]])[matches]

OK, given that the focus is on efficiency, I'd suggest the following.

First, note that from first principles we can predict that, in order of efficiency, exact matching will be much faster than grep, which will be faster than fuzzy grep. So do an exact match first, then fuzzy grep only for the remaining observations.
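
As a rough sketch of that idea (my own illustration, untested on the full data; nms, canon and the canonical column are made-up names): exact duplicates collapse for free via unique(), and agrep() is then only run over the much smaller set of distinct names.

# exact matching first: unique() collapses identical spellings
nms   <- unique(as.character(df$candidate))
canon <- nms
# fuzzy matching only over the remaining distinct names
for (i in seq_along(nms)) {
  if (canon[i] != nms[i]) next                        # already folded into an earlier name
  hits <- i + agrep(nms[i], nms[-seq_len(i)], fixed = TRUE)
  canon[hits] <- nms[i]                               # map close spellings to the first one seen
}
df$canonical <- canon[match(as.character(df$candidate), nms)]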

Second, vectorize and avoid loops. The apply commands aren't necessarily faster, so stick to native vectorization if you can. All the grep commands are natively vectorized over x, but it is hard to avoid a *apply call or loop when comparing each element to the vector of all the other elements it might match.

Third, use outside information to narrow the problem down. For instance, do fuzzy matching on names only within each city or state, which will dramatically reduce the number of comparisons that must be made.

You can combine the first and third principles: you might even try exact matching on the first character of each string, then fuzzy matching within that block, as in the sketch below.
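
A possible sketch of that combination (again my own illustration; key, blocks, fold, candidatenew and candidateid are made-up names): split the names by citycode and by the first letter, so agrep() only ever compares names that already share a city and an initial.

df$candidate <- as.character(df$candidate)
# blocking key: city plus first letter of the name
key    <- list(df$citycode, substr(df$candidate, 1, 1))
blocks <- split(df$candidate, key, drop = TRUE)

fold <- function(nms) {
  u <- unique(nms)                       # exact matching within the block
  canon <- u
  for (i in seq_along(u)) {
    if (canon[i] != u[i]) next
    hits <- i + agrep(u[i], u[-seq_len(i)], fixed = TRUE)
    canon[hits] <- u[i]
  }
  canon[match(nms, u)]                   # canonical spelling for every row in the block
}

df$candidatenew <- unsplit(lapply(blocks, fold), key, drop = TRUE)
df$candidateid  <- as.integer(factor(df$candidatenew))   # one numeric ID per corrected name

The trade-off is that a typo in the very first letter lands two spellings in different blocks, so this is a speed-versus-recall compromise.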
