Extract approximate key terms (fuzzy) from sentence in dataframe. R
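My task is to extract specific words (the first word of a species name) from the titles of journal articles. Here is a reproducible version of my dataset: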
df <- data.frame(article_title = c("I like chickens and how to find chickens",
"A Horse hootio is going to the rainbow",
"A Cat caticus is eating cheese",
"A Dog dogigo runs over a car",
"A Hippa potamus is in the sauna", # contains misspelling
"Mos musculus found on a boat", # contains misspelling
"A sentence not related to animals"))
The key words I want to extract are as follows (wrapped in regex word boundaries):
words_to_match <- c('\\bchicken\\b', '\\bhorse\\b', '\\bcat\\b',
'\\bdog\\b',
'\\bhippo\\b', # hippo
'\\bmus\\b', # mus
'\\banimals\\b')
The problem arises when I run this:
library(dplyr) # provides %>% and mutate()
df %>%
  dplyr::mutate(matched_word = stringr::str_extract_all(string = article_title,
    pattern = stringr::regex(paste(words_to_match, collapse = '|'), ignore_case = TRUE)))
Problem: some titles contain misspellings that go undetected.
  article_title                           matched_word
1 Chicken chook finds a pearl             Chicken
2 A Horse hootio is going to the rainbow  Horse
3 A Cat caticus is eating cheese          Cat
4 A Dog dogigo runs over a car            Dog
5 A Hippa potamus is in the sauna
6 Mos musculus found on a boat
7 A sentence not related to animals       animals
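What I would like to do is find a way to create another column that tells me whether there is a possible match with any of my words_to_match, and perhaps also the % match (Levenshtein distance).
Maybe something like this: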
  article_title                           matched_word  %
1 Chicken chook finds a pearl             Chicken       100
2 A Horse hootio is going to the rainbow  Horse         100
3 A Cat caticus is eating cheese          Cat           100
4 A Dog dogigo runs over a car            Dog           100
5 A Hippa potamus is in the sauna         Hippo         XX
6 Mos musculus found on a boat            Mus           XX
7 A sentence not related to animals       animals       100
Any suggestions would be appreciated, even if they do not use R.
You can use adist to find approximate matches:
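# Distance matrix: one row per pattern, one column per title.
x <- adist(words_to_match, df$article_title, fixed=FALSE, ignore.case = TRUE)
# For each title (column), the index of the closest pattern.
i <- apply(x, 2, which.min)
df$matched_word <- words_to_match[i]
# The distance of that best match, taken from the title's column.
df$adist <- mapply("[", asplit(x, 2), i)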
df
# article_title matched_word adist
#1 I like chickens and how to find chickens \\bchicken\\b 2
#2 A Horse hootio is going to the rainbow \\bhorse\\b 0
#3 A Cat caticus is eating cheese \\bcat\\b 0
#4 A Dog dogigo runs over a car \\bdog\\b 0
#5 A Hippa potamus is in the sauna \\bhippo\\b 1
#6 Mos musculus found on a boat \\bmus\\b 1
#7 A sentence not related to animals \\banimals\\b 0
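The matched_word column above still carries the \\b wrappers, and every title gets a best match whether or not it is exact. A minimal sketch of one way to tidy that up, assuming a small hand-typed subset of the result (the data frame res and the column exact are illustrative names, and the distances are copied from the output above rather than recomputed):

```r
# Illustrative subset of the result shown above; distances are copied
# from that output, not recomputed here.
res <- data.frame(matched_word = c("\\bchicken\\b", "\\bhorse\\b", "\\bhippo\\b"),
                  adist = c(2, 0, 1))

# Remove the literal "\b" wrappers so the column holds plain words.
# fixed = TRUE makes gsub() treat "\\b" as the two characters \ and b.
res$matched_word <- gsub("\\b", "", res$matched_word, fixed = TRUE)

# Hypothetical flag: TRUE for an exact hit, FALSE for a fuzzy one.
res$exact <- res$adist == 0

res
```

Rows with exact == FALSE are the ones worth checking by hand, since a small distance does not guarantee that the title really refers to that species.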
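You can put the plain words into a vector wm and strsplit each sentence. Then use adist inside lapply to get the distance matrix from each word of the sentence to each element of wm. The minimum should give you the best match. That said, I am not sure about your rationale for expressing the Levenshtein distance (LD) as a percentage.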
wm <- c("chicken", "horse", "cat", "dog", "hippo", "mus", "animals")
dl <- strsplit(df$article_title, " ")
res <- do.call(rbind, lapply(dl, function(x) {
e <- adist(tolower(x), wm)
mins <- apply(e, 2, min)
emin <- which.min(mins)
data.frame(matched_word=wm[emin], LD=mins[emin])
}))
res
# matched_word LD
# 1 chicken 1
# 2 horse 0
# 3 cat 0
# 4 dog 0
# 5 hippo 1
# 6 mus 1
# 7 animals 0
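If a percentage is still wanted, one common convention (an assumption here, not something the question or the answers define) is to divide the LD by the length of the longer string and subtract from 1. A rough sketch, where ld_percent is a hypothetical helper name:

```r
# Sketch: convert a Levenshtein distance into a similarity percentage.
# Assumption: normalise by the length of the longer of the two strings.
ld_percent <- function(a, b) {
  a <- tolower(a)
  b <- tolower(b)
  # Pairwise distance between a[k] and b[k].
  ld <- mapply(function(x, y) adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  100 * (1 - ld / pmax(nchar(a), nchar(b)))
}

# Title words from the data compared with their best-matching key words.
round(ld_percent(c("Horse", "Hippa", "Mos"), c("horse", "hippo", "mus")), 1)
# 100.0 80.0 66.7
```

With this normalisation a single wrong letter costs more in a short word such as "mus" than in a longer one, so a fixed percentage cutoff behaves differently across word lengths.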
Data:
df <- structure(list(article_title = c("I like chickens and how to find chickens",
"A Horse hootio is going to the rainbow", "A Cat caticus is eating cheese",
"A Dog dogigo runs over a car", "A Hippa potamus is in the sauna",
"Mos musculus found on a boat", "A sentence not related to animals"
)), class = "data.frame", row.names = c(NA, -7L))