Replace whole word or words with partial match in R
I have a data frame with thousands of misspelled city names. I need to correct them, but I can't find a solution even though I've searched extensively. I've tried several functions and approaches.
This is a miniature sample of the data:
citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
"city" = c("BORNE","BOERNAE","BARNE","BOERNE",
"GALDEN","GELDON","GOELDEN","GOLDEN"))
num city
1 1 BORNE
2 2 BOERNAE
3 3 BARNE
4 4 BOERNE
5 5 GALDEN
6 6 GELDON
7 7 GOELDEN
8 8 GOLDEN
These are some of the functions I've tried; I tried many more, including str_replace and str_detect:
cit <- function(x){
ifelse(x %in% grepl(c("BOR","BOE","BAR")),"BOERNE",
ifelse(x %in% grepl(c("GAL","GEL","GOE")), "GOLDEN", "OTHER"))
}
Or:
cit <- function(x){
ifelse(x %in% c("BOR","BOE","BAR"),"BOERNE",
ifelse(x %in% c("GAL","GEL","GOE"), "GOLDEN", "OTHER"))
}
Run code:
`citA$city2 <- cit(citA$city)`
Incorrect result:
num city city2
1 1 BORNE OTHER
2 2 BOERNAE OTHER
3 3 BARNE OTHER
4 4 BOERNE OTHER
5 5 GALDEN OTHER
6 6 GELDON OTHER
7 7 GOELDEN OTHER
8 8 GOLDEN OTHER
Also tried:
citA$city[grepl(c("BOR","BOE","BAR"),citA$city)] <- "BOERNE"
But that gives a warning, and only the first pattern is used:
Warning message:
In grepl(c("BOR", "BOE", "BAR"), citA$city) :
argument 'pattern' has length > 1 and only the first element will be used
Your ideas would be greatly helpful!
We can `paste` the prefixes into a single string for the `pattern` argument of `grepl`, separated by `|` (meaning OR). The `pattern` argument of `grep`/`grepl` is not vectorized, i.e. it takes only a single element:
citA$city[grepl(paste(c("BOR","BOE","BAR"), collapse="|"),citA$city)] <- "BOERNE"
citA
# num city
#1 1 BOERNE
#2 2 BOERNE
#3 3 BOERNE
#4 4 BOERNE
#5 5 GALDEN
#6 6 GELDON
#7 7 GOELDEN
#8 8 GOLDEN
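To see why the `%in%` attempt labels every row OTHER while the collapsed pattern works, here is a minimal sketch. The names `prefixes`, `pattern`, `exact`, `partial` and `anchored` are illustrative only, and anchoring with `^` assumes the misspellings always keep their first three letters.

```r
city <- c("BORNE", "BOERNAE", "BARNE", "BOERNE",
          "GALDEN", "GELDON", "GOELDEN", "GOLDEN")
prefixes <- c("BOR", "BOE", "BAR")   # illustrative name, not from the question

# %in% tests for exact equality with one of the prefixes,
# so no full city name ever matches
exact <- city %in% prefixes

# Collapse the prefixes into one regular expression: "BOR|BOE|BAR"
pattern <- paste(prefixes, collapse = "|")
partial <- grepl(pattern, city)

# Optional: anchor at the start of the string so that a name that merely
# contains a prefix somewhere in the middle is not matched
anchored <- grepl(paste0("^(", pattern, ")"), city)
```

Here `exact` is `FALSE` for every row, while `partial` and `anchored` are `TRUE` for the first four rows only.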
NOTE: The column 'city' is created as a `factor` here (the default before R 4.0.0). It should be of class `character`, which you get by setting `stringsAsFactors = FALSE`:
citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
"city" = c("BORNE","BOERNAE","BARNE","BOERNE",
"GALDEN","GELDON","GOELDEN","GOLDEN"),
stringsAsFactors = FALSE)
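A short sketch of why the column class matters, assuming only base R. The object names (`as_factor_df`, `as_char_df`, `f`) are made up for illustration. Assigning a value that is not an existing level into a factor yields `NA`, whereas a character column accepts any string.

```r
city <- c("BORNE", "GALDEN")

# Set stringsAsFactors explicitly so the result does not depend on the R version
as_factor_df <- data.frame(city = city, stringsAsFactors = TRUE)
as_char_df   <- data.frame(city = city, stringsAsFactors = FALSE)

is.factor(as_factor_df$city)     # TRUE
is.character(as_char_df$city)    # TRUE

# "BOERNE" is not a level of this factor, so the assignment produces NA
# (R also emits an "invalid factor level" warning, silenced here)
f <- as_factor_df$city
suppressWarnings(f[1] <- "BOERNE")

# The character column simply takes the new value
as_char_df$city[1] <- "BOERNE"

# An existing factor column can be converted in place
as_factor_df$city <- as.character(as_factor_df$city)
```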
If you have many such patterns, you can use `case_when` from `dplyr`:
library(dplyr)
library(stringr)
citA %>%
  mutate(city2 = case_when(str_detect(city, 'BOR|BOE|BAR') ~ 'BOERNE',
                           str_detect(city, 'GAL|GEL|GOE|GOL') ~ 'GOLDEN',
                           TRUE ~ 'OTHER'))
# num city city2
#1 1 BORNE BOERNE
#2 2 BOERNAE BOERNE
#3 3 BARNE BOERNE
#4 4 BOERNE BOERNE
#5 5 GALDEN GOLDEN
#6 6 GELDON GOLDEN
#7 7 GOELDEN GOLDEN
#8 8 GOLDEN GOLDEN
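When the list of corrections grows, writing one `case_when` branch per city gets tedious. A possible alternative, sketched here in base R, keeps the patterns in a named vector and loops over it. The `lookup` vector, the extra `"AUSTIN"` row, and the `^` anchors are assumptions added for illustration; they are not part of the original data.

```r
city <- c("BORNE", "BOERNAE", "BARNE", "BOERNE",
          "GALDEN", "GELDON", "GOELDEN", "GOLDEN", "AUSTIN")

# Hypothetical lookup: names are the corrected spellings,
# values are the regular expressions that identify misspellings
lookup <- c(BOERNE = "^(BOR|BOE|BAR)",
            GOLDEN = "^(GAL|GEL|GOE|GOL)")

city2 <- rep("OTHER", length(city))   # default for names that match nothing
for (fixed in names(lookup)) {
  # Only fill rows that are still unassigned, so the first matching pattern wins
  hit <- grepl(lookup[[fixed]], city) & city2 == "OTHER"
  city2[hit] <- fixed
}
city2
```

The first four entries become BOERNE, the next four GOLDEN, and the unmatched row stays OTHER, mirroring the `TRUE ~ 'OTHER'` fallback above.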
I've got a package on GitHub that may help; it allows recoding of factor levels with regex matching. Install the package with:
devtools::install_github("jwilliman/xfactor")
citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
"city" = c("BORNE","BOERNAE","BARNE","BOERNE",
"GALDEN","GELDON","GOELDEN","GOLDEN"))
citA$city2 <- xfactor::xfactor(citA$city, levels = c(BOERNE = "BOR|BOE|BAR", GOLDEN = "GAL|GEL|GOE|GOL"))
citA
#> num city city2
#> 1 1 BORNE BOERNE
#> 2 2 BOERNAE BOERNE
#> 3 3 BARNE BOERNE
#> 4 4 BOERNE BOERNE
#> 5 5 GALDEN GOLDEN
#> 6 6 GELDON GOLDEN
#> 7 7 GOELDEN GOLDEN
#> 8 8 GOLDEN GOLDEN
Created on 2020-04-20 by the reprex package (v0.3.0)
Otherwise, you could use the following function to clean/update the factor levels; it uses a similar syntax.
citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
"city" = c("BORNE","BOERNAE","BARNE","BOERNE",
"GALDEN","GELDON","GOELDEN","GOLDEN"),
stringsAsFactors = TRUE) # make_levels() needs a factor; no longer the default from R 4.0.0
make_levels <- function(.f, patterns, replacement = NULL, ignore.case = FALSE) {
  lvls <- levels(.f)
  # Replacements can be listed in the replacement argument, taken as names in patterns, or the patterns themselves.
  if(is.null(replacement)) {
    if(is.null(names(patterns)))
      replacement <- patterns
    else
      replacement <- names(patterns)
  }
  # Find matching levels
  lvl_match <- setNames(vector("list", length = length(patterns)), replacement)
  for(i in seq_along(patterns))
    lvl_match[[replacement[i]]] <- grep(patterns[i], lvls, ignore.case = ignore.case, value = TRUE)
  # Append other non-matching levels
  lvl_other <- setdiff(lvls, unlist(lvl_match))
  lvl_all <- append(
    lvl_match,
    setNames(as.list(lvl_other), lvl_other)
  )
  return(lvl_all)
}
levels(citA$city) <- make_levels(citA$city, c(BOERNE = "BOR|BOE|BAR", GOLDEN = "GAL|GEL|GOE|GOL"))
citA
#> num city
#> 1 1 BOERNE
#> 2 2 BOERNE
#> 3 3 BOERNE
#> 4 4 BOERNE
#> 5 5 GOLDEN
#> 6 6 GOLDEN
#> 7 7 GOLDEN
#> 8 8 GOLDEN
Created on 2020-04-20 by the reprex package (v0.3.0)
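The `make_levels` approach relies on a base R feature: assigning a named list to `levels()` merges the old levels listed in each element into one new level named after that element. A minimal sketch of that mechanism on its own, with made-up sample values:

```r
f <- factor(c("BORNE", "BARNE", "GALDEN", "GOLDEN", "AUSTIN"))

# Each list name becomes a new level; each element lists the old levels merged into it.
# Old levels left out of the list would turn into NA, which is why make_levels()
# appends the non-matching levels unchanged.
levels(f) <- list(BOERNE = c("BARNE", "BORNE"),
                  GOLDEN = c("GALDEN", "GOLDEN"),
                  AUSTIN = "AUSTIN")

as.character(f)
levels(f)
```

After the assignment the values read BOERNE, BOERNE, GOLDEN, GOLDEN, AUSTIN, and the factor has exactly those three levels.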