Replace whole word or words with partial match in R

I have a data frame with thousands of misspelled city names. I need to correct these and can't find a solution, though I've searched extensively. I've tried several functions and approaches.

This is a miniature sample of the data:

citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
               "city" = c("BORNE","BOERNAE","BARNE","BOERNE",
                          "GALDEN","GELDON","GOELDEN","GOLDEN"))

   num    city
1   1   BORNE
2   2 BOERNAE
3   3   BARNE
4   4  BOERNE
5   5  GALDEN
6   6  GELDON
7   7 GOELDEN
8   8  GOLDEN

These are some of the functions I've tried. I've also tried many more, including str_replace and str_detect:

cit <- function(x){
  ifelse(x %in% grepl(c("BOR","BOE","BAR")),"BOERNE",
         ifelse(x %in% grepl(c("GAL","GEL","GOE")), "GOLDEN", "OTHER"))
}

Or

cit <- function(x){
  ifelse(x %in% c("BOR","BOE","BAR"),"BOERNE",
         ifelse(x %in% c("GAL","GEL","GOE"), "GOLDEN", "OTHER"))
}

Run code:

`citA$city2 <- cit(citA$city)`

Incorrect result:

  num    city city2
1   1   BORNE OTHER
2   2 BOERNAE OTHER
3   3   BARNE OTHER
4   4  BOERNE OTHER
5   5  GALDEN OTHER
6   6  GELDON OTHER
7   7 GOELDEN OTHER
8   8  GOLDEN OTHER

Also tried:

citA$city[grepl(c("BOR","BOE","BAR"),citA$city)] <- "BOERNE" 

But that gives a warning:

Warning message:
In grepl(c("BOR", "BOE", "BAR"), citA$city) :
  argument 'pattern' has length > 1 and only the first element will be used

Your ideas would be greatly helpful!

We can paste the fragments into a single pattern string for grepl, separated by | (meaning OR). The pattern argument of grep/grepl is not vectorized, i.e. it takes only a single element.

citA$city[grepl(paste(c("BOR","BOE","BAR"), collapse="|"),citA$city)] <- "BOERNE" 
citA
#  num    city
#1   1  BOERNE
#2   2  BOERNE
#3   3  BOERNE
#4   4  BOERNE
#5   5  GALDEN
#6   6  GELDON
#7   7 GOELDEN
#8   8  GOLDEN

NOTE: In R versions before 4.0.0, the column 'city' is created as a factor by default. It should be of class character, which stringsAsFactors = FALSE ensures.

Data

citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
           "city" = c("BORNE","BOERNAE","BARNE","BOERNE",
                      "GALDEN","GELDON","GOELDEN","GOLDEN"),
        stringsAsFactors = FALSE)
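
When there are several target spellings, the same paste/grepl idea can be looped over a lookup list. This is only a minimal base R sketch: the list name `fixes` and the default value "OTHER" are assumptions for illustration, and if a city matched more than one pattern the later entry would overwrite the earlier one.

```r
# Sample data from the question, with city kept as character.
citA <- data.frame(num = 1:8,
                   city = c("BORNE", "BOERNAE", "BARNE", "BOERNE",
                            "GALDEN", "GELDON", "GOELDEN", "GOLDEN"),
                   stringsAsFactors = FALSE)

# Hypothetical lookup: names are the corrected spellings,
# values are the fragments that identify each city.
fixes <- list(BOERNE = c("BOR", "BOE", "BAR"),
              GOLDEN = c("GAL", "GEL", "GOE", "GOL"))

citA$city2 <- "OTHER"  # default when nothing matches
for (correct in names(fixes)) {
  # Collapse the fragments into one regex, e.g. "BOR|BOE|BAR"
  pattern <- paste(fixes[[correct]], collapse = "|")
  citA$city2[grepl(pattern, citA$city)] <- correct
}
citA
```

Writing to a new column (city2) keeps the original spellings available for checking the result.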

If you have many such patterns, you can use case_when from dplyr:

library(dplyr)
library(stringr)

citA %>%
  mutate(city2 = case_when(str_detect(city, 'BOR|BOE|BAR') ~ 'BOERNE', 
                           str_detect(city, 'GAL|GEL|GOE|GOL') ~ 'GOLDEN',
                           TRUE ~ 'OTHER'))

#  num    city  city2
#1   1   BORNE BOERNE
#2   2 BOERNAE BOERNE
#3   3   BARNE BOERNE
#4   4  BOERNE BOERNE
#5   5  GALDEN GOLDEN
#6   6  GELDON GOLDEN
#7   7 GOELDEN GOLDEN
#8   8  GOLDEN GOLDEN
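
One caveat with patterns like 'BOR|BOE|BAR' is that they match anywhere in the string, not only at the start. The sketch below uses base R grepl to show the difference; "ELBORNE" is a made-up name used only for illustration. As far as I know, the anchored pattern string can be passed to str_detect in the same way.

```r
# "ELBORNE" is a hypothetical city name that contains "BOR" in the middle.
cities <- c("BORNE", "BARNE", "ELBORNE", "GOLDEN")

pattern_anywhere <- "BOR|BOE|BAR"      # matches the fragment at any position
pattern_prefix   <- "^(BOR|BOE|BAR)"   # ^ anchors the match to the start

grepl(pattern_anywhere, cities)
grepl(pattern_prefix, cities)
```

With the anchored version, only names that begin with one of the fragments are recoded, which may be safer on a column with thousands of distinct values.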

I've got a package on GitHub that may help; it allows recoding of factor levels with regex matching. Install the package with

devtools::install_github("jwilliman/xfactor")

citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
                   "city" = c("BORNE","BOERNAE","BARNE","BOERNE",
                              "GALDEN","GELDON","GOELDEN","GOLDEN"))

citA$city2 <- xfactor::xfactor(citA$city, levels = c(BOERNE = "BOR|BOE|BAR", GOLDEN = "GAL|GEL|GOE|GOL"))

citA
#>   num    city  city2
#> 1   1   BORNE BOERNE
#> 2   2 BOERNAE BOERNE
#> 3   3   BARNE BOERNE
#> 4   4  BOERNE BOERNE
#> 5   5  GALDEN GOLDEN
#> 6   6  GELDON GOLDEN
#> 7   7 GOELDEN GOLDEN
#> 8   8  GOLDEN GOLDEN

Created on 2020-04-20 by the reprex package (v0.3.0)

Otherwise, you could use the following function to clean/update the factor levels; it uses a similar syntax.


citA <- data.frame("num" = c(1,2,3,4,5,6,7,8),
                   "city" = c("BORNE","BOERNAE","BARNE","BOERNE",
                              "GALDEN","GELDON","GOELDEN","GOLDEN"),
                   stringsAsFactors = TRUE)  # make_levels() needs a factor (the default before R 4.0.0)

make_levels <- function(.f, patterns, replacement = NULL, ignore.case = FALSE) {

  lvls <- levels(.f)

  # Replacements can be listed in the replacement argument, taken as names in patterns, or the patterns themselves.
  if(is.null(replacement)) {
    if(is.null(names(patterns)))
      replacement <- patterns
    else
      replacement <- names(patterns)
  }

  # Find matching levels
  lvl_match <- setNames(vector("list", length = length(patterns)), replacement)
  for(i in seq_along(patterns))
    lvl_match[[replacement[i]]] <- grep(patterns[i], lvls, ignore.case = ignore.case, value = TRUE)

  # Append other non-matching levels
  lvl_other <- setdiff(lvls, unlist(lvl_match))
  lvl_all <- append(
    lvl_match, 
    setNames(as.list(lvl_other), lvl_other)
  )

  return(lvl_all)

}

levels(citA$city) <- make_levels(citA$city, c(BOERNE = "BOR|BOE|BAR", GOLDEN = "GAL|GEL|GOE|GOL"))

citA
#>   num   city
#> 1   1 BOERNE
#> 2   2 BOERNE
#> 3   3 BOERNE
#> 4   4 BOERNE
#> 5   5 GOLDEN
#> 6   6 GOLDEN
#> 7   7 GOLDEN
#> 8   8 GOLDEN

Created on 2020-04-20 by the reprex package (v0.3.0)
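
The function above relies on a base R feature: assigning a named list to levels() of a factor merges the listed old levels under each name. A minimal sketch of that mechanism on its own, using a small hand-written list rather than the regex matching:

```r
# A small factor with four misspelled/correct city names.
f <- factor(c("BORNE", "BOERNAE", "GALDEN", "GOLDEN"))

# Each list name becomes a new level; its elements are the old levels to merge.
levels(f) <- list(BOERNE = c("BORNE", "BOERNAE"),
                  GOLDEN = c("GALDEN", "GOLDEN"))

as.character(f)
levels(f)
```

Old levels that are left out of the list become NA, which is presumably why make_levels appends the non-matching levels before returning.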
