Changing spelling for multiple words at a time in R/replacing many words at once

I have a dataset (a survey) with a birth country column, where people wrote in their country of birth. An example:

    1 america
    2 usa
    3 american
    4 us of a
    5 united states
    6 england
    7 english
    8 great britain
    9 uk 
    10 united kingdom 

How I want it to look:

    1 america
    2 america
    3 america
    4 america
    5 america
    6 uk
    7 uk
    8 uk
    9 uk
    10 uk

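I tried using str_replace, entering the different spellings manually and replacing them with "america", but when I look at my dataset nothing has changed, e.g.: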

survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain",  "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")

survey$birth_country <- str_replace(survey$birth_country, ' "united state"|"united statea"|"united states of america"', "america")

Thanks in advance.

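Come up with some patterns that match only one country each, then basically loop over what you are already doing (you can swap the replacement step below for your favorite function):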

survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain",  "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")

## use a _named_ list of regular expressions
## the name will be the replacement string
l <- list(
  america = 'amer|us|states',
  uk = 'eng|brit|king|uk',
  'another country' = 'ano|an co',
  chaz = 'chaz|chop'
)

f <- function(x, list) {
  for (ii in seq_along(list)) {
    x[grepl(list[[ii]], x, ignore.case = TRUE)] <- names(list)[ii]
  }
  x
}

## test it
f(survey$birth_country, l)
# [1] "america" "america" "america" "america" "america" "uk"      "uk"      "uk"      "uk"      "uk"     

within(survey, {
  clean <- f(birth_country, l)
})
#     birth_country   clean
# 1         america america
# 2             usa america
# 3        american america
# 4         us of a america
# 5   united states america
# 6         england      uk
# 7         english      uk
# 8   great britain      uk
# 9              uk      uk
# 10 united kingdom      uk

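Note that 1) if you don't supply a pattern that matches, nothing is changed, but 2) if you supply a pattern that matches both countries (e.g., "united"), the first one in the list will be used (unless the replacement itself also matches a later pattern).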
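To make the second point concrete, here is a minimal sketch. It repeats `f` from above so that it runs on its own; the lists `overlap` and `chained` are hypothetical and exist only to show the order dependence.

```r
# Same helper as above, repeated so this snippet is self-contained
f <- function(x, list) {
  for (ii in seq_along(list)) {
    x[grepl(list[[ii]], x, ignore.case = TRUE)] <- names(list)[ii]
  }
  x
}

# Hypothetical patterns: "united" is listed under both names
overlap <- list(america = 'united|amer', uk = 'united|brit')
res_overlap <- f(c("united states", "united kingdom", "great britain"), overlap)
res_overlap
# [1] "america" "america" "uk"

# The exception: the replacement "america" is itself matched by a later pattern
chained <- list(america = 'united', uk = 'amer')
res_chained <- f(c("united states", "united kingdom"), chained)
res_chained
# [1] "uk" "uk"
```

In the first call "united kingdom" ends up as "america" because the `america` entry comes first. In the second call every value is overwritten again, because the later pattern matches the replacement string written in the first pass.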

If you are open to using mutate from the tidyverse, you can do the following:

library(tidyverse)
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain",  "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")

americas <- c("america", "usa", "american", "us of a", "united states")
englands <- c("england", "english", "great britain", "uk", "united kingdom")
survey %>% 
  # check both vectors; anything in neither list is left unchanged
  mutate(birth_country = ifelse(birth_country %in% americas, 'america',
                         ifelse(birth_country %in% englands, 'uk', birth_country)))
#>    birth_country
#> 1        america
#> 2        america
#> 3        america
#> 4        america
#> 5        america
#> 6             uk
#> 7             uk
#> 8             uk
#> 9             uk
#> 10            uk
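The membership-based recoding above can also be written as a lookup table, which may be easier to maintain as more spellings turn up. This is only a sketch in base R: the names `lookup`, `raw` and `cleaned` are made up for illustration, and "france" is added purely to show what happens to a spelling that is not in the table.

```r
# Hypothetical lookup table: names are the raw spellings, values the cleaned label
lookup <- c(
  "america" = "america", "usa" = "america", "american" = "america",
  "us of a" = "america", "united states" = "america",
  "england" = "uk", "english" = "uk", "great britain" = "uk",
  "uk" = "uk", "united kingdom" = "uk"
)

raw <- c("usa", "english", "united kingdom", "france")

# Index the table by name; spellings without an entry come back as NA
cleaned <- unname(lookup[raw])

# Keep the original value wherever the table had no entry
cleaned[is.na(cleaned)] <- raw[is.na(cleaned)]
cleaned
# [1] "america" "uk"      "uk"      "france"
```

Unlike a regular expression, this only matches exact spellings, so it will not recode anything by accident, but every variant has to be listed explicitly.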

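It looks like the problem is how you specified the regular expression. Try this (updated following @Gabriella's comment, with another tidyverse approach similar to @MarBIo's):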

library(tidyverse)    
survey <- survey %>%
    mutate(birth_country = if_else(
                str_detect(birth_country, 
                           "(united state)|(united statea)|(united states of america)"), #If your regular expression matches any in birth_country
                "america", #Change it to "america"
                birth_country #Otherwise, keep as is.
                ) #end of if_else
           ) #end of mutate

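Others have suggested coming up with a more sophisticated regular expression, which you can certainly do as well. However, consecutive "or" (i.e., "|") statements in a regular expression do work.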
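A small sketch of why the pattern from the question changes nothing. To keep it free of package dependencies it uses base R's `grepl()` and `sub()` as stand-ins for `str_detect()` and `str_replace()`; that substitution is an assumption made for this example, and the vector `x` is made up.

```r
x <- c("united states", "usa")

# Pattern from the question: the inner double quotes are literal characters,
# so each alternative only matches text that itself contains quote marks
quoted <- ' "united state"|"united statea"|"united states of america"'
hit_quoted <- grepl(quoted, x)
hit_quoted
# [1] FALSE FALSE

# The same alternatives without the inner quotes
plain <- 'united state|united statea|united states of america'
hit_plain <- grepl(plain, x)
hit_plain
# [1]  TRUE FALSE

# Replacing only the matched text leaves the trailing "s" behind
partial <- sub(plain, "america", x)
partial
# [1] "americas" "usa"

# Overwriting the whole value wherever the pattern is detected avoids that
whole <- x
whole[grepl(plain, whole)] <- "america"
whole
# [1] "america" "usa"
```

The third step is the reason for detecting a match and then overwriting the entire value, rather than substituting only the matched part of the string.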
