Changing spelling for multiple words at a time in R/replacing many words at once
I have a dataset (survey) with a birth-country column in which people wrote their country of birth in free text. An example:
1 america
2 usa
3 american
4 us of a
5 united states
6 england
7 english
8 great britain
9 uk
10 united kingdom
How I would like it to look:
1 america
2 america
3 america
4 america
5 america
6 uk
7 uk
8 uk
9 uk
10 uk
I tried using str_replace to list the different spellings by hand and replace them with "america", but when I look at my dataset nothing has changed, e.g.
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
survey$birth_country <- str_replace(survey$birth_country, ' "united state"|"united statea"|"united states of america"', "america")
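Thanks in advance
Come up with patterns that match only one country each, then essentially loop over what you are already doing (you can swap the replacement step below for your favorite function)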
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
## use a _named_ list of regular expressions
## the name will be the replacement string
l <- list(
  america = 'amer|us|states',
  uk = 'eng|brit|king|uk',
  'another country' = 'ano|an co',
  chaz = 'chaz|chop'
)
f <- function(x, list) {
  for (ii in seq_along(list)) {
    x[grepl(list[[ii]], x, ignore.case = TRUE)] <- names(list)[ii]
  }
  x
}
## test it
f(survey$birth_country, l)
# [1] "america" "america" "america" "america" "america" "uk" "uk" "uk" "uk" "uk"
within(survey, {
  clean <- f(birth_country, l)
})
#     birth_country   clean
# 1         america america
# 2             usa america
# 3        american america
# 4         us of a america
# 5   united states america
# 6         england      uk
# 7         english      uk
# 8   great britain      uk
# 9              uk      uk
# 10 united kingdom      uk
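Note that 1) if you don't supply a pattern that matches, nothing changes, but 2) if you supply a pattern that matches both countries (e.g. "united"), the first one in the list is used (unless the replacement itself also matches a later pattern)
If you are open to using tidyverse's mutate, you can do the following: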
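The second caveat can be checked directly. The sketch below re-declares `f` from above so that it runs on its own; the `overlap` lists, the sample vectors, and the tightened pattern are made up for illustration and are not part of the original answer. It also shows a related pitfall: a short alternative such as `us` matches inside unrelated values, which word boundaries (`\\b`) can limit.

```r
# f() as defined in the answer above
f <- function(x, list) {
  for (ii in seq_along(list)) {
    x[grepl(list[[ii]], x, ignore.case = TRUE)] <- names(list)[ii]
  }
  x
}

x <- c("united states", "united kingdom")

# "united" matches both inputs; the first entry rewrites them to "america",
# and "america" no longer matches the second entry's pattern
overlap <- list(america = "united", uk = "united")
first_wins <- f(x, overlap)

# here the replacement "america" itself matches the later pattern "amer",
# so the second entry overwrites the result of the first
overlap2 <- list(america = "united", uk = "united|amer")
later_wins <- f(x, overlap2)

# a bare "us" also matches inside "australia"
y <- c("australia", "us of a", "usa")
loose <- f(y, list(america = "amer|us|states"))

# \\b restricts the match to the standalone words "us" or "usa"
tight <- f(y, list(america = "amer|\\busa?\\b|states"))

print(first_wins)
print(later_wins)
print(loose)
print(tight)
```

Ordering the list from most specific to least specific pattern, and checking the result with `table()` on real data, is probably the simplest way to catch such collisions.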
library(tidyverse)
survey <- structure(list(birth_country = c("america", "usa", "american", "us of a", "united states", "england", "english", "great britain", "uk", "united kingdom")), row.names = c(NA, -10L), class = "data.frame")
americas <- c("america", "usa", "american", "us of a", "united states")
englands <- c("england", "english", "great britain")
## note: `englands` is not used below; ifelse() recodes every value
## that is not in `americas` to 'UK'
survey %>%
  mutate(birth_country = ifelse(birth_country %in% americas, 'america', 'UK'))
#>    birth_country
#> 1        america
#> 2        america
#> 3        america
#> 4        america
#> 5        america
#> 6             UK
#> 7             UK
#> 8             UK
#> 9             UK
#> 10            UK
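The `ifelse()` call above sends every value outside `americas` to 'UK', so an unexpected answer such as "france" would be relabelled as well. As one possible alternative, here is a base-R sketch of the same exact-match idea using a lookup vector, which leaves unlisted values as they are. The `groups` list, the extra "france" value, and the variable names are assumptions added for illustration.

```r
birth_country <- c("america", "usa", "american", "us of a", "united states",
                   "england", "english", "great britain", "uk",
                   "united kingdom", "france")

# each element name is the cleaned label; its values are the accepted spellings
groups <- list(
  america = c("america", "usa", "american", "us of a", "united states"),
  uk      = c("england", "english", "great britain", "uk", "united kingdom")
)

# flatten into a named vector: names are raw spellings, values are labels
lookup <- setNames(rep(names(groups), lengths(groups)),
                   unlist(groups, use.names = FALSE))

# look up each value; spellings missing from the table come back as NA
mapped <- unname(lookup[birth_country])

# keep the original text wherever no label was found
clean <- ifelse(is.na(mapped), birth_country, mapped)

print(data.frame(birth_country, clean))
```

Exact matching is stricter than the regular-expression approach: a new misspelling is not recoded until it is added to `groups`, but nothing is recoded by accident either.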
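It looks like the problem is in how you specified the regular expression. Try this (updated following @Gabriella's comment, and using another tidyverse approach similar to @MarBIo's):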
library(tidyverse)
survey <- survey %>%
  mutate(birth_country = if_else(
    str_detect(birth_country,
               "(united state)|(united statea)|(united states of america)"), # If the regular expression matches a value in birth_country
    "america",     # change it to "america"
    birth_country  # otherwise, keep it as is
  ) # end of if_else
  ) # end of mutate
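Others have suggested coming up with a more elaborate regular expression, which you can certainly do. However, consecutive "or" (i.e. "|") alternatives in a regular expression do work.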
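To see why the original call changed nothing: quotation marks inside a pattern string are ordinary characters, so the question's pattern only matches text that literally contains the double quotes (and, for the first alternative, a leading space). The sketch below uses base R's `grepl()` and `sub()` so that it runs without any packages; the variable names are made up for illustration.

```r
x <- c("america", "usa", "american", "us of a", "united states")

# pattern from the question: the inner double quotes and the leading space
# are literal characters that must appear in the data to match
quoted <- ' "united state"|"united statea"|"united states of america"'
hits_quoted <- grepl(quoted, x)

# the same alternatives without the embedded quotes
plain <- "united state|united statea|united states of america"
hits_plain <- grepl(plain, x)

# replacing only the matched text leaves the trailing "s" behind
partial <- sub(plain, "america", x)

# detecting first and then overwriting the whole value avoids that
whole <- ifelse(hits_plain, "america", x)

print(hits_quoted)
print(hits_plain)
print(partial)
print(whole)
```

`str_replace()` likewise replaces only the matched portion of each string, which is one reason the answer above pairs `str_detect()` with `if_else()` instead of calling `str_replace()` with the corrected pattern.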