I have a very messy data frame in which one column contains values that are understandable to humans but not to computers, a bit like the one below.
df<-data.frame("id"=c(1:10),
"colour"=c("re d", ", red", "re-d","green", "gre, en", ", gre-en", "blu e", "green", ", blue", "bl ue"))
I can filter the df with str_detect
df %>% filter(str_detect(tolower(colour), pattern = "gr"))
But I want to rename all the filtered results to the same value so I can wrangle it.
Any suggestions?
I tried to separate with pattern but was unsuccessful.
EDIT: Not all dots and spaces are unnecessary in the df that I am working with. Let's say that the correct way of writing green in the made-up df is "gr. een".
EDIT2:
Wanted result with faked spelling of colours just to get an idea:
id colour
1 r. ed
2 r. ed
3 r. ed
4 gr. een
5 gr. een
6 gr. een
7 blu. e
8 gr. een
9 blu. e
10 blu. e
You can use mgsub from the textclean package to replace multiple patterns with their corresponding replacements:
df<-data.frame("id"=c(1:10),
"colour"=c("re d", ", red", "re-d","green", "gre, en",
", gre-en", "blu e", "green", ", blue", "bl ue"))
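library(textclean)
# Patterns are applied in order, so "gr" has to come before "re":
# "green" also contains "re", but "gr. een" no longer does once it is replaced.
df$colour = mgsub(df$colour,
                  pattern = c(".*gr.*", ".*re.*", ".*bl.*"),
                  replacement = c("gr. een", "r. ed", "blu. e"), fixed = FALSE)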
df
# id colour
# 1 1 r. ed
# 2 2 r. ed
# 3 3 r. ed
# 4 4 gr. een
# 5 5 gr. een
# 6 6 gr. een
# 7 7 blu. e
# 8 8 gr. een
# 9 9 blu. e
# 10 10 blu. e
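If you would rather not depend on textclean, the same idea can be sketched in base R with a lookup table. This is only a minimal sketch under the assumptions of the made-up data: the name `canonical` and the choice of keys are hypothetical, and the keys are tested in order, so "gr" has to come before "re" (because "green" also contains "re").

```r
# Same made-up data as in the question.
df <- data.frame(
  id = 1:10,
  colour = c("re d", ", red", "re-d", "green", "gre, en",
             ", gre-en", "blu e", "green", ", blue", "bl ue"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: regex key -> canonical spelling.
# Order matters: "gr" is tested before "re".
canonical <- c("gr" = "gr. een", "re" = "r. ed", "bl" = "blu. e")

# Start from NA so that values matching no key are easy to spot afterwards.
cleaned <- rep(NA_character_, nrow(df))
for (key in names(canonical)) {
  # Only fill rows that have not been claimed by an earlier key.
  hit <- is.na(cleaned) & grepl(key, df$colour, ignore.case = TRUE)
  cleaned[hit] <- canonical[[key]]
}
df$colour <- cleaned
```

Any row left as NA did not match a key, which makes it easy to find spellings the table does not cover yet.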
Here are two solutions for pre-processing your data; one was already given in the comments:
library(dplyr)
library(stringr)
df %>%
  # Drop everything that is not a letter; "[^A-z]" would also keep [ \ ] ^ _ and `
  mutate(colour2 = gsub("[^A-Za-z]", "", colour)) %>%
  filter(str_detect(tolower(colour2), pattern = "green"))
Taking the inverse approach, you can use stringr to extract only the letters:
library(dplyr)
library(stringr)
df %>%
  # Extract the letters of each value and paste them back together
  mutate(colour2 = sapply(str_extract_all(colour, "[A-Za-z]"), paste0, collapse = "")) %>%
  filter(str_detect(tolower(colour2), pattern = "green"))
Your selection will be more robust, and the cleaned values are already available in the new column colour2:
id colour colour2
1 4 green green
2 5 gre, en green
3 6 , gre-en green
4 8 green green
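Once the values are reduced to letters only, they can also be mapped to whatever spelling counts as correct in your data. The following is a minimal sketch in base R, assuming the made-up spellings from the question; the names `canonical` and `colour_clean` are hypothetical.

```r
# Same made-up data as in the question.
df <- data.frame(
  id = 1:10,
  colour = c("re d", ", red", "re-d", "green", "gre, en",
             ", gre-en", "blu e", "green", ", blue", "bl ue"),
  stringsAsFactors = FALSE
)

# Normalise: keep letters only and lower-case them.
letters_only <- tolower(gsub("[^A-Za-z]", "", df$colour))

# Hypothetical mapping from normalised word to the canonical spelling.
canonical <- c(red = "r. ed", green = "gr. een", blue = "blu. e")

# Look up by name; words missing from the mapping become NA.
df$colour_clean <- unname(canonical[letters_only])
```

Because the lookup is an exact match on the whole normalised word, it does not depend on the order of the patterns, and unexpected values show up as NA instead of being silently renamed.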
If you just want to rename all of the filtered results, how about:
df<-data.frame("id"=c(1:10),
"colour"=c("re d", ", red", "re-d","green", "gre, en", ", gre-en", "blu e", "green", ", blue", "bl ue"))
library(stringr)
df[str_detect(tolower(df[,"colour"]), pattern = "gr"), "colour"] <- "green"