I have a dataset of a hundred million rows, about 10 of which contain some sort of Unicode replacement character. This particular character is printed as "<U+FFFD>", but there are others, too.
I want to remove the character, but I wasn't able to come up with a way to do that.
str <- "торгово производственн��я компания"
gsub("<U+FFFD>", "", str)
"торгово производственн��я компания"
If I need to provide any additional info, please let me know. I would also be very grateful for an explanation of what exactly is happening here (why a normal gsub doesn't work and why the character is displayed like that).
You are using the gsub function with a regex pattern as the first argument. The <U+FFFD> pattern matches <, one or more U symbols, and then the FFFD> sequence of characters.
That pattern would only match input like this:
> str2 <- "торгово <UUUFFFD> производственн��я компания"
> gsub("<U+FFFD>", "", str2)
[1] "торгово производственн��я компания"
Use a plain literal string replacement instead (fixed=TRUE turns off regex interpretation of the pattern):
> str <- "торгово производственн��я компания"
> gsub("\uFFFD", "", str, fixed=TRUE)
[1] "торгово производствення компания"
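To tie this back to the "why" part of the question: as far as I can tell, the <U+FFFD> text is only how R prints a character it cannot display in the current locale, so the string itself holds one code point rather than those eight characters. The sketch below uses made-up sample values (x, cleaned and literal are hypothetical names, not from the question) to show how to inspect what is stored, how to strip several unwanted code points in one pass, and how to escape the + if the data really does contain the literal text.

```r
# Hypothetical sample values; "\uFFFD" is the replacement character itself.
x <- c("abc\uFFFDxyz", "zero\u200Bwidth", "plain")

# The string stores a single code point, not the 8 characters "<U+FFFD>".
utf8ToInt("\uFFFD")  # 65533, i.e. 0xFFFD

# Remove several unwanted code points in one pass with a character class.
cleaned <- gsub("[\uFFFD\u200B]", "", x)

# Only if the data holds the literal text "<U+FFFD>": escape the "+" quantifier.
literal <- gsub("<U\\+FFFD>", "", "abc<U+FFFD>xyz")
```

For a single known character, the fixed=TRUE call above is the simpler choice; the character class is only worth it when there are several characters to drop.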
This worked best for me when applying the same concept to an entire data frame.
library(dplyr)
library(stringr)
# Remove zero-width spaces (U+200B), plus any whitespace before them, from every character column
df <- df %>%
  mutate(across(where(is.character), ~ str_remove_all(., "\\s*\u200b")))
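If adding packages is not an option, the same idea can probably be written in base R. This is a minimal sketch under assumptions: the data frame and its columns (id, name) are invented for illustration, and the character class lists only the two code points mentioned in this thread.

```r
# Hypothetical data frame with one character column holding stray code points.
df <- data.frame(
  id = 1:2,
  name = c("a\uFFFDb", "c\u200Bd"),
  stringsAsFactors = FALSE
)

# Find the character columns, then clean each one in place.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], function(col) gsub("[\uFFFD\u200B]", "", col))
```

Non-character columns are left untouched because only the columns flagged in is_chr are reassigned.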