r - Remove Unicode replacement character from a string
I have a dataset of a hundred million rows, about 10 of which contain some sort of Unicode replacement character. The text representation of this particular character is "<U+FFFD>", but there are others, too.
I want to remove the character, but I wasn't able to come up with a way to do that.
str <- "торгово производственн��я компания"
gsub("<U+FFFD>", "", str)
"торгово производственн��я компания"
If I need to provide any additional info, please let me know. I would also be very grateful for an explanation of what exactly is happening here (why a normal gsub doesn't work and why the character displays like that).
You are using the gsub function with a regex pattern as the first argument. In a regex, + is a quantifier, so the pattern <U+FFFD> matches <, then one or more U characters, and then the character sequence FFFD>. It never looks for the replacement character itself.
It would work like this:
> str2 <- "торгово <UUUFFFD> производственн��я компания"
> gsub("<U+FFFD>", "", str2)
[1] "торгово производственн��я компания"
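As for why it displays like that: `<U+FFFD>` appears to be only the way R prints a character it cannot show in the current locale. The string itself holds a single code point, not those eight characters. A minimal sketch to check this, using a made-up ASCII string `s` (an assumption for illustration, not the original data):

```r
# Hypothetical sample: "a", then U+FFFD, then "b"
s <- "a\uFFFDb"

n_chars <- nchar(s)          # counts characters, not bytes
code_points <- utf8ToInt(s)  # integer code point of each character

# The literal text "<U+FFFD>" is not in the string; the code point U+FFFD is
has_literal_text <- grepl("<U+FFFD>", s, fixed = TRUE)
has_code_point <- grepl("\uFFFD", s, fixed = TRUE)

print(n_chars)
print(code_points)
print(c(has_literal_text, has_code_point))
```

If the string really contained the text `<U+FFFD>`, `nchar` would report ten characters here rather than three.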
Use a plain literal string replacement instead:
> str <- "торгово производственн��я компания"
> gsub("\uFFFD", "", str, fixed=TRUE)
[1] "торгово производствення компания"
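With only about 10 affected rows out of a hundred million, it may be worth locating them before changing anything. A rough sketch on a small made-up vector (the names `x`, `has_repl`, and `cleaned` are hypothetical, not from the question):

```r
# Hypothetical sample data: one clean string, two containing U+FFFD
x <- c("clean", "bro\uFFFDken", "a\uFFFD\uFFFDb")

# Flag the elements that contain the replacement character
has_repl <- grepl("\uFFFD", x, fixed = TRUE)
affected_rows <- which(has_repl)

# Remove every occurrence, treating the pattern as a literal string
cleaned <- gsub("\uFFFD", "", x, fixed = TRUE)

print(affected_rows)
print(cleaned)
```

Using `fixed = TRUE` in both calls sidesteps regex interpretation entirely, which is the point of the fix above.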
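This worked best for me when applying the same concept to an entire data frame. Note that the pattern here targets the zero-width space (U+200B) plus any whitespace before it; swap in \uFFFD to remove the replacement character instead.
library(dplyr)
library(stringr)
# Remove zero-width spaces (and any preceding whitespace) from every character column
df <- df %>%
  mutate(across(where(is.character), ~ str_remove_all(.x, "\\s*\u200b")))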
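For the replacement character from the original question, the same idea can probably be done in base R without extra packages. This is only a sketch on a made-up two-row data frame; the names `df_demo` and `is_chr` are hypothetical:

```r
# Hypothetical sample data frame with one affected value
df_demo <- data.frame(
  id = 1:2,
  name = c("bro\uFFFDken", "fine"),
  stringsAsFactors = FALSE
)

# Identify character columns, then strip U+FFFD from each of them
is_chr <- vapply(df_demo, is.character, logical(1))
df_demo[is_chr] <- lapply(
  df_demo[is_chr],
  function(col) gsub("\uFFFD", "", col, fixed = TRUE)
)

print(df_demo)
```

Non-character columns such as `id` are left untouched because only the columns flagged in `is_chr` are reassigned.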