R - How to simplify text cleaning of special characters?

Question

I suspect there is a way to simplify this text preprocessing. However, I could not find a way to merge all these character replacements into a single line. I would like to avoid all the repetition in my current solution (see below):

Encoding(posts2$caption_clean) <- "UTF-8"
posts2$caption_clean <- iconv(posts2$caption_clean, "latin1", "UTF-8")
posts2$caption_clean <- gsub("Ã\\S*","",posts2$caption_clean) 
posts2$caption_clean <- gsub("â\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("ð\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Â\\S*","",posts2$caption_clean) 
posts2$caption_clean <- gsub("å\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ð\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ñ\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ù\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("Ø\\S*","",posts2$caption_clean) 
posts2$caption_clean <- gsub("Ú\\S*","",posts2$caption_clean) 
posts2$caption_clean <- gsub("ì\\S*","",posts2$caption_clean) 
posts2$caption_clean <- gsub("Õ\\S*","",posts2$caption_clean) 
posts2$caption_clean <- gsub("ã\\S*","",posts2$caption_clean) 
posts2$caption_clean <- gsub("Û\\S*","",posts2$caption_clean) 
posts2$caption_clean <- gsub("ë\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("ê\\S*","",posts2$caption_clean)
posts2$caption_clean <- gsub("è¿½\\S*","",posts2$caption_clean)

Does anyone know how I can simplify this?

Thanks!

Answer 1

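# put the single leading characters in a character class []
# and follow it with \\S* to match the rest of the non-space run;
# parentheses inside [] are literal characters, not groups, so for
# multi-character prefixes use alternation instead: (Ã|â|ð)\\S*

regex <- "[Ãâð]\\S*"
string <- "Ã  x â x ð y "
gsub(regex, "", string)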

Result:

[1] "  x  x  y "
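
To cover the full list from the question in one call, the prefixes can be joined into a single alternation. The following is only a sketch: it assumes a UTF-8 session, the helper name strip_mojibake and the toy captions vector are made up for illustration, and the prefixes are copied from the question's gsub() calls. Alternation is used rather than a character class because one prefix (è¿½) is three characters long.

```r
# Single-character prefixes copied from the question's gsub() calls.
lead_chars <- c("Ã", "â", "ð", "Â", "å", "Ð", "Ñ", "Ù", "Ø", "Ú",
                "ì", "Õ", "ã", "Û", "ë", "ê")
# Multi-character prefix from the question; it cannot live in a character class.
lead_seqs <- c("è¿½")

# Pattern: (any one prefix) followed by the rest of the non-space run.
pattern <- paste0("(", paste(c(lead_chars, lead_seqs), collapse = "|"), ")\\S*")

# Hypothetical helper: one vectorised gsub() replaces the repeated calls.
strip_mojibake <- function(x) gsub(pattern, "", x)

# Toy data standing in for posts2$caption_clean (an assumption, not real data).
captions <- c("nice Ã©clair day", "Ã  x â x ð y ", "plain text", "tail è¿½abc end")
cleaned <- strip_mojibake(captions)
print(cleaned)
```

On the question's data this would be applied as posts2$caption_clean <- strip_mojibake(posts2$caption_clean). Note that, like the original calls, it deletes the whole token from the first matching character onward, so legitimate words containing these letters are removed too.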

Answer 1
1 accepted 2018-12-14 09:23:28
