How to remove these special characters in R from a set of strings: â€™s, ...

Question

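I have this string containing special characters. I cannot remove these characters from my main data frame. However, when I put the text into a separate object, dft, and run the following code, the special characters are removed.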

dft <- "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibarâ€¦ rt askthedr just bought m usd worth shib think itâ€™s robinhoodapp shibaarmy"

rmSpec <- "â|€|¦|â|€™|" # The "|" designates a logical OR in regular expressions.

s.rem <- gsub(rmSpec, "", dft) # gsub() replaces every match of rmSpec with "".
s.rem

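But when I use the same code on the main data frame, which holds the tweets as separate rows as shown below, it does not work and gives this error: Error in UseMethod("inspect", x): no applicable method for 'inspect' applied to an object of class "character"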

[1] rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibarâ€¦
[2] rt askthedr just bought m usd worth shib think itâ€™s robinhoodapp shibaarmy
[3] rt bitshiba sending shib follow retweet tweet uufefufcd
[4] rt shibinform want shib listed robinhoodappuf yes yes yes ubufef ubufef ubufef
[5] rt shiblucky shib giveaway retweet follow

Could you please help me solve this problem? Thank you.

Answer 1

To keep only letters and digits, we may use:

library(stringr)
    
dft <- "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibarâ€¦ rt askthedr just bought m usd worth shib think itâ€™s robinhoodapp shibaarmy"

str_replace_all(dft, "[^a-zA-Z0-9]", " ")
[1] "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar    rt askthedr just bought m usd worth shib think it   s robinhoodapp shibaarmy"
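
Since the question is about a data frame rather than a single string, the same idea can be applied to a whole column. The following is only a sketch in base R: the data frame `tweets` and its columns `text` and `text_clean` are hypothetical names, and `gsub()` is used here as the base R counterpart of `str_replace_all()`. The garbled characters are written as `\u` escapes so the snippet does not depend on the encoding of the script file.

```r
# Hypothetical data frame standing in for the asker's main data frame.
tweets <- data.frame(
  text = c(
    "rt shibxwarrior hodl trust shibar\u00e2\u20ac\u00a6",            # ends in a garbled ellipsis
    "rt askthedr think it\u00e2\u20ac\u2122s robinhoodapp shibaarmy"  # contains a garbled apostrophe
  ),
  stringsAsFactors = FALSE
)

# Replace each run of characters that are not ASCII letters or digits
# with a single space, then trim leading and trailing whitespace.
cleaned <- gsub("[^a-zA-Z0-9]+", " ", tweets$text)
tweets$text_clean <- trimws(cleaned)

tweets$text_clean
```

Note that this approach discards the apostrophe as well, so "it's" becomes "it s", just as in the output above.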

Answer 2

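As mentioned, the preferred approach is to read the data in with the correct encoding, but sometimes your data is simply corrupted. My FixEncoding function creates a named lookup vector to solve this. It covers text that has been wrongly encoded up to 3 times, which I have run into in the past when dealing with badly stored old csv files. The idea is to take the Unicode characters and encode them wrongly on purpose, so that you can translate the errors back.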

FixEncoding <- function() {
  # create the unicode ranges from https://www.i18nqa.com/debug/utf8-debug.html
  range <- c(sprintf("%x", seq(strtoi("0xa0"), strtoi("0xff"))))
  unicode <- vapply(range, FUN.VALUE = character(1), function(x) { parse(text = paste0("'\\u00", x, "'"))[[1]] })
  # add the ones that are missing (red ones in https://www.i18nqa.com/debug/utf8-debug.html)
  unicode <- c(c("\u0168", "\u0152", "\u017d", "\u0153", "\u017e", "\u0178", "\u2019", "\u20ac", "\u201a", "\u0192", "\u201e", "\u2026", "\u2020", "\u2021", "\u02c6", "\u2030", "\u0160", "\u2030"), unicode)
  once <- vapply(unicode, FUN.VALUE = character(1), function(x) { 
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_once <- unicode
  names(fix_once) <- once
  twice <- vapply(once, FUN.VALUE = character(1), function(x) { 
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_twice <- unicode
  names(fix_twice) <- twice
  triple <- vapply(twice, FUN.VALUE = character(1), function(x) { 
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_triple <- unicode
  names(fix_triple) <- triple
  fixes <- c(fix_triple, fix_twice, fix_once)
  return(fixes)
}

fixes <- FixEncoding()

str(fixes)

 Named chr [1:342] "U" "Œ" "Ž" "œ" "ž" "Ÿ" "’" "\200" "‚" "ƒ" "„" "…" "†" "‡" "\210" "‰" "Š" "‰" " " "¡" "¢" "£" "¤" "¥" "¦" "§" "¨" "©" "ª" "«" "¬" "" "®" "¯" "°" "±" "²" "³" "´" ...
 - attr(*, "names")= chr [1:342] "Ãƒâ\200¦Ã‚Â¨" "Ãƒâ\200¦Ã¢â‚¬â„¢" "Ãƒâ\200¦Ã‚Â½" "Ãƒâ\200¦Ã¢â‚¬Å“" ...

dft <- "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibarâ€¦ rt askthedr just bought m usd worth shib think itâ€™s robinhoodapp shibaarmy"

library(stringi) # provides stri_replace_all_fixed()

stri_replace_all_fixed(dft, names(fixes), fixes, vectorize_all = FALSE)

[1] "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar… rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy"
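
To relate this back to the data frame in the question, here is a minimal sketch of the same lookup idea in base R, applied to a column. It is not the answer's function: the two-entry `lookup`, the data frame `tweets`, and the helper `fix_column()` are hypothetical names introduced for illustration, and the lookup only covers the two garbled sequences that appear in the question.

```r
# Hypothetical two-entry lookup: values are the intended characters,
# names are the garbled sequences they were turned into.
lookup <- c("\u2019", "\u2026")  # right single quotation mark, ellipsis
names(lookup) <- c("\u00e2\u20ac\u2122", "\u00e2\u20ac\u00a6")

# Hypothetical data frame standing in for the main data frame of tweets.
tweets <- data.frame(
  text = c(
    "shiba shibainu shibar\u00e2\u20ac\u00a6",
    "think it\u00e2\u20ac\u2122s robinhoodapp shibaarmy"
  ),
  stringsAsFactors = FALSE
)

# Apply each literal (fixed = TRUE) replacement in turn.
# gsub() is vectorised over x, so the whole column is handled at once.
fix_column <- function(x, lookup) {
  for (i in seq_along(lookup)) {
    x <- gsub(names(lookup)[i], lookup[[i]], x, fixed = TRUE)
  }
  x
}

tweets$text <- fix_column(tweets$text, lookup)
tweets$text
```

Applying the replacements one after another mirrors what `vectorize_all = FALSE` does in the call above. The full `fixes` vector could be passed in place of the small `lookup`.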

A simple example of what happens

Let's take lélèlölã, which written as Unicode escapes is "\u006C\u00E9\u006C\u00E8\u006C\u00F6\u006C\u00E3"

messy <- "\u006C\u00E9\u006C\u00E8\u006C\u00F6\u006C\u00E3"

messy
# [1] "lélèlölã"

# This is what happens when files get corrupted by being saved in the wrong encoding; my fix function does the same thing, one Unicode character at a time.
Encoding(messy) <- "Windows-1252"
messy <- iconv(messy, to = "UTF-8")

messy
# [1] "lÃ©lÃ¨lÃ¶lÃ£" # once badly encoded
# [1] "lÃƒÂ©lÃƒÂ¨lÃƒÂ¶lÃƒÂ£" # twice badly encoded
# [1] "lÃƒÆ’Ã‚Â©lÃƒÆ’Ã‚Â¨lÃƒÆ’Ã‚Â¶lÃƒÆ’Ã‚Â£" # three times!

# All three strings above would be fixed
stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = FALSE)
# [1] "lélèlölã"

# The simpler replacements suggested elsewhere would presumably give us `llll` or `lÃlÃlÃlÃ` instead.
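
The corruption shown above can also be reproduced, and undone, without relying on the platform's native encoding. The sketch below is an alternative to the lookup vector, not the answer's method. It assumes the text was mis-decoded as Windows-1252 and that the platform's iconv knows the name "Windows-1252". The helpers `garble()` and `ungarble()` are hypothetical names.

```r
clean <- "l\u00e9l\u00e8l\u00f6l\u00e3"  # "lélèlölã"

# Corrupt: treat the UTF-8 bytes as if they were Windows-1252 and
# convert them to UTF-8, which is one round of bad encoding.
garble <- function(x) {
  iconv(enc2utf8(x), from = "Windows-1252", to = "UTF-8")
}

# Repair: convert back to Windows-1252, which restores the original
# UTF-8 bytes, then declare those bytes to be UTF-8.
ungarble <- function(x) {
  y <- iconv(enc2utf8(x), from = "UTF-8", to = "Windows-1252")
  Encoding(y) <- "UTF-8"
  y
}

once  <- garble(clean)  # once badly encoded
twice <- garble(once)   # twice badly encoded

ungarble(once)             # one round undone
ungarble(ungarble(twice))  # two rounds undone

# The garbled apostrophe from the question, written with escapes.
ungarble("it\u00e2\u20ac\u2122s")
```

`iconv()` returns NA when a string cannot be converted, for instance when it contains characters that do not exist in Windows-1252, so this round-trip only works if every string was garbled the same number of times. The lookup vector is more forgiving with mixed data.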

Answer 1
0 2021-12-09 14:33:07

Answer 2
0 2021-12-09 15:05:38
