
How to remove these special characters in R from a set of strings: ’s, …

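I have this string containing special characters, and I cannot remove them from the main data frame. However, when I prepare a separate object, dft, and then use the following code, I am able to remove the special characters.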

dft <- "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar… rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy"

rmSpec <- "â|€|¦|â|€™|" # The "|" designates a logical OR in regular expressions.

s.rem <- gsub(rmSpec, "", dft) # gsub() replaces every match of rmSpec with "".
s.rem

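But when I use the same code on the main data frame, which holds the tweets as separate rows as shown below, it does not work and gives this error: Error in UseMethod("inspect", x) : no applicable method for 'inspect' applied to an object of class "character"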

[1] rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar…
[2] rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy
[3] rt bitshiba sending shib follow retweet tweet uufefufcd
[4] rt shibinform want shib listed robinhoodappuf yes yes yes ubufef ubufef ubufef
[5] rt shiblucky shib giveaway retweet follow

Could you please help me solve this? Thank you.

To keep only letters and digits, we could use:

library(stringr)
    
dft <- "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar… rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy"

str_replace_all(dft, "[^a-zA-Z0-9]", " ")
[1] "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar    rt askthedr just bought m usd worth shib think it   s robinhoodapp shibaarmy"
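
The same replacement can be applied to a whole data frame column, since gsub() is vectorised over its input. The following is a minimal sketch in base R; the data frame `tweets` and its column `text` are hypothetical stand-ins for the asker's main data frame, and the extra whitespace clean-up is an addition, not part of the answer above.

```r
# Hypothetical data frame standing in for the asker's main data frame.
# The escapes spell out the garbled sequences shown in the question.
tweets <- data.frame(
  text = c("shibainu shibar\u00e2\u20ac\u00a6",
           "think it\u00e2\u20ac\u2122s robinhoodapp"),
  stringsAsFactors = FALSE
)

# Replace everything that is not an ASCII letter or digit with a space ...
cleaned <- gsub("[^a-zA-Z0-9]", " ", tweets$text)

# ... then collapse runs of spaces and trim both ends.
tweets$text <- trimws(gsub(" +", " ", cleaned))

tweets$text
```

Note that this also turns the garbled apostrophe into a space, so the second row ends up containing "it s" rather than "its" or "it's".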

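As mentioned, the preferred approach is to read in the data with the correct encoding, but sometimes your data is simply corrupted. My FixEncoding function creates a lookup, in the form of a named vector, to fix this. It covers up to three rounds of wrong encoding, the most I have come across when dealing with badly stored old csv files. The idea is to take the Unicode characters and encode them wrongly on purpose, so that you can translate the errors back.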

FixEncoding <- function() {
  # create the unicode ranges from https://www.i18nqa.com/debug/utf8-debug.html
  range <- c(sprintf("%x", seq(strtoi("0xa0"), strtoi("0xff"))))
  unicode <- vapply(range, FUN.VALUE = character(1), function(x) { parse(text = paste0("'\\u00", x, "'"))[[1]] })
  # add the ones that are missing (red ones in https://www.i18nqa.com/debug/utf8-debug.html)
  unicode <- c(c("\u0168", "\u0152", "\u017d", "\u0153", "\u017e", "\u0178", "\u2019", "\u20ac", "\u201a", "\u0192", "\u201e", "\u2026", "\u2020", "\u2021", "\u02c6", "\u2030", "\u0160", "\u2030"), unicode)
  once <- vapply(unicode, FUN.VALUE = character(1), function(x) { 
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_once <- unicode
  names(fix_once) <- once
  twice <- vapply(once, FUN.VALUE = character(1), function(x) { 
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_twice <- unicode
  names(fix_twice) <- twice
  triple <- vapply(twice, FUN.VALUE = character(1), function(x) { 
    Encoding(x) <- "Windows-1252"
    iconv(x, to = "UTF-8")
  })
  fix_triple <- unicode
  names(fix_triple) <- triple
  fixes <- c(fix_triple, fix_twice, fix_once)
  return(fixes)
}

fixes <- FixEncoding()

str(fixes)

 Named chr [1:342] "U" "Œ" "Ž" "œ" "ž" "Ÿ" "’" "\200" "‚" "ƒ" "„" "…" "†" "‡" "\210" "‰" "Š" "‰" " " "¡" "¢" "£" "¤" "¥" "¦" "§" "¨" "©" "ª" "«" "¬" "­" "®" "¯" "°" "±" "²" "³" "´" ...
 - attr(*, "names")= chr [1:342] "Ãâ\200¦Ã‚¨" "Ãâ\200¦Ã¢â‚¬â„¢" "Ãâ\200¦Ã‚½" "Ãâ\200¦Ã¢â‚¬Å“" ...

dft <- "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar… rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy"

library(stringi)

stri_replace_all_fixed(dft, names(fixes), fixes, vectorize_all = FALSE)

[1] "rt shibxwarrior hodl trust processsome great things horizon folks shib shib shiba shibainu shibar… rt askthedr just bought m usd worth shib think it’s robinhoodapp shibaarmy"
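
For the preferred route mentioned earlier, reading the data with the right encoding in the first place, a sketch might look like the following. It assumes the source file is actually stored as UTF-8; the temporary file and its contents are made up for illustration.

```r
# Write a small UTF-8 file to stand in for the real data (hypothetical content).
path  <- tempfile(fileext = ".csv")
lines <- c("text", "shibainu shibar\u2026", "think it\u2019s robinhoodapp")
writeBin(charToRaw(enc2utf8(paste0(paste(lines, collapse = "\n"), "\n"))), path)

# Declare the file's encoding when reading, so the strings are marked as UTF-8
# instead of being interpreted in the platform's native encoding.
tweets <- read.csv(path, encoding = "UTF-8", stringsAsFactors = FALSE)

tweets$text
```

If the file is stored in some other encoding, read.csv() also accepts a `fileEncoding` argument that re-encodes the input while reading.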

A simple example of what happens

Let's take lélèlölã, which written as Unicode escapes is "\u006C\u00E9\u006C\u00E8\u006C\u00F6\u006C\u00E3"

messy <- "\u006C\u00E9\u006C\u00E8\u006C\u00F6\u006C\u00E3"

messy
# [1] "lélèlölã"

# This is what happens when a file gets corrupted by being saved in the wrong encoding; the fix function did the same thing, one Unicode character at a time.
Encoding(messy) <- "Windows-1252"
messy <- iconv(messy, to = "UTF-8")

messy
# [1] "lÃ©lÃ¨lÃ¶lÃ£" # once badly encoded
# [1] "lÃƒÂ©lÃƒÂ¨lÃƒÂ¶lÃƒÂ£" # twice badly encoded
# [1] "lÃƒÆ’Ã‚Â©lÃƒÆ’Ã‚Â¨lÃƒÆ’Ã‚Â¶lÃƒÆ’Ã‚Â£" # three times!

# All three strings above would be fixed
stri_replace_all_fixed(messy, names(fixes), fixes, vectorize_all = FALSE)
# [1] "lélèlölã"

# Other simple replacements, as suggested, would give us `llll` or `lÃlÃlÃlÃ` instead.
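
The mis-encoding step can also be reproduced, and undone, without depending on the platform's native encoding by passing the source encoding to iconv() explicitly. This is a different technique from the lookup vector above, sketched here under the assumption that the damage came from UTF-8 bytes being read as Windows-1252; the helper names `garble` and `ungarble` are made up for this example.

```r
# Simulate one round of mis-encoding: take the UTF-8 bytes and (wrongly)
# treat them as Windows-1252 while converting to UTF-8.
garble <- function(x) iconv(enc2utf8(x), from = "windows-1252", to = "UTF-8")

# Undo one round: convert back to Windows-1252 bytes, then declare those
# bytes to be what they really are, namely UTF-8.
ungarble <- function(x) {
  y <- iconv(x, from = "UTF-8", to = "windows-1252")
  Encoding(y) <- "UTF-8"
  y
}

clean <- "l\u00e9l\u00e8l\u00f6l\u00e3"
once  <- garble(clean)   # one round of damage
twice <- garble(once)    # two rounds of damage

ungarble(once) == clean
ungarble(ungarble(twice)) == clean
```

A round trip like this only works when every byte involved has a Windows-1252 mapping. A few byte values (such as 0x81 and 0x9D) are undefined in that code page, and for those iconv() may return NA, which is one reason a lookup table can still be useful.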
