
Keep only alphanumeric characters and spaces in a string using gsub

I have a string that contains alphanumeric characters, special characters and non-UTF-8 characters. I want to strip the special and non-UTF-8 characters.

Here's what I've tried:

gsub('[^0-9a-z\\s]','',"�+ Sample string here =�{�>E�BH�P<]�{�>")

However, this removes the special characters (punctuation and non-UTF-8 characters), but the output has no spaces.

gsub('/[^0-9a-z\\s]/i','',"�+ Sample string here =�{�>E�BH�P<]�{�>")

The result has spaces, but the non-UTF-8 characters are still present.

Is there a workaround?

For the sample string above, output should be: Sample string here

You could use the classes [:alnum:] and [:space:] for this:

sample_string <- "�+ Sample 2 string here =�{�>E�BH�P<]�{�>"
gsub("[^[:alnum:][:space:]]","",sample_string)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"

Alternatively, you can use PCRE character properties such as \\p{L} (any letter) to refer to specific character sets:

gsub("[^\\p{L}0-9\\s]","",sample_string, perl = TRUE)
#> [1] "ï Sample 2 string here ïïEïBHïPïï"

Both results show that the characters that remain are treated as letters. The E, B, H and P in between are letters too, so the condition you are replacing on is not the right one. You don't want to keep all letters; you only want to keep A-Z, a-z and 0-9:

gsub("[^A-Za-z0-9 ]","",sample_string)
#> [1] " Sample 2 string here EBHP"
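
To make the difference reproducible without depending on how the garbled bytes were pasted, here is a small sketch that uses "\u00ef" (ï) as a stand-in for the stray characters. The stand-in and the variable names are assumptions, not part of the original data. It contrasts a Unicode-aware letter class with the explicit ASCII ranges:

```r
# Stand-in for the sample: "\u00ef" (ï) replaces the garbled bytes (an assumption).
x <- "\u00ef+ Sample 2 string here =\u00ef{\u00ef>E\u00efBH\u00efP<]\u00ef{\u00ef>"

# \p{L} matches any Unicode letter, so the accented letters survive the clean-up.
unicode_kept <- gsub("[^\\p{L}0-9 ]", "", x, perl = TRUE)

# Explicit ASCII ranges keep only A-Z, a-z, 0-9 and the space.
ascii_kept <- gsub("[^A-Za-z0-9 ]", "", x)

print(unicode_kept)
print(ascii_kept)
```

The second call drops the accented letters because they fall outside the three ranges that were spelled out.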

The ASCII-only result still contains the stray EBHP. If you really want to keep only one contiguous section of letters and numbers, reverse the logic: match the part you want and replace the whole string with it using a backreference:

gsub(".*?([A-Za-z0-9 ]+)\\s.*","\\1", sample_string)
#> [1] " Sample 2 string here "

Or, if you want to find the string even when it is not delimited by spaces, use the word boundary \\b instead:

gsub(".*?(\\b[A-Za-z0-9 ]+\\b).*","\\1", sample_string)
#> [1] "Sample 2 string here"

What happens here:

  • .*? matches any character (.) zero or more times (*), but lazily (?). This means that gsub will make this piece consume as little as possible.
  • Everything between ( and ) is captured and can be referred to in the replacement as \\1.
  • \\b indicates a word boundary.
  • The boundary is followed by one or more (+) characters that are A-Z, a-z, 0-9 or a space. The ranges have to be written out separately, because a single range such as A-z also covers whatever lies between the uppercase and lowercase letters: in ASCII that is [ \ ] ^ _ and the backtick, and depending on the locale it can include accented letters as well (which are valid UTF-8, by the way).
  • After that sequence, .* matches anything zero or more times to consume the rest of the string.
  • Because the pattern matches the whole string, replacing it with the backreference \\1 leaves only the captured part in the output.
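
The pieces above can be put together in a short sketch. It uses "\u00ef" (ï) as an assumed stand-in for the garbled bytes and passes perl = TRUE so that the lazy .*? is handled by PCRE. The regmatches() variant is an alternative way to pull out the first run of words, not the method described above:

```r
# Stand-in for the sample: "\u00ef" (ï) replaces the garbled bytes (an assumption).
x <- "\u00ef+ Sample 2 string here =\u00ef{\u00ef>E\u00efBH\u00efP<]\u00ef{\u00ef>"

# Backreference approach: skip lazily to the first word boundary,
# capture the run of allowed characters, and discard the rest.
by_backref <- sub("^.*?\\b([A-Za-z0-9 ]+)\\b.*$", "\\1", x, perl = TRUE)

# Extraction approach: return the first run of ASCII words
# that are separated by single spaces.
by_extract <- regmatches(x, regexpr("[A-Za-z0-9]+( [A-Za-z0-9]+)*", x))

print(by_backref)
print(by_extract)
```

Extracting the match directly avoids having to consume the rest of the string with .* on both sides.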

stringr uses a different regex engine (ICU) that supports POSIX-style character classes. Here ascii names the class; the name is written between colons inside square brackets, [:ascii:], and that in turn sits within the outer square brackets of the character class. The leading [^ negates the match.

library(stringr)
str_replace_all("�+ Sample string here =�{�>E�BH�P<]�{�>", "[^[:ascii:]]", "")

#> [1] "+ Sample string here ={>EBHP<]{>"
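
If adding stringr is not an option, a similar two-step clean-up can be sketched in base R: iconv() with sub = "" drops everything that cannot be represented in ASCII, and a second gsub() pass removes the remaining punctuation. As an assumption, "\u00ef" (ï) stands in for the garbled bytes, and the helper name keep_alnum_space is hypothetical:

```r
# Hypothetical helper: keep only ASCII letters, digits and spaces.
keep_alnum_space <- function(s) {
  # Drop every byte sequence that has no ASCII equivalent.
  ascii_only <- iconv(s, from = "UTF-8", to = "ASCII", sub = "")
  # Remove the remaining punctuation, then trim the outer whitespace.
  trimws(gsub("[^A-Za-z0-9 ]", "", ascii_only))
}

# Stand-in for the sample: "\u00ef" (ï) replaces the garbled bytes (an assumption).
x <- "\u00ef+ Sample 2 string here =\u00ef{\u00ef>E\u00efBH\u00efP<]\u00ef{\u00ef>"

ascii_step <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
print(ascii_step)
print(keep_alnum_space(x))
```

Like the ASCII-only gsub() call, this keeps the stray EBHP, so it suits cases where every ASCII letter should survive.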


 