
R: Perl regex for Unicode character strings

I am trying to get rid of some Unicode character strings scattered through my data.

data <- "['oguma', 'makeup', u'\u0e27\u0e34\u0e15\u0e32\u0e21\u0e34\u0e19\u0e2b\u0e19\u0e49\u0e32\u0e40\u0e14\u0e47\u0e01', 'jeban',]"

I want to match everything starting with u'\\ up to and including the comma at the end.

I was thinking of starting with:

gsub("u/\\/\'....

+ everything including the next comma, but I'm not sure how to say that second part.

For a result of:

data <- "['oguma', 'makeup', 'jeban',]"

Any suggestions?

Here is a regex solution that removes substrings that start with u', continue with one or more non-ASCII characters and a closing ', and end with an optional comma (0 or 1) and optional whitespace (0 or more):

data <- "['oguma', 'makeup', u'\u0e27\u0e34\u0e15\u0e32\u0e21\u0e34\u0e19\u0e2b\u0e19\u0e49\u0e32\u0e40\u0e14\u0e47\u0e01', 'jeban',]"
gsub("u'[^[:ascii:]]+',?\\s*", "", data, perl = TRUE)
## => [1] "['oguma', 'makeup', 'jeban',]"
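
As a minimal sketch, the same pattern can be wrapped in a small helper and applied to a whole character vector, since gsub is vectorised. The function name strip_unicode_items and the second sample string are assumptions made for illustration; they are not part of the original question.

```r
# Hypothetical helper (the name is an assumption): removes u'...' items whose
# content consists only of non-ASCII characters, plus a trailing comma and
# whitespace if present.
strip_unicode_items <- function(x) {
  # u'              literal prefix
  # [^[:ascii:]]+   one or more non-ASCII characters (PCRE class, needs perl = TRUE)
  # '               closing quote
  # ,?\\s*          optional comma, then optional whitespace
  gsub("u'[^[:ascii:]]+',?\\s*", "", x, perl = TRUE)
}

samples <- c(
  "['oguma', 'makeup', u'\u0e27\u0e34\u0e15\u0e32\u0e21\u0e34\u0e19\u0e2b\u0e19\u0e49\u0e32\u0e40\u0e14\u0e47\u0e01', 'jeban',]",
  "['a', u'\u0e01', 'b']"  # made-up second element
)

cleaned <- strip_unicode_items(samples)
print(cleaned)
```

Note that an item mixing ASCII and non-ASCII characters between the quotes is left untouched by this pattern, because the character class only accepts non-ASCII characters.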

See IDEONE demo

Note that the \u0e27-like substrings in your example are just escape sequences for non-ASCII characters. If you print the string, they are displayed as the letters/symbols themselves (here, u'วิตามินหน้าเด็ก', Thai for "vitamins for kids" according to Google Translate).
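
The difference between an escape in an R string literal and literal backslash text can be checked directly. The sketch below also covers the case where the data was read from a file and really contains the six characters \u0e27 as plain text. That case is an assumption, not something the question confirms, and the names raw and escaped_pattern are hypothetical.

```r
# Case 1: "\u0e27" inside an R string literal is an escape for ONE character.
ch <- "\u0e27"
nchar(ch)       # number of characters in the string
utf8ToInt(ch)   # code point as an integer (0x0e27)

# Case 2 (assumption): the backslashes are literal text, e.g. after reading
# the data from a file. "\\u0e27" is then six ASCII characters.
nchar("\\u0e27")
raw <- "['oguma', u'\\u0e27\\u0e34', 'jeban',]"

# Hypothetical pattern for case 2: u', one or more literal \uXXXX groups,
# a closing quote, an optional comma and optional whitespace.
escaped_pattern <- "u'(?:\\\\u[0-9a-fA-F]{4})+',?\\s*"
cleaned_raw <- gsub(escaped_pattern, "", raw, perl = TRUE)
print(cleaned_raw)
```

In case 2 the non-ASCII pattern from the answer would not match, because every character in the string is ASCII.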
