R: Perl Regex for unicode character string

Question

I am trying to remove some Unicode character strings that are scattered through my data.

Sample data <- "['oguma', 'makeup', u'\u0e27\u0e34\u0e15\u0e32\u0e21\u0e34\u0e19\u0e2b\u0e19\u0e49\u0e32\u0e40\u0e14\u0e47\u0e01', 'jeban',]"

I want to match everything that starts with u'\\, up to and including the comma at the end.

I was thinking of starting with:

gsub("u/\\/\'....

+ everything including the next comma, but I'm not sure how to say that second part.

For a result of:

Sample data <- "['oguma', 'makeup', 'jeban',]"

Any suggestions?

Answer 1

Here is a regex solution that removes substrings that start with u', continue with one or more non-ASCII characters, and end with the closing ', an optional comma (0 or 1), and optional whitespace (0 or more):

data <- "['oguma', 'makeup', u'\u0e27\u0e34\u0e15\u0e32\u0e21\u0e34\u0e19\u0e2b\u0e19\u0e49\u0e32\u0e40\u0e14\u0e47\u0e01', 'jeban',]"
gsub("u'[^[:ascii:]]+',?\\s*", "", data, perl = TRUE)
## => [1] "['oguma', 'makeup', 'jeban',]"
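
If the u'...' entries could also contain ASCII text, a somewhat broader pattern may help. The sketch below is an assumption rather than part of the solution above: it matches any Python-style u'...' literal and uses \b so that a word that merely ends in u (such as 'tofu') is left alone. The names strip_u_literals and x are hypothetical.

```r
# Hypothetical helper: drop every u'...' item, whatever it contains,
# together with an optional trailing comma and any whitespace after it.
strip_u_literals <- function(x) {
  # \b requires a word boundary before "u", so the "u'" inside 'tofu' is skipped
  gsub("\\bu'[^']*',?\\s*", "", x, perl = TRUE)
}

data <- "['oguma', 'makeup', u'\u0e27\u0e34\u0e15\u0e32\u0e21\u0e34\u0e19\u0e2b\u0e19\u0e49\u0e32\u0e40\u0e14\u0e47\u0e01', 'jeban',]"
cleaned <- strip_u_literals(data)
print(cleaned)

# Made-up input: an ASCII u'...' item next to a word ending in "u"
x <- "['tofu', u'abc', 'jeban']"
print(strip_u_literals(x))
```

The trade-off is that this version removes every u'...' item, not only those made of non-ASCII characters, so it only fits if that is what you want.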


Note that the \u0e27-like substrings in your example are escape sequences for non-ASCII characters, not literal backslashes. If you print the string, they are displayed as the corresponding letters and symbols (here, u'วิตามินหน้าเด็ก', Thai for "vitamins for kids" according to Google Translate).
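
To illustrate the point about escapes, here is a small sketch. The first part checks that "\u0e27" in R source is a single character. The second part covers an assumed case that is not in the question as posted: data read from a file where the escapes are literal text (a backslash, "u", and four hex digits), which would need a different pattern. The variable names are hypothetical.

```r
# "\u0e27" in R source is ONE character (Thai letter wo waen), not six.
ch <- "\u0e27"
one_char <- nchar(ch) == 1
same_code_point <- utf8ToInt(ch) == 0x0e27

# Assumed case: the escapes are literal text, so the string really
# contains backslashes ("\\" in R source is one backslash).
raw <- "['makeup', u'\\u0e27\\u0e34', 'jeban',]"
has_backslash <- grepl("\\", raw, fixed = TRUE)

# Match u' followed by one or more literal \uXXXX groups, the closing ',
# an optional comma, and optional whitespace.
cleaned_raw <- gsub("u'(?:\\\\u[0-9a-fA-F]{4})+',?\\s*", "", raw, perl = TRUE)
print(cleaned_raw)
```

Checking for a literal backslash first (as has_backslash does) is one way to decide which of the two patterns applies to your data.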
