
Remove special/non-English characters from a string in R

I want to run a text mining analysis on data collected from Facebook, but I am having problems with the special/non-English characters in the text. The data looks like this:

doc_id text
001 'ð˜ð—¶ð˜€ ð˜ð—µð—² ð˜€ð—²ð—®ð˜€
002 I expect a return to normalcy...That is Biden’s great
003 'I’m facing a prison sentence

What I want is to remove the words containing these "strange" characters. I tried to do this by using

str_replace_all(text, "[^[:alnum:]]", " ")

But this doesn't work in my case. Any ideas?

A general approach to this kind of task is to specify the characters you want to keep. It appears that [:alnum:] also matches Greek letters and letters with accents, so those characters are not replaced.
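To see the difference, here is a small sketch comparing the two character classes. It assumes the stringr package, whose regex engine (ICU) treats [:alnum:] as Unicode-aware. The test characters are made up for illustration and written as Unicode escapes so the snippet does not depend on file encoding.

```r
library(stringr)

# Single-character test strings: accented e, Greek alpha, ASCII a, punctuation
chars <- c("\u00e9", "\u03b1", "a", "!")

# POSIX class as interpreted by stringr's ICU engine: Unicode-aware
posix_match <- str_detect(chars, "^[[:alnum:]]$")

# Explicit ASCII ranges: only unaccented Latin letters and digits 0-9
ascii_match <- str_detect(chars, "^[a-zA-Z0-9]$")

data.frame(char = chars, posix_match, ascii_match)
```

The accented and Greek letters should match the POSIX class but not the explicit ranges, which is why negating [:alnum:] leaves them in place.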

Maybe this regex is more appropriate:

str_remove_all(x, "[^[\\da-zA-Z ]]")

[1] ""

[1] "I expect a return to normalcyThat is Bidens great"

[1] "Im facing a prison sentence"

I replaced the [:alnum:] shortcut with an explicit \\d and a-zA-Z, added a space to the set, and used str_remove_all instead of str_replace_all. Add any other characters you want to keep inside the brackets.
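Note that the regex above strips individual characters, while the question asked to remove whole words containing them. A minimal sketch of that variant follows, again assuming stringr. The helper name remove_non_ascii_words and the sample strings are hypothetical, and "strange" is taken to mean any character outside the ASCII range.

```r
library(stringr)

# Hypothetical sample input: \u2019 is a curly apostrophe, \u00e9 an accented e
x <- c("I\u2019m facing a prison sentence", "caf\u00e9 opens at 9")

# Drop every whitespace-delimited token that contains at least one
# non-ASCII character, then collapse the leftover spaces
remove_non_ascii_words <- function(text) {
  cleaned <- str_remove_all(text, "\\S*[^\\x00-\\x7F]\\S*")
  str_squish(cleaned)
}

remove_non_ascii_words(x)
```

Whether to drop the whole word or only the offending character depends on the analysis: dropping words loses tokens such as "I’m", while dropping characters leaves fragments such as "Im".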


 