I want to do some text mining on data collected from Facebook, but I am having problems with the special/non-English characters in the text. The data looks like:
doc_id | text |
---|---|
001 | 'ð˜ð—¶ð˜€ ð˜ð—µð—² ð˜€ð—²ð—®ð˜€ |
002 | I expect a return to normalcy...That is Biden’s great |
003 | 'I’m facing a prison sentence |
What I want is to remove the words containing these "strange" characters. I tried to do this by using
str_replace_all(text, "[^[:alnum:]]", " ")
But this doesn't work in my case. Any ideas?
A general approach to this kind of task is to specify the characters you want to keep. It appears that [:alnum:] also matches Greek letters and letters with accents.
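A quick way to check this is to compare [:alnum:] with an explicit ASCII range. The snippet below is only a sketch; the sample strings are made up for illustration and are not taken from the question's data.

```r
library(stringr)

# Hypothetical sample strings: an accented word, Greek letters, plain ASCII
x <- c("caf\u00e9", "\u03b1\u03b2\u03b3", "abc123")

# TRUE where every character in the string is matched by [:alnum:]
only_alnum <- str_detect(x, "^[[:alnum:]]+$")

# TRUE where every character is an ASCII letter or digit
only_ascii <- str_detect(x, "^[a-zA-Z0-9]+$")

print(only_alnum)
print(only_ascii)
```

All three strings pass the [:alnum:] test, but only the last one passes the ASCII test, which is why negating [:alnum:] leaves the accented and Greek letters in place.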
Maybe this regex is more appropriate:
str_remove_all(x, "[^[\\da-zA-Z ]]")
[1] ""
[1] "I expect a return to normalcyThat is Bidens great"
[1] "Im facing a prison sentence"
I just replaced the [:alnum:] shortcut with \\da-zA-Z, added a space to the character class, and used the str_remove_all function instead of str_replace_all. Add any other characters you want to keep.
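Note that the regex above deletes individual characters, not whole words. If the goal is to drop every word that contains a non-ASCII character, as the question asks, one possible approach is to match the whole whitespace-delimited token. This is a sketch under assumptions: the sample strings are hypothetical, and a "word" is taken to be any run of non-whitespace characters.

```r
library(stringr)

# Hypothetical sample text; \u2019 is a typographic apostrophe
x <- c("I\u2019m facing a prison sentence", "na\u00efve plain text")

# \\S* ... \\S* extends the match to the whole token around any
# character outside the ASCII range (\\x00-\\x7F), so the full word is removed
no_strange_words <- str_remove_all(x, "\\S*[^\\x00-\\x7F]\\S*")

# Collapse the whitespace left behind by the removed words
cleaned <- str_squish(no_strange_words)

print(cleaned)
```

With these inputs, "I’m" and "naïve" are removed entirely while the plain ASCII words are kept. Punctuation attached to a word counts as part of the token here, so adjust the pattern if that is not what you want.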