Remove special/non-English characters from a string in R

Question

I want to do some text mining analysis on data I collected from Facebook, but I have problems with the special/non-English characters in the text. The data looks like:

doc_id	text
001	'ð˜ð—¶ð˜€ ð˜ð—µð—² ð˜€ð—²ð—®ð˜€
002	I expect a return to normalcy...That is Bidenâ€™s great
003	'Iâ€™m facing a prison sentence

What I want is to remove the words containing these "strange" characters. I tried to do this by using

str_replace_all(text, "[^[:alnum:]]", " ")

But this doesn't work in my case. Any ideas?

Answer 1

A general approach to this kind of task is to specify the characters you want to keep. It appears that [:alnum:] also matches Greek letters and letters with accents, so those survive your replacement.

Maybe this regex is more appropriate:

str_remove_all(x, "[^[\\da-zA-Z ]]")

[1] ""

[1] "I expect a return to normalcyThat is Bidens great"

[1] "Im facing a prison sentence"

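I replaced the [:alnum:] shortcut with \\d and a-zA-Z, added a space to the set, and used str_remove_all instead of str_replace_all. Add any other characters you want to keep to the set. Note that this removes individual characters, punctuation included, rather than whole words, which is why "normalcy...That" becomes "normalcyThat" above.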
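If you really want to drop the whole word rather than just the offending characters, as the question asks, something along these lines might work. This is only a sketch in base R: drop_non_ascii_words is a hypothetical helper name, and it assumes that "strange" means anything outside printable ASCII and that words are separated by whitespace.

```r
# Hypothetical helper (not part of stringr or base R): drop every
# whitespace-delimited word that contains a character outside printable ASCII.
drop_non_ascii_words <- function(x) {
  # Split each string into words on runs of whitespace
  words <- strsplit(x, "\\s+")
  vapply(words, function(w) {
    # Keep only words made up entirely of printable ASCII (0x20 to 0x7E)
    keep <- !grepl("[^\\x20-\\x7E]", w, perl = TRUE)
    paste(w[keep], collapse = " ")
  }, character(1))
}

# Sample strings written with \u escapes so the script itself stays ASCII
x <- c("I\u00e2\u20ac\u2122m facing a prison sentence",
       "That is Biden\u00e2\u20ac\u2122s great")

drop_non_ascii_words(x)
# [1] "facing a prison sentence" "That is great"
```

Because punctuation such as "..." is printable ASCII, it is kept here. Combine this with the character-class approach above if you also want punctuation removed.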

solution1
0 2021-03-06 21:10:27
