Remove special/non-English characters from strings in R

Question

I want to do some text mining analysis on data collected from Facebook, but I am having problems with the special/non-English characters in the text. The data looks like:

doc_id	text
001	'ð˜ð—¶ð˜€ ð˜ð—µð—² ð˜€ð—²ð—®ð˜€
002	I expect a return to normalcy...That is Bidenâ€™s great
003	'Iâ€™m facing a prison sentence

What I want is to remove the words containing these "strange" characters. I tried to do this with:

str_replace_all(text, "[^[:alnum:]]", " ")

But this doesn't work in my case. Any ideas?

Answer 1

A general answer to this kind of task is to specify the characters you want to keep. It appears that [:alnum:] also matches Greek letters and letters with accents, so those characters are not replaced.
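To see this behaviour in isolation, here is a minimal sketch comparing the [:alnum:] class with an explicit ASCII range. The sample strings and the variable names (samples, with_alnum, with_ascii) are made up for illustration and are not the asker's data; the second sample imitates the "â€™" pattern from the question.

```r
library(stringr)

# Illustrative inputs, written with \u escapes:
#   1. a word with an accented e, three Greek letters, digits and "!"
#   2. "Biden" + a-circumflex, euro sign, trademark sign + "s"
samples <- c(
  "caf\u00e9 \u03b1\u03b2\u03b3 123!",
  "Biden\u00e2\u20ac\u2122s"
)

# [:alnum:] covers letters and digits from any script, so the accented
# and Greek letters (and the a-circumflex) are left in place.
with_alnum <- str_replace_all(samples, "[^[:alnum:]]", " ")

# An explicit ASCII range keeps only English letters and digits.
with_ascii <- str_replace_all(samples, "[^a-zA-Z0-9]", " ")

print(with_alnum)
print(with_ascii)
```

In with_alnum the non-English letters should survive, while with_ascii should contain only ASCII letters, digits and spaces.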

Maybe this regex is more appropriate:

str_remove_all(x, "[^[\\da-zA-Z ]]")

[1] ""

[1] "I expect a return to normalcyThat is Bidens great"

[1] "Im facing a prison sentence"

I just replaced the alpha shortcut with a-zA-Z, added a space to the set so that spaces are kept, and used the str_remove_all function instead of str_replace_all. Add any other characters you want to keep to the set.
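The regex above removes individual characters, which leaves fragments such as "Bidens" behind. If the goal is to drop the whole word, as the question asks, one possible variation is to remove every whitespace-delimited token that contains at least one non-ASCII character. This is only a sketch: the helper name drop_non_ascii_words is hypothetical, and it assumes that "strange" means "outside the ASCII range".

```r
library(stringr)

# Hypothetical helper: remove each whitespace-delimited token containing
# at least one character outside the ASCII range, then tidy the spacing.
drop_non_ascii_words <- function(x) {
  # \S* ... \S* extends the match to the whole token around the
  # offending character; [^\x01-\x7F] matches any non-ASCII character.
  x <- str_remove_all(x, "\\S*[^\\x01-\\x7F]\\S*")
  # Collapse the double spaces left behind and trim both ends.
  str_squish(x)
}

# Rows 002 and 003 from the question, written with \u escapes.
x <- c(
  "I expect a return to normalcy...That is Biden\u00e2\u20ac\u2122s great",
  "I\u00e2\u20ac\u2122m facing a prison sentence"
)

print(drop_non_ascii_words(x))
```

Unlike the character-level approach, this keeps ASCII punctuation such as "..." and removes "Bidenâ€™s" and "Iâ€™m" entirely. Whether that is preferable depends on how much text you can afford to lose.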

Answer 1: 2021-03-06 21:10:27
