Removing proper English words from tweets in R
I'm working on Twitter data in R and am trying to remove all proper English words from each tweet. The idea is to look at the colloquial abbreviations, typos and slang used by a particular demographic whose tweets I have recorded.
Example:
tweet <- c("Trying to find the solution frustrated af")
After the above operation, I would like to be left with only 'af'.
I thought of washing the tweets against a dictionary (which I will download), but there must be a simpler alternative. Any solution in Python would also help.
Another hunspell-based solution, using a rather new and interesting package:
# install.packages("hunspell") # uncomment & run if needed
library(hunspell)
tweet <- c("Trying to find the solution frustrated af")
( tokens <- strsplit(tweet, " ")[[1]] )
# [1] "Trying" "to" "find" "the" "solution" "frustrated" "af"
tokens[!hunspell_check(tokens, dict = "en_US")]
# [1] "af"
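
Since the question also asks about Python, here is a minimal sketch of the "wash against a dictionary" idea. It is only an illustration: `ENGLISH_WORDS` is a tiny hypothetical stand-in for a real downloaded word list, and `non_dictionary_tokens` is a made-up helper name, not part of any library.

```python
import re

# Hypothetical stand-in for a real English word list. In practice you would
# load a downloaded dictionary file into a set instead.
ENGLISH_WORDS = {"trying", "to", "find", "the", "solution", "frustrated"}


def non_dictionary_tokens(tweet, dictionary=ENGLISH_WORDS):
    """Return the tokens of `tweet` that are not in `dictionary`, in order."""
    # Lowercase first, then keep runs of letters/apostrophes so that
    # punctuation and capitalisation do not cause false mismatches.
    tokens = re.findall(r"[a-z']+", tweet.lower())
    return [tok for tok in tokens if tok not in dictionary]


tweet = "Trying to find the solution frustrated af"
print(non_dictionary_tokens(tweet))  # ['af']
```

A set gives constant-time lookups, which matters once the dictionary has a few hundred thousand entries. Unlike hunspell, a plain word list knows nothing about inflected forms, so plurals or verb endings that are missing from the list would be kept as if they were slang.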