
Removing proper English words from tweets in R

I'm working on Twitter data in R and am trying to remove all proper English words from each tweet. The idea is to look at the colloquial abbreviations, typos and slang used by a particular demographic whose tweets I have recorded.

Example:

    tweet <- c("Trying to find the solution frustrated af")

After the above operation, I would like to be left with only 'af'.

I thought of checking the tweets against a dictionary (which I would download), but there must be a simpler alternative. A solution in Python would also help.
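For reference, the dictionary idea could be sketched roughly as follows in base R. The `dictionary_words` vector is a tiny stand-in made up for illustration; in practice it would be read from whatever word list gets downloaded (for example with `readLines()`), and the tokenisation rule here is only one possible choice.

```r
# Stand-in word list (an assumption for this sketch). A real run would load
# a full downloaded dictionary instead, e.g. with readLines().
dictionary_words <- c("trying", "to", "find", "the", "solution", "frustrated")

tweet <- c("Trying to find the solution frustrated af")

# Lowercase the tweet, then split on anything that is not a letter or apostrophe
tokens <- strsplit(tolower(tweet), "[^a-z']+")[[1]]
tokens <- tokens[nzchar(tokens)]  # drop empty strings left by leading separators

# Keep only the tokens that do not appear in the dictionary
slang <- tokens[!tokens %in% dictionary_words]
print(slang)
# [1] "af"
```

Lowercasing before the lookup avoids treating "Trying" as unknown just because the word list is stored in lowercase.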

Another hunspell-based solution, using a fairly new and interesting package:

    # install.packages("hunspell") # uncomment & run if needed
    library(hunspell)
    tweet <- c("Trying to find the solution frustrated af")
    # split the tweet into tokens on single spaces
    ( tokens <- strsplit(tweet, " ")[[1]] )
    # [1] "Trying"     "to"         "find"       "the"        "solution"   "frustrated" "af"
    # keep only the tokens that fail the en_US spell check
    tokens[!hunspell_check(tokens, dict = "en_US")]
    # [1] "af"
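With more than one tweet, the same package's `hunspell()` function may be more convenient, since it tokenises the text itself and returns the unrecognised words for each element. The following is only a sketch: the second tweet is invented for illustration, and how the parser treats hashtags, handles or URLs has not been checked here.

```r
library(hunspell)

# Illustrative input; the second tweet is made up for this sketch
tweets <- c("Trying to find the solution frustrated af",
            "omg this is sooo good, thx!")

# hunspell() parses each element and returns a list holding one character
# vector of unrecognised words per tweet (the en_US dictionary is the default)
unknown <- hunspell(tweets)
names(unknown) <- tweets
print(unknown)
```

Note that `hunspell()` ignores a built-in set of statistics terms by default (its `ignore` argument), so a few technical words may be treated as known.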
