Trying to remove special characters and non-English words from my data in R
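I am trying to clean my data by removing: i) special characters (e.g. + _), ii) specific words (e.g. retweet, followers, impossible, better person), and iii) words that do not appear in an English dictionary. I am using the quanteda library. My goal is to get the top 50 bigrams and plot them on a graph.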
install.packages("textcat")
library(tm)
library(textcat)
library(quanteda)  # needed for corpus(), tokens(), dfm() and topfeatures() below
the_data <- read.csv("twitterData.csv")
tweets_data <- the_data$x
tweets_corpus <- Corpus(VectorSource(tweets_data))
subSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
twitterHandleRemover <- function(x) gsub("@\\S+","", x)
shortWordRemover <- function(x) gsub('\\b\\w{1,5}\\b','',x)
urlRemover <- function(x) gsub("http:[[:alnum:]]*","", x)
hashtagRemover <- function(x) gsub("#\\S+","", x)
tweets_corpus <- tm_map(tweets_corpus, subSpace, "/")
tweets_corpus <- tm_map(tweets_corpus, subSpace, "@")
tweets_corpus <- tm_map(tweets_corpus, subSpace, "\\|%&*#+_><")
tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower))
tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
tweets_corpus <- tm_map(tweets_corpus, content_transformer(urlRemover))
tweets_corpus <- tm_map(tweets_corpus, content_transformer(shortWordRemover))
tweets_corpus <- tm_map(tweets_corpus, content_transformer(twitterHandleRemover))
tweets_corpus <- tm_map(tweets_corpus, content_transformer(hashtagRemover))
tweets_corp <- corpus(tweets_corpus)
tweets_dfm <- tokens(tweets_corp, remove_numbers = TRUE, remove_hyphens = TRUE) %>%
  tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
  tokens_remove(stopwords("english"), padding = TRUE) %>%
  tokens_remove("\\d+", padding = TRUE) %>%
  tokens_ngrams(n = 2) %>%
  dfm()
topfeatures(tweets_dfm,50)
This is the output from my code:
I tried using
specialChars <- function(x) gsub("[^[:alnum:]///']","", x)
tweets_corpus <- tm_map(tweets_corpus, content_transformer(specialChars))
to remove the special characters, but it seems to remove all characters: the output is numeric(0).
Why not do something like this:
> x <- "je n'aime pas ça"
> Encoding(x)
[1] "latin1"
> iconv(x, from = "latin1", to = "ASCII//TRANSLIT")
[1] "je n'aime pas ca"
So, assuming your data is in latin1, the same conversion applies to your tweets:
iconv(tweets_data, from = "latin1", to = "ASCII//TRANSLIT")
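As a minimal sketch of applying that conversion to a whole character vector: the `tweets_sample` vector and the `to_ascii` helper below are made up for illustration, and the sample strings are assumed to be UTF-8 (hence `from = "UTF-8"`). Use `from = "latin1"` if that is what `Encoding()` reports for your data. Transliteration output can differ between platforms, so the helper also strips anything that is still outside printable ASCII.

```r
# Hypothetical sample standing in for tweets_data; \u escapes keep this source ASCII-only.
tweets_sample <- c("je n'aime pas \u00e7a", "caf\u00e9 au lait", "plain ascii")

# Assumption: the input is UTF-8; pass from = "latin1" if Encoding() says so.
to_ascii <- function(x, from = "UTF-8") {
  # sub = "" drops any character that cannot be converted instead of returning NA.
  out <- iconv(x, from = from, to = "ASCII//TRANSLIT", sub = "")
  # Transliteration varies by platform, so remove anything outside printable ASCII.
  gsub("[^\\x20-\\x7E]", "", out, perl = TRUE)
}

ascii_tweets <- to_ascii(tweets_sample)
```

Strings that are already plain ASCII pass through unchanged, so the helper should be safe to run over the full column before any tokenizing.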
Next, keep only alphanumeric characters and whitespace:
gsub(pattern = "[^[:alnum:][:space:]]", " ", "<friends @symbols")
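To see how this differs from the pattern tried in the question, and how the remaining steps (dropping specific words, dropping non-dictionary words, counting bigrams) could follow, here is a rough base R sketch. The sample tweets, `drop_words`, and the tiny `english_words` list are all invented for illustration. A real English word list would have to be supplied in place of `english_words`, and this is only one possible approach, not the quanteda pipeline from the question.

```r
# Hypothetical inputs: two made-up tweets, a custom drop list, and a tiny stand-in dictionary.
tweets_sample <- c("Good morning +_ friends, retweet this!",
                   "<friends @symbols make better zzxq mornings>")
drop_words <- c("retweet", "followers")
english_words <- c("good", "morning", "mornings", "friends", "this",
                   "symbols", "make", "better")

# The question's pattern has no [:space:] in the class and replaces with "",
# so spaces vanish and the whole tweet collapses into a single token.
collapsed <- gsub("[^[:alnum:]///']", "", tweets_sample[1])

# Keeping whitespace and substituting a space preserves the word boundaries.
clean <- gsub("[^[:alnum:][:space:]]", " ", tolower(tweets_sample))
clean <- trimws(gsub("[[:space:]]+", " ", clean))

# Split into words, then keep only dictionary words that are not on the drop list.
tokens <- strsplit(clean, " ", fixed = TRUE)
tokens <- lapply(tokens, function(w) w[w %in% english_words & !(w %in% drop_words)])

# Build bigrams within each tweet and count them across the sample.
bigrams <- unlist(lapply(tokens, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}))
top_bigrams <- head(sort(table(bigrams), decreasing = TRUE), 50)
```

With the question's pattern every tweet becomes one long token, so no bigrams can be formed, which would explain an empty result. Note that filtering words before building bigrams joins words that were not adjacent in the original text. If that matters, the padding approach used in the question's quanteda code (`padding = TRUE`) avoids it. `top_bigrams` is a named table, so something like `barplot(top_bigrams, las = 2)` can plot it.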