Removing words from a DocumentTermMatrix

My friend and I are working on transforming some tweets we collected into a DTM so that we can run a sentiment analysis using machine learning in R. The task must be done in R because it is for a university exam that requires R as the tool.

Initially we collected a smaller sample to test whether our code works before we start on a larger dataset. Our problem is that we can't figure out how to remove custom words from the DTM. Our code so far looks something like this (we are primarily using the tm package):

 file <- read.csv("Tmix.csv",
           row.names = NULL, sep=";", header=TRUE) #just for loading the dataset

tweetsCorpus <- Corpus(VectorSource(file[,1]))

tweetsDTM <- DocumentTermMatrix(tweetsCorpus,
                                control = list(verbose = TRUE,
                                               asPlain = TRUE,
                                               stopwords = TRUE,
                                               tolower = TRUE,
                                               removeNumbers = TRUE,
                                               stemWords = FALSE,
                                               removePunctuation = TRUE,
                                               removeSeparators = TRUE,
                                               removeTwitter = TRUE,
                                               stem = TRUE,
                                               stripWhitespace = TRUE, 
                                               removeWords = c("customword1", "customword2", "customword3")))

We've also tried removing the words before converting to a DTM, using the removeWords command together with all of the "removeXXX" commands in the tm package, and then building the DTM, but that doesn't seem to work either.

It is important that we don't simply remove all words with, e.g., 5 or fewer observations. We need all observations except the ones we want to remove, such as https addresses and the like.

Does anyone know how to do this?

And a second question: is there an easier way to remove all words that start with https, instead of having to write each address into the code individually? Right now, for instance, we are writing "httpstcokozcejeg", "httpstcolskjnyjyn", "httpstcolwwsxuem" as single custom words to remove from the data.

NOTE: We know that removeWords is a terrible solution to our problem, but we can't figure out how else to do it.

You can use regular expressions, for example:

gsub("http[a-z]*","","httpstcolwwsxuem here")
[1] " here"
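The pattern above targets URLs whose punctuation has already been stripped. If the raw tweet text is still available, the URLs can instead be matched before any cleaning. The following is a minimal base R sketch of that idea; the sample tweets and the variable names (tweets, noUrls, cleaned) are made up for illustration and are not from the original data.

```r
# Hypothetical sample tweets (not from the original data set)
tweets <- c("great match today https://t.co/kozcejeg",
            "no link in this one",
            "two links http://example.com and https://t.co/lwwsxuem end")

# Remove raw URLs: "http", an optional "s", "://", then every non-space character
noUrls <- gsub("https?://[^[:space:]]+", "", tweets)

# Collapse the leftover runs of whitespace and trim both ends
cleaned <- trimws(gsub("[[:space:]]+", " ", noUrls))

print(cleaned)
```

Because gsub is vectorised, the same call works on a whole column such as file[,1] before the corpus is built.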

Assuming that you removed punctuation/digits in tweetsCorpus, you can use the following:

1- Direct gsub

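# Cleans the text of the first document only; the result is stored under a
# new name so that the corpus itself is not overwritten
firstTweetClean <- gsub("http[a-z]*","",tweetsCorpus[[1]][[1]])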

OR

2- tm::tm_map, content_transformer

library(tm)

RemoveURL <- function(x){
        gsub("http[a-z]*","",x)
}

tweetsCorpus <- tm_map(tweetsCorpus, content_transformer(RemoveURL))
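For the custom words from the question, one possible approach is to clean the corpus with tm_map before building the DTM, or to drop the unwanted columns from an existing DTM by term name. The sketch below shows both under stated assumptions: the sample tweets, the customWords vector and the other variable names are hypothetical, and VCorpus is used here rather than Corpus.

```r
library(tm)

# Hypothetical sample tweets and custom words, for illustration only
tweets <- c("customword1 great match httpstcokozcejeg",
            "what a customword2 game httpstcolskjnyjyn",
            "nothing to remove here")
customWords <- c("customword1", "customword2")

# Approach A: clean the corpus first, then build the DTM
tweetsCorpus <- VCorpus(VectorSource(tweets))

# Lower-case first so the patterns below match regardless of original casing
tweetsCorpus <- tm_map(tweetsCorpus, content_transformer(tolower))

# Strip the punctuation-free URL remnants with the regex from the answer
removeURL <- content_transformer(function(x) gsub("http[a-z]*", "", x))
tweetsCorpus <- tm_map(tweetsCorpus, removeURL)

# Remove the custom words, then tidy up the whitespace they leave behind
tweetsCorpus <- tm_map(tweetsCorpus, removeWords, customWords)
tweetsCorpus <- tm_map(tweetsCorpus, stripWhitespace)

tweetsDTM <- DocumentTermMatrix(tweetsCorpus)

# Approach B: build the DTM from the raw text and drop columns by term name
rawDTM <- DocumentTermMatrix(VCorpus(VectorSource(tweets)))
terms <- Terms(rawDTM)
keep <- !(terms %in% customWords) & !grepl("^http", terms)
filteredDTM <- rawDTM[, keep]

print(Terms(tweetsDTM))
print(Terms(filteredDTM))
```

Approach B leaves the corpus untouched and removes only the named or pattern-matched terms, so low-frequency words that should stay in the analysis are not affected.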
