I am trying to have R read sentences, pull out the bigrams, and merge all of them into one CSV file. Right now I have code that pulls out the bigrams for one sentence:
sentence=gsub('[[:punct:]]','', sentence)
sentence=gsub('[[:cntrl:]]','', sentence)
sentence=gsub('\\d+','', sentence)
sentence=tolower(sentence)
words<- strsplit(sentence, "\\s+")[[1]]
New=NULL
for(i in 1:(length(words)-1)){  # parentheses needed: 1:length(words)-1 is (1:length(words))-1
  New[i]=paste(words[i],words[i+1])
}
New=as.matrix(New)
colnames(New)<-"Bigrams"
However, I want to be able to import a CSV file filled with different sentences, have the code above pull out the bigrams for each sentence, and then merge them all into one CSV file. I started writing some code (below), but it is not right. I would greatly appreciate any help I can get. I am pretty new to natural language processing in R.
library(tm)
library(plyr)
library(stringr)
data<-read.csv("file.csv")
sentences=as.vector(data$text)
bigrams<-function(sentences){
  bigrams2<-mlply(sentences,function(sentence){
    sentence=gsub('[[:punct:]]','', sentence)
    sentence=gsub('[[:cntrl:]]','', sentence)
    sentence=gsub('\\d+','', sentence)
    sentence=tolower(sentence)
    words<- strsplit(sentence, "\\s+")[[1]]
    New=NULL
    for(i in 1:(length(words)-1)){  # parentheses needed: 1:length(words)-1 is (1:length(words))-1
      New[i]=paste(words[i],words[i+1])
    }
    New=as.matrix(New)
    colnames(New)<-"Bigrams"
    New
  })
  merge(bigrams2,all=TRUE)
}
Thanks!
Not a direct answer, but you might find it simpler to use the built-in functionality of tm and RWeka for this:
library(RWeka) # for NGramTokenizer(...)
library(tm)
# sample data
data <- data.frame(text=c("This is some text.",
                          "This is some other text.",
                          "This is some punctuation; and some more, and more...",
                          "These are some numbers: 1,2,3,4, five."))
doc <- PlainTextDocument(data$text)
doc <- removeNumbers(doc)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(Corpus(VectorSource(doc)),
                          control = list(tokenize = BigramTokenizer))
result <- rownames(tdm)
result
# [1] "and more" "and some" "are some" "is some"
# [5] "more and" "numbers five" "other text" "punctuation and"
# [9] "some more" "some numbers" "some other" "some punctuation"
# [13] "some text" "these are" "this is"
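Since the question asks for everything in one CSV file, here is a minimal sketch of that last step. It assumes the bigrams are already in a character vector; `bigram_vec` is a short stand-in for `result`, and `out_file` is a hypothetical temporary path, so swap in your own file name.

```r
# Stand-in for the `result` vector above (assumption: a plain character vector).
bigram_vec <- c("and more", "and some", "are some", "is some")

# Hypothetical output location; replace with e.g. "bigrams.csv".
out_file <- tempfile(fileext = ".csv")

# One column named "Bigrams", matching the column name used in the question.
write.csv(data.frame(Bigrams = bigram_vec), out_file, row.names = FALSE)

# Read it back to check the round trip.
written <- read.csv(out_file, stringsAsFactors = FALSE)
written$Bigrams
```

Setting `row.names = FALSE` keeps R from adding an extra index column to the file.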
EDIT: Response to OP's comment.
So here is a method that doesn't use the NGramTokenizer in RWeka. This uses a modified version of the bigrams(...) function here. Note that you have to explicitly remove punctuation.
bigrams <- function(text){
  word.vec <- strsplit(text, "\\s+")[[1]]
  sapply(1:(length(word.vec)-1), function(x)paste(word.vec[x], word.vec[x+1]))
}
doc <- PlainTextDocument(data$text)
doc <- removeNumbers(doc)
doc <- removePunctuation(doc)
tdm <- TermDocumentMatrix(Corpus(VectorSource(doc)),
                          control = list(tokenize = bigrams))
result.2 <- rownames(tdm)
identical(result,result.2)
# [1] TRUE
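If you would rather stay close to the loop in the question and avoid extra packages, the following base R sketch may help. It is only one possible approach: `sentence_bigrams` and `sentence_id` are hypothetical names, and the sample sentences stand in for `data$text`. Two things to watch for in loop versions: `1:length(words)-1` is parsed as `(1:length(words))-1`, and `1:(n-1)` counts backwards (`1, 0`) when a sentence has only one word, so the sketch pairs the words with `head()` and `tail()` instead of indexing.

```r
# Hypothetical helper: clean one sentence and return its bigrams.
sentence_bigrams <- function(sentence) {
  sentence <- gsub("[[:punct:]]", "", sentence)   # drop punctuation
  sentence <- gsub("[[:cntrl:]]", "", sentence)   # drop control characters
  sentence <- gsub("\\d+", "", sentence)          # drop digits
  sentence <- tolower(trimws(sentence))
  words <- strsplit(sentence, "\\s+")[[1]]
  words <- words[nzchar(words)]                   # discard empty tokens
  if (length(words) < 2) return(character(0))     # no bigram possible
  # Pair each word with its successor.
  paste(head(words, -1), tail(words, -1))
}

# Sample input (assumption: stands in for as.character(data$text)).
sentences <- c("This is some text.",
               "This is some other text.",
               "Numbers: 1,2,3,4, five.",
               "Hello")

per_sentence <- lapply(sentences, sentence_bigrams)

# Stack everything into one data frame, remembering the source sentence.
all_bigrams <- data.frame(
  sentence_id = rep(seq_along(per_sentence), lengths(per_sentence)),
  Bigrams     = as.character(unlist(per_sentence)),
  stringsAsFactors = FALSE
)
all_bigrams
```

Unlike `rownames(tdm)`, this keeps one row per occurrence, so repeated bigrams appear more than once; wrap the column in `unique()` if you only want the distinct ones.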