
How to distinguish bigrams and merge them into one CSV file in R Studio

Okay, so I am trying to have R read sentences, pull out the bigrams, and then merge all of those bigrams into one csv. Right now I have a piece of code that extracts the bigrams from a single sentence:

sentence <- gsub('[[:punct:]]', '', sentence)   # strip punctuation
sentence <- gsub('[[:cntrl:]]', '', sentence)   # strip control characters
sentence <- gsub('\\d+', '', sentence)          # strip digits
sentence <- tolower(sentence)
words <- strsplit(sentence, "\\s+")[[1]]
New <- NULL
# seq_len(length(words) - 1) avoids the precedence trap in 1:length(words)-1,
# which evaluates as (1:length(words)) - 1 and starts the loop at 0
for (i in seq_len(length(words) - 1)) {
  New[i] <- paste(words[i], words[i + 1])
}
New <- as.matrix(New)
colnames(New) <- "Bigrams"

However, I want to be able to import a csv populated with different sentences, have the code above pull out the bigrams for each sentence, and then merge them all into one csv file. I started writing the code (below), but it is not right. I would appreciate any help I can get; I am new to natural language processing in R.

library(tm)
library(plyr)
library(stringr)
data<-read.csv("file.csv")
sentences=as.vector(data$text)

bigrams <- function(sentences){

  bigrams2 <- mlply(sentences, function(sentence){
    sentence <- gsub('[[:punct:]]', '', sentence)
    sentence <- gsub('[[:cntrl:]]', '', sentence)
    sentence <- gsub('\\d+', '', sentence)
    sentence <- tolower(sentence)
    words <- strsplit(sentence, "\\s+")[[1]]
    New <- NULL
    for (i in seq_len(length(words) - 1)) {
      New[i] <- paste(words[i], words[i + 1])
    }
    New <- as.matrix(New)
    colnames(New) <- "Bigrams"
    New
  })
  merge(bigrams2, all = TRUE)

}

Thanks!

This is not a direct answer, but you may find it simpler to use the built-in functionality of tm and RWeka:

library(RWeka)   # for NGramTokenizer(...)
library(tm)
# sample data
data <- data.frame(text=c("This is some text.",
                          "This is some other text.",
                          "This is some punctuation; and some more, and more...",
                          "These are some numbers: 1,2,3,4, five."))

doc  <- PlainTextDocument(data$text)
doc  <- removeNumbers(doc)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(Corpus(VectorSource(doc)), 
                          control = list(tokenize = BigramTokenizer))
result <- rownames(tdm)
result
#  [1] "and more"         "and some"         "are some"         "is some"         
#  [5] "more and"         "numbers five"     "other text"       "punctuation and" 
#  [9] "some more"        "some numbers"     "some other"       "some punctuation"
# [13] "some text"        "these are"        "this is"         
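If you also need per-sentence bigram counts rather than just the list of distinct bigrams, the term-document matrix built above can be converted to an ordinary matrix (a minimal sketch, reusing the tdm object from the code above):

m <- as.matrix(tdm)   # rows = bigrams, columns = documents
rowSums(m)            # corpus-wide frequency of each bigram

Each column of m corresponds to one input sentence, so row sums give the total count of each bigram across the whole corpus.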

EDIT: In response to the OP's comment.

So here is a way to do this without using NGramTokenizer from RWeka, using a modified version of your bigrams(...) function. Note that you have to remove the punctuation explicitly.

bigrams <- function(text){
  word.vec <- strsplit(text, "\\s+")[[1]]
  sapply(1:(length(word.vec)-1), function(x)paste(word.vec[x], word.vec[x+1]))
}
doc  <- PlainTextDocument(data$text)
doc  <- removeNumbers(doc)
doc  <- removePunctuation(doc)
tdm <- TermDocumentMatrix(Corpus(VectorSource(doc)), 
                          control = list(tokenize = bigrams))
result.2 <- rownames(tdm)

identical(result,result.2)
# [1] TRUE
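To finish the original task of merging everything into one CSV file, the combined bigram list can be written out with write.csv (a sketch; the output file name bigrams.csv is an assumption, not from the original post):

# one column of distinct bigrams, matching the "Bigrams" column name above
write.csv(data.frame(Bigrams = result.2), "bigrams.csv", row.names = FALSE)

Because the TermDocumentMatrix already pools the bigrams from every sentence, this single call replaces the merge(...) step in the question.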
