
How to distinguish bigrams and merge them into one CSV file in R Studio
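OK, so I'm trying to get R to read sentences, pull out the bigrams, and then merge all of those bigrams into one CSV. Right now I have a piece of code that extracts the bigrams from a single sentence: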
sentence=gsub('[[:punct:]]','', sentence)
sentence=gsub('[[:cntrl:]]','', sentence)
sentence=gsub('\\d+','', sentence)
sentence=tolower(sentence)
words<- strsplit(sentence, "\\s+")[[1]]
New=NULL
# seq_len() avoids the precedence trap in 1:length(words)-1, which is (1:n)-1
for(i in seq_len(length(words)-1)){
  New[i]=paste(words[i],words[i+1])
}
New=as.matrix(New)
colnames(New)<-"Bigrams"
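However, I'd like to be able to import a CSV filled with different sentences, have the code above extract the bigrams for each sentence, and then merge them all into one CSV file. I started writing the code (below), but it isn't right. I'd appreciate any help I can get. I'm new to natural language processing in R.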
library(tm)
library(plyr)
library(stringr)
data<-read.csv("file.csv")
sentences=as.vector(data$text)
bigrams<-function(sentences){
  bigrams2<-mlply(sentences,function(sentence){
    sentence=gsub('[[:punct:]]','', sentence)
    sentence=gsub('[[:cntrl:]]','', sentence)
    sentence=gsub('\\d+','', sentence)
    sentence=tolower(sentence)
    words<- strsplit(sentence, "\\s+")[[1]]
    New=NULL
    # seq_len() avoids the precedence trap in 1:length(words)-1, which is (1:n)-1
    for(i in seq_len(length(words)-1)){
      New[i]=paste(words[i],words[i+1])
    }
    New=as.matrix(New)
    colnames(New)<-"Bigrams"
    New
  })
  merge(bigrams2,all=TRUE)
}
Thanks!
This is not a direct answer, but you may find it simpler to use the built-in functionality of tm and RWeka:
library(RWeka) # for NGramTokenizer(...)
library(tm)
# sample data
data <- data.frame(text=c("This is some text.",
                          "This is some other text.",
                          "This is some punctuation; and some more, and more...",
                          "These are some numbers: 1,2,3,4, five."))
doc <- PlainTextDocument(data$text)
doc <- removeNumbers(doc)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(Corpus(VectorSource(doc)),
                          control = list(tokenize = BigramTokenizer))
result <- rownames(tdm)
result
# [1] "and more" "and some" "are some" "is some"
# [5] "more and" "numbers five" "other text" "punctuation and"
# [9] "some more" "some numbers" "some other" "some punctuation"
# [13] "some text" "these are" "this is"
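Edit, in response to the OP's comment.
So here is a way that does not use NGramTokenizer from RWeka. It uses a modified version of your bigrams(...) function. Note that you have to remove punctuation explicitly.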
bigrams <- function(text){
  word.vec <- strsplit(text, "\\s+")[[1]]
  sapply(1:(length(word.vec)-1), function(x)paste(word.vec[x], word.vec[x+1]))
}
doc <- PlainTextDocument(data$text)
doc <- removeNumbers(doc)
doc <- removePunctuation(doc)
tdm <- TermDocumentMatrix(Corpus(VectorSource(doc)),
                          control = list(tokenize = bigrams))
result.2 <- rownames(tdm)
identical(result,result.2)
# [1] TRUE
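If you would rather stay in base R and keep one row per bigram occurrence (the term-document matrix above only gives you the unique bigrams), something along these lines might work. This is only a sketch under a few assumptions: sentence_bigrams, the sentence_id column, and the output path are names made up for illustration, the inline data frame stands in for read.csv("file.csv"), and the cleaning steps simply mirror the ones in the question.

```r
# Hypothetical helper: returns the bigrams of one sentence as a character vector.
sentence_bigrams <- function(sentence) {
  sentence <- gsub("[[:punct:]]", "", sentence)  # strip punctuation
  sentence <- gsub("[[:cntrl:]]", "", sentence)  # strip control characters
  sentence <- gsub("\\d+", "", sentence)         # strip digits
  sentence <- tolower(trimws(sentence))
  words <- strsplit(sentence, "\\s+")[[1]]
  words <- words[nzchar(words)]                  # drop empty tokens
  if (length(words) < 2) return(character(0))    # no bigram possible
  paste(head(words, -1), tail(words, -1))        # pair each word with the next
}

# Stand-in for read.csv("file.csv"); the `text` column name comes from the question.
data <- data.frame(text = c("This is some text.",
                            "This is some other text.",
                            "One"),
                   stringsAsFactors = FALSE)

# One character vector of bigrams per sentence.
per_sentence <- lapply(as.character(data$text), sentence_bigrams)

# Flatten into a single data frame, remembering which sentence each bigram came from.
all_bigrams <- data.frame(
  sentence_id = rep(seq_along(per_sentence), lengths(per_sentence)),
  Bigrams = unlist(per_sentence),
  stringsAsFactors = FALSE
)

# Hypothetical output location; replace with wherever the CSV should go.
out_file <- file.path(tempdir(), "bigrams.csv")
write.csv(all_bigrams, out_file, row.names = FALSE)
```

Because each sentence yields a plain character vector, no merge step is needed: unlist() stacks them, and sentences with fewer than two words simply contribute no rows.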