Create vocabulary of corpus from multiple txt files
I am playing with R and want to create a dictionary from txt files. I have 2 .txt files:
#1.txt
sky,
sun
#2.txt
blue,
bright
To load these 2 files into R, I am doing the following:
library(tm)
txt_files = list.files(pattern = '*.txt');
data = lapply(txt_files, read.table, sep = ",")
#here I get error
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 2 did not have 2 elements
In addition: Warning message:
In FUN(c("1.txt", "2.txt")[[1L]], ...) :
incomplete final line found by readTableHeader on '1.txt'
dict <- c(data)
#dict <- c("sky","blue","bright","sun") // original dictionary, want to replace this by above method
docs <- c(D1 = "The sky is blue.", D2 = "The sun is bright.", D3 = "The sun in the sky is bright.")
dd <- Corpus(VectorSource(docs))
dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf,dictionary = dict))
I then get the following error:
Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) :
'x' must be atomic
Can anyone tell me what I am doing wrong?
I don't think you should use read.table for irregular data files like these. Why not just use readLines() instead?
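library(tm)

# pattern is a regular expression, not a shell glob, so match the extension explicitly
txt_files <- list.files(pattern = "\\.txt$")
# read each file line by line; the result is a list of character vectors
data <- lapply(txt_files, readLines)
# flatten to a single character vector and strip the trailing comma from each line
dict <- gsub(",$", "", unlist(data))
docs <- c(D1 = "The sky is blue.", D2 = "The sun is bright.", D3 = "The sun in the sky is bright.")
dd <- Corpus(VectorSource(docs))
dtm <- DocumentTermMatrix(dd,
                          control = list(weighting = weightTfIdf, dictionary = dict))
inspect(dtm)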
Note that with this approach we have to remove the trailing commas ourselves, but that is simple enough.
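If the files might also contain several comma-separated terms on one line, stray whitespace, or blank lines, a slightly more defensive clean-up may help. The following is only a sketch in base R: `read_dictionary` is a hypothetical helper name, the temporary directory simply recreates the two example files from the question, and lowercasing the terms is an assumption about how you want them matched.

```r
# Recreate the two example files in a temporary directory (illustrative setup only).
tmp_dir <- tempfile("dict_")
dir.create(tmp_dir)
writeLines(c("sky,", "sun"), file.path(tmp_dir, "1.txt"))
writeLines(c("blue,", "bright"), file.path(tmp_dir, "2.txt"))

# Hypothetical helper: read every .txt file in `dir` and return a clean character vector.
read_dictionary <- function(dir) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  lines <- unlist(lapply(files, readLines, warn = FALSE))
  # Split on commas so that both "sky," and "sky,sun" yield individual terms.
  terms <- unlist(strsplit(lines, ",", fixed = TRUE))
  # Assumption: terms should be lowercase and free of surrounding whitespace.
  terms <- tolower(trimws(terms))
  # Drop empty strings and duplicates.
  unique(terms[nzchar(terms)])
}

dict <- read_dictionary(tmp_dir)
is.atomic(dict)  # TRUE: a plain character vector, not a list of data frames
print(dict)
```

The resulting `dict` can be passed as `dictionary = dict` in the `control` list exactly as above. Because it is an atomic character vector, it avoids the "'x' must be atomic" error that the list returned by `lapply(..., read.table)` triggers.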