Creating a corpus dictionary from multiple txt files

Question

I'm playing around with R and want to create a dictionary from txt files. I have two .txt files:

#1.txt
 sky,
 sun

#2.txt
blue,
bright

To load these two files into R, I am doing the following:

library(tm)
txt_files = list.files(pattern = '*.txt');
data = lapply(txt_files, read.table, sep = ",")
 #here I get error
  Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
   line 2 did not have 2 elements
   In addition: Warning message:
   In FUN(c("1.txt", "2.txt")[[1L]], ...) :
   incomplete final line found by readTableHeader on '1.txt'
dict <- c(data)
#dict <- c("sky","blue","bright","sun") // original dictionary, want to replace this by above method
docs <- c(D1 = "The sky is blue.", D2 = "The sun is bright.", D3 = "The sun in the sky is bright.")
dd <- Corpus(VectorSource(docs))
dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf,dictionary = dict))

I get the following error:

Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) : 
'x' must be atomic

Can someone tell me what I am doing wrong?

Answer 1

I don't think you should use read.table for irregular data files like these. Why not just use readLines() instead?

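# list.files() takes a regular expression, so match the literal ".txt" suffix
txt_files <- list.files(pattern = "\\.txt$")
# readLines() returns one character vector per file
data <- lapply(txt_files, readLines)
# flatten to a single character vector and strip trailing commas
dict <- gsub(",$", "", unlist(data))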

docs <- c(D1 = "The sky is blue.", D2 = "The sun is bright.", D3 = "The sun in the sky is bright.")
dd <- Corpus(VectorSource(docs))
dtm <- DocumentTermMatrix(dd, 
    control = list(weighting = weightTfIdf,dictionary = dict)) 

inspect(dtm)
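
For context on the second error in the question: lapply() always returns a list, and c() applied to a list leaves it a list, so dict was never a plain character vector there. A minimal base-R sketch of the difference follows; the sample values are made up for illustration and stand in for whatever the files contain.

```r
# lapply() returns a list with one element per file; values are illustrative.
data <- list(c("sky,", "sun"), c("blue,", "bright"))

dict_list <- c(data)       # still a list of length 2
dict_vec  <- unlist(data)  # flattened character vector of length 4

is.atomic(dict_list)  # FALSE
is.atomic(dict_vec)   # TRUE

# sort() rejects non-atomic input; capture the error message instead of stopping.
res <- tryCatch(sort(dict_list), error = function(e) conditionMessage(e))
print(res)
```

This is why unlist() appears in the answer: it turns the per-file list into the atomic vector that a dictionary is expected to be.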

Note that with this approach we have to remove the trailing commas ourselves, but that is simple enough.
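
The clean-up step can be made a little more forgiving. The sketch below is only an illustration under a few assumptions: it recreates the two sample files in a temporary directory (the demo_dir name is hypothetical) so that it runs on its own, it anchors the file pattern as a regular expression, and it adds trimws() (available in base R since 3.2.0) in case lines carry stray whitespace such as the leading space shown in 1.txt.

```r
# Recreate the two sample files in a temporary directory (illustrative setup).
demo_dir <- file.path(tempdir(), "dict_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c(" sky,", " sun"), file.path(demo_dir, "1.txt"))
writeLines(c("blue,", "bright"), file.path(demo_dir, "2.txt"))

# 'pattern' is a regular expression, so anchor on the literal ".txt" suffix.
txt_files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# Read each file line by line, then flatten the list into one character vector.
raw_lines <- unlist(lapply(txt_files, readLines))

# Strip surrounding whitespace, then drop a trailing comma if present.
dict <- sub(",$", "", trimws(raw_lines))

# Discard empty entries and duplicates.
dict <- unique(dict[nzchar(dict)])

print(dict)
```

The resulting dict is a plain character vector and can be passed as the dictionary entry of the control list in the same way as above.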


Answer 1
1 Accepted 2014-06-04 19:46:07
