Going from corpus to individual .txt files in R's tm
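I have a .csv file with 6000 rows and 2 columns, and I want to write each row out as a separate text file. Any ideas on how to do this in tm? I tried writeCorpus(), but that function only spits out 150 .txt files instead of 6000. Is this a memory issue, or is something wrong with my code?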
library(tm)
revs<-read.csv("dprpfinals.csv",header=TRUE)
corp<-Corpus(VectorSource(revs$Review))
writeCorpus(corp,path=".",filenames=paste(seq_along(revs),".txt",sep=""))
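Here is an example that splits a text into paragraphs, removes empty lines, and writes each remaining line to its own text file. You will then need to process the text files.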
txt="Argument split will be coerced to character, so you will see uses with split = NULL to mean split = character(0), including in the examples below.
Note that splitting into single characters can be done via split = character(0) ; the two are equivalent. The definition of 'character’ here depends on the locale: in a single-byte locale it is a byte, and in a multi-byte locale it is the unit represented by a ‘wide character’ (almost always a Unicode code point).
A missing value of split does not split the corresponding element(s) of x at all."
# Split on newlines; strsplit() returns a list, so take its first element
txt2 <- strsplit(txt, "\n")[[1]]
# Drop empty lines
txt3 <- txt2[txt2 != ""]
# Write each paragraph to its own file: paragraph_1.txt, paragraph_2.txt, ...
for (ip in seq_along(txt3)) {
  fname <- paste0("paragraph_", ip, ".txt")
  writeLines(txt3[ip], fname)
}
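There is no need to use tm for this. Here is a reproducible example that makes a CSV file with 6000 rows and two columns, reads it in, and then converts it into 6000 txt files.
First, prepare some data for the example...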
# from http://hipsum.co/?paras=4&type=hipster-centric
txt <- "Brunch single-origin coffee photo booth, meggings fixie stumptown pickled mumblecore slow-carb aesthetic ennui Odd Future blog plaid Bushwick. Seitan keffiyeh hashtag Portland, kitsch irony authentic vegan post-ironic. Actually pop-up flexitarian kale chips ethical authentic, stumptown meggings. Photo booth Helvetica farm-to-table Neutra. Selfies blog swag, lomo viral meh chillwave distillery deep v Truffaut. Squid Cosby sweater irony, art party mustache Vice Wes Anderson Bushwick McSweeney's locavore roof party paleo. 3 wolf moon salvia gentrify, taxidermy street art banh mi Portland deep v small batch Truffaut."
# get n random samples of this paragraph
n <- 6000
txt_split <- unlist(strsplit(txt, split = " "))
txts <- sapply(1:n, function(i) paste(sample(txt_split, 10, replace = TRUE),
collapse = " "))
# make dataframe then CSV file, two cols, n rows.
my_csv <- data.frame( col_one = 1:n,
col_two = txts)
write.csv(my_csv, "my_csv.csv", row.names = FALSE, quote = TRUE)
Now that we have a CSV file that is probably similar to yours, we can read it in like this:
# Read in the CSV file...
x <- read.csv("my_csv.csv", header = TRUE, stringsAsFactors = FALSE)
Now we can write each row of the CSV file to a separate text file (they will appear in your working directory):
# Write each row of the CSV to a txt file
sapply(1:nrow(x), function(i) write.table(paste(x[i,], collapse = " "),
paste0("my_txt_", i, ".txt"),
col.names = FALSE, row.names = FALSE))
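As a side note, write.table() wraps character values in double quotes by default, so the files written above contain quoted text. A minimal sketch of an alternative using writeLines() follows; the small stand-in data frame, the temporary output directory, and the my_txt_ prefix are illustrative assumptions, not part of the original answer.

```r
# Small stand-in for the CSV data (hypothetical values)
x <- data.frame(col_one = 1:3,
                col_two = c("first review", "second review", "third review"),
                stringsAsFactors = FALSE)

# Write into a temporary directory so the working directory stays clean
out_dir <- file.path(tempdir(), "txt_out")
dir.create(out_dir, showWarnings = FALSE)

# One file per row; writeLines() writes the text without surrounding quotes
for (i in seq_len(nrow(x))) {
  row_text <- paste(x[i, ], collapse = " ")
  writeLines(row_text, file.path(out_dir, paste0("my_txt_", i, ".txt")))
}

# Number of .txt files created; should equal nrow(x)
n_files <- length(list.files(out_dir, pattern = "\\.txt$"))
```

Using an explicit for loop with seq_len(nrow(x)) also avoids the list of NULLs that sapply() would otherwise print at the console.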
If you really want to use tm, you are on the right track. This works fine for me:
# Read in the CSV file...
x <- read.csv("my_csv.csv", header = TRUE, stringsAsFactors = FALSE)
library(tm)
my_corpus <- Corpus(DataframeSource(x))
writeCorpus(my_corpus)
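One caveat, as far as I am aware: more recent tm releases (0.7-2 and later) expect the data frame passed to DataframeSource() to have a first column named doc_id and a second column named text, so the call above may fail there with columns named col_one and col_two. A sketch of the adjustment, assuming such a tm version is installed; the stand-in data and the temporary output directory are hypothetical:

```r
library(tm)

# Small stand-in for the CSV data (hypothetical values)
x <- data.frame(col_one = 1:3,
                col_two = c("first review", "second review", "third review"),
                stringsAsFactors = FALSE)

# Rename the columns to what newer tm versions require
docs <- data.frame(doc_id = as.character(x$col_one),
                   text = x$col_two,
                   stringsAsFactors = FALSE)

my_corpus <- VCorpus(DataframeSource(docs))

# Write one file per document into a temporary directory
out_dir <- file.path(tempdir(), "corpus_out")
dir.create(out_dir, showWarnings = FALSE)
writeCorpus(my_corpus, path = out_dir)

# Number of files written; should equal the number of documents
n_written <- length(list.files(out_dir))
```

VCorpus() is used here rather than Corpus() only to make the corpus type explicit; that choice is mine, not the original answer's.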
Something closer to your example also works fine for me:
# col_two holds the text, matching revs$Review in the question
corp <- Corpus(VectorSource(x$col_two))
writeCorpus(corp)
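If that does not work for you, the cause may be something unusual in your CSV file, such as odd characters. Without more details about your specific problem, it is hard to say.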
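One detail in the question's code may also be worth checking: seq_along() applied to a data frame counts its columns, not its rows, so the filenames vector built there has one entry per column. I cannot confirm whether this accounts for the 150 files, but a sketch with a hypothetical two-column data frame shows the difference:

```r
# Hypothetical stand-in for the question's data: 5 rows, 2 columns
revs <- data.frame(Id = 1:5,
                   Review = c("good", "bad", "fine", "great", "poor"),
                   stringsAsFactors = FALSE)

# A data frame is a list of columns, so this indexes the columns
col_index <- seq_along(revs)

# This indexes the rows instead
row_index <- seq_len(nrow(revs))

# One filename per row, suitable for the filenames argument of writeCorpus()
fnames <- paste0(row_index, ".txt")
```

With the row-based index, the filenames vector has the same length as a corpus built from revs$Review.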