Reading text files in numeric order for a corpus from a directory in R
docs <- Corpus(DirSource(cname))
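I have a directory cname containing text files (1.txt, 2.txt, ..., 10.txt, 11.txt, ...), and I want to build the corpus in numeric order (1, 2, 3, ..., 10, 11, ...). However, the corpus is read in lexicographic order: 1, 10, 11, ..., 19, 2. How can I make sure the corpus reads the files in the directory in the order I need?
Thanks,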
Here is something worth trying.
# simulate your file structure - you have this already
txt <- c("This is some text.", "This is some more text.","This is additional text.","Yet more additional text.")
num <- c(1,2,10,20)
td <- tempdir() # temporary directory
# creates 4 files in temp dir: 1.txt, 2.txt, 10.txt, and 20.txt
mapply(function(x,y) writeLines(x,paste0(td,"/",y,".txt")),txt,num)
# you start here...
library(tm)
src <- DirSource(directory=td, pattern="\\.txt$") # pattern is a regex: escape the dot, anchor at the end
names(Corpus(src))
# [1] "1.txt" "10.txt" "2.txt" "20.txt"
src$filelist <- src$filelist[order(as.integer(gsub("^.*/([0-9]+)\\.txt$","\\1",src$filelist)))]
names(Corpus(src))
# [1] "1.txt" "2.txt" "10.txt" "20.txt"
# clean up: just for this example
unlink(paste(td,"*.*",sep="/")) # remove sample files...
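So DirSource(...) returns an object of class DirSource, which has an element $filelist. This is a vector of file names (in the order you don't want). The code above (should) extract the number that precedes ".txt" in each file name, convert it to an integer, and reorder $filelist by those integer values.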
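The reordering step can also be tried on its own, without tm, to check that it does what you expect. The following is a minimal sketch in base R; the paths in files are made up for illustration and simply stand in for what src$filelist might contain. It uses basename() and tools::file_path_sans_ext() instead of a regular expression, which is a substitute for the gsub() call above, not the same code.

```r
# Hypothetical file paths, in the lexicographic order a directory listing returns
files <- c("/tmp/docs/1.txt", "/tmp/docs/10.txt", "/tmp/docs/2.txt", "/tmp/docs/20.txt")

# Drop the directory and the extension, then convert what is left to an integer
file_numbers <- as.integer(tools::file_path_sans_ext(basename(files)))

# order() returns the permutation that sorts those integers in ascending order
sorted_files <- files[order(file_numbers)]

print(basename(sorted_files))
```

This assumes every file name is purely numeric before the extension. A name such as "notes.txt" would become NA (with a warning) and, with order()'s defaults, be placed last.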