
Reading text files in numeric order into a corpus from a directory in R

docs <- Corpus(DirSource(cname))

I have a directory, cname, containing text files (1.txt, 2.txt, ..., 10.txt, 11.txt, ...) from which I want to create a corpus in numeric order (1, 2, 3, ..., 10, 11, ...). However, the corpus reads the files in lexicographic order (1, 10, 11, ..., 19, 2, ...). How can I make sure the corpus reads the files in the directory in the order I require?

Thanks,

Here's something to try.

# simulate your file structure - you have this already
txt <- c("This is some text.", "This is some more text.","This is additional text.","Yet more additional text.")
num <- c(1,2,10,20)
td  <- tempdir()     # temporary directory
# creates 4 files in temp dir: 1.txt, 2.txt, 10.txt, and 20.txt
mapply(function(x,y) writeLines(x,paste0(td,"/",y,".txt")),txt,num)

# you start here...
library(tm)
src <- DirSource(directory=td, pattern="\\.txt$")   # pattern is a regex: escape the dot, anchor at the end
names(Corpus(src))
# [1] "1.txt"  "10.txt" "2.txt"  "20.txt"
# pull the number out of each path, convert to integer, and reorder the file list by it
src$filelist <- src$filelist[order(as.integer(gsub("^.*/([0-9]+)\\.txt$","\\1",src$filelist)))]
names(Corpus(src))
# [1] "1.txt"  "2.txt"  "10.txt" "20.txt"

# clean up: just for this example
unlink(paste(td,"*.*",sep="/"))   # remove sample files...

So DirSource(...) returns an object of class DirSource, which has an element $filelist. This is a character vector of file paths, in lexicographic order (not the order you want). The code above should extract the file number preceding ".txt" from each path, convert it to an integer, and reorder $filelist by those integer values.
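
The reordering step can also be tried on its own, without tm, to check that the numeric sort does what you expect. The following is a minimal sketch in base R, assuming every file name is a plain integer followed by ".txt"; the helper name sort_numerically and the demo subdirectory are hypothetical and only serve the illustration.

```r
# Hypothetical helper (not part of tm): order file paths by the integer in their base names.
# Assumes names look like "<integer>.txt"; anything else becomes NA and sorts last.
sort_numerically <- function(paths) {
  nums <- as.integer(tools::file_path_sans_ext(basename(paths)))
  paths[order(nums)]
}

# Set up a small demo directory with 1.txt, 2.txt, 10.txt and 20.txt
td <- file.path(tempdir(), "numeric_order_demo")
dir.create(td, showWarnings = FALSE)
for (n in c(1, 2, 10, 20)) {
  writeLines(paste("Text", n), file.path(td, paste0(n, ".txt")))
}

# list.files() returns names in alphabetical order, like DirSource does
files  <- list.files(td, pattern = "\\.txt$", full.names = TRUE)
sorted <- sort_numerically(files)
basename(sorted)
# [1] "1.txt"  "2.txt"  "10.txt" "20.txt"

# clean up the demo directory
unlink(td, recursive = TRUE)
```

If this behaves as expected, the same helper could be applied to src$filelist in place of the gsub() call above. Using basename() avoids depending on "/" as the path separator, which may matter on Windows.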

