
Corpus reading from pdf OR text in R

I have a large list of files that I want to read into R as a Corpus. All of the files were PDFs, but I recently realized that some of them will be .txt files.

Before I had the text files, I simply created a list of the PDF files in the directory and read them using the Corpus function with readerControl:

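library(tm)

getwd()  # confirm the working directory that list.files() will search
files <- list.files(pattern = "\\.pdf$")  # escape the dot so only the .pdf extension matches
corp <- Corpus(URISource(files),
               readerControl = list(reader = readPDF))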

I've tried to create a combined list of PDF and txt files, but I can't find a way to make readerControl use readPDF or readPlain depending on the file type:

files1 <- list.files(pattern = "\\.pdf$")
files2 <- list.files(pattern = "\\.txt$")
files <- c(files1, files2)

corp <- Corpus(URISource(files),
               readerControl = list(reader = c(readPDF,readPlain)))

Any ideas on how to solve this? I thought about merging two Corpus objects, one built with reader = readPDF and the other with reader = readPlain. But since I am new to text mining, I am not sure what the best practice is.
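The merging idea mentioned in the question can work. Below is a minimal sketch, assuming the tm package is loaded and, for the PDF branch, that pdftools is installed (tm's readPDF uses it by default). The helper name read_mixed_corpus is hypothetical and not part of tm; it builds one corpus per file type and concatenates them with c().

```r
library(tm)

# Hypothetical helper (not part of tm): read every .pdf and .txt file
# in `dir` and return them as a single VCorpus.
read_mixed_corpus <- function(dir = ".") {
  pdf_files <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE)
  txt_files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)

  parts <- list()
  if (length(pdf_files) > 0) {
    # mode = "" skips reading the binary file as text; readPDF opens it via its URI.
    parts[[length(parts) + 1]] <- VCorpus(
      URISource(pdf_files, mode = ""),
      readerControl = list(reader = readPDF)
    )
  }
  if (length(txt_files) > 0) {
    parts[[length(parts) + 1]] <- VCorpus(
      URISource(txt_files),
      readerControl = list(reader = readPlain)
    )
  }

  if (length(parts) == 0) stop("no .pdf or .txt files found in ", dir)
  if (length(parts) == 1) return(parts[[1]])

  # c() concatenates corpora of the same class into one corpus.
  do.call(c, parts)
}
```

It could then be called as corp <- read_mixed_corpus("path_to_your_files"), where the directory name is a placeholder.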

There is an easier way: use the readtext package. If your .txt and .pdf files are all in the same subdirectory (call it path_to_your_files/), you can read them all in with readtext() and then turn the result into a tm Corpus. readtext() automatically recognises different input file types and converts them into UTF-8 text for text analysis in R. (The rtext object created here is a special type of data.frame that includes a document identifier column, doc_id, and a column called text that contains the converted text contents of your input documents.)

rtext <- readtext::readtext("path_to_your_files/*")
# VectorSource() lives in tm too, so it needs the tm:: prefix unless tm is attached
corp <- tm::Corpus(tm::VectorSource(rtext[["text"]]))
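Because VectorSource() only sees the text column, the resulting documents are numbered rather than named after their files. One possible variant, sketched here under the assumption that the readtext result carries doc_id and text columns as described above, passes both columns to tm's DataframeSource(), which expects a data.frame whose first two columns have exactly those names. The demo directory and its two files are made up so the sketch runs on its own.

```r
library(tm)

# Demo input (hypothetical files) so the sketch is self-contained;
# replace demo_dir with your own directory.
demo_dir <- file.path(tempdir(), "readtext_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("first document", file.path(demo_dir, "a.txt"))
writeLines("second document", file.path(demo_dir, "b.txt"))

rtext <- readtext::readtext(file.path(demo_dir, "*"))

# Plain data.frame with the two columns DataframeSource() requires, in order.
docs <- data.frame(
  doc_id = rtext[["doc_id"]],
  text = rtext[["text"]],
  stringsAsFactors = FALSE
)

corp <- VCorpus(DataframeSource(docs))

# Each document's id now comes from doc_id instead of a running number.
ids <- vapply(content(corp), function(d) meta(d, "id"), character(1))
print(ids)
```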

readtext objects can also be passed directly to quanteda::corpus() if you want to try the quanteda package as an alternative to tm.
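As a rough illustration of the quanteda route, assuming both readtext and quanteda are installed, the sketch below writes two made-up text files to a temporary directory, reads them with readtext(), and hands the result to quanteda::corpus().

```r
# Demo input (hypothetical files) so the sketch is self-contained;
# replace q_dir with your own directory.
q_dir <- file.path(tempdir(), "quanteda_demo")
dir.create(q_dir, showWarnings = FALSE)
writeLines("first document", file.path(q_dir, "a.txt"))
writeLines("second document", file.path(q_dir, "b.txt"))

rtext <- readtext::readtext(file.path(q_dir, "*"))

# corpus() picks up the doc_id and text columns of the readtext object.
qcorp <- quanteda::corpus(rtext)

print(quanteda::ndoc(qcorp))      # number of documents
print(quanteda::docnames(qcorp))  # document names
```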
