
R not reading text in from PDF

I've been trying to read a folder of PDFs into R to make a corpus for a while now. I've used:

library(readtext)
teleeos <- readtext("C:/Users/dklimkina/Desktop/Text Analysis Project/Corpus/Telehealth", encoding = "UTF-8")
# and, without the encoding argument:
directory <- "C:/Users/dklimkina/Desktop/Text Analysis Project/Corpus/Telehealth"
teleeos <- readtext(directory)

and

setwd("C:/Users/dklimkina/Desktop/Text Analysis Project/Corpus/Telehealth")
install.packages("pdftools")
library(pdftools)
files <- list.files(pattern = "\\.pdf$")  # escape the dot so only names ending in ".pdf" match

I've also tried changing my PDF types, but no matter what I do I keep getting: PDF error (63): Illegal character <29> in hex string. Any thoughts?

It would be worthwhile to isolate the file that is causing the problem and inspect it further. Without a reproducible example or access to the original files, we cannot help you further with that.
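One way to isolate the problem file is to read the PDFs one at a time and record which ones fail or come back empty. The following is a minimal sketch, not a definitive diagnostic: `check_pdfs` is a hypothetical helper name, and it assumes the pdftools package from the question is installed (it is only used as the default reader). Note that messages such as "PDF error (63)" may be printed to the console without stopping R, in which case the file may still show up as `ok` with some text extracted, so the character count is worth looking at too.

```r
# Sketch: try each PDF separately and report which ones fail or yield no text.
# check_pdfs is a hypothetical helper; reader defaults to pdftools::pdf_text.
check_pdfs <- function(files, reader = pdftools::pdf_text) {
  results <- lapply(files, function(f) {
    tryCatch({
      pages <- reader(f)  # one character string per page
      data.frame(file = f, ok = TRUE,
                 chars = sum(nchar(pages, type = "bytes")),
                 message = "", stringsAsFactors = FALSE)
    }, error = function(e) {
      # Keep going after a failure and remember the error text
      data.frame(file = f, ok = FALSE, chars = 0L,
                 message = conditionMessage(e), stringsAsFactors = FALSE)
    })
  })
  do.call(rbind, results)
}

# Usage, with `directory` as defined in the question:
# files  <- list.files(directory, pattern = "\\.pdf$", full.names = TRUE)
# report <- check_pdfs(files)
# report[!report$ok | report$chars == 0, ]   # candidates for the problem file
```

Files that fail outright, or that return zero characters, are the ones to inspect by hand or to run through an alternate tool.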

First, try it without the encoding = "UTF-8" argument.

You could also try an alternate tool. Since I see you are using Windows, try this:

  1. Download the xpdf suite of tools for your platform. This includes the part you need, pdftotext.

  2. Open Windows PowerShell ISE (Integrated Scripting Environment), found under Programs/Accessories, and run the script below, adjusting the path as required for your system. It might convert your files to text better.

cd "C:/Users/dklimkina/Desktop/Text Analysis Project/Corpus/Telehealth"
$FILES = ls *.pdf
foreach ($f in $FILES) {
    pdftotext -enc UTF-8 $f
}

If that script fails and you have managed to isolate the problem PDF, try running pdftotext problemfile.pdf on that file alone and see whether that works.
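If pdftotext does produce the .txt files, they can then be read back into R instead of the PDFs. Here is a minimal sketch using only base R. `read_txt_folder` is a hypothetical helper name, and it assumes the .txt files sit in the same folder as the PDFs (where pdftotext writes them by default) and are UTF-8 encoded, as requested by -enc UTF-8 in the script above.

```r
# Sketch: read every .txt file in a folder into a named character vector,
# one element per document. read_txt_folder is a hypothetical helper.
read_txt_folder <- function(dir) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE,
                      ignore.case = TRUE)
  texts <- vapply(files, function(f) {
    # Join the lines of each file back into a single string
    paste(readLines(f, encoding = "UTF-8", warn = FALSE), collapse = "\n")
  }, character(1))
  names(texts) <- basename(files)
  texts
}

# Usage, with `directory` as defined in the question:
# texts <- read_txt_folder(directory)
# length(texts)   # number of documents read
```

The resulting named character vector can be passed to whichever corpus constructor you are using. Alternatively, pointing readtext at the converted .txt files may work as well.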
