R not reading text in from PDF

Question

I've been trying to read a folder of PDFs into R to make a corpus for a while now. I've used:

teleeos <- readtext("C:/Users/dklimkina/Desktop/Text Analysis Project/Corpus/Telehealth", encoding = "UTF-8")
directory <- "C:/Users/dklimkina/Desktop/Text Analysis Project/Corpus/Telehealth"
teleeos <- readtext(directory)

and

setwd("C:/Users/dklimkina/Desktop/Text Analysis Project/Corpus/Telehealth")
install.packages("pdftools")
library(pdftools)
# Escape the dot so only names ending in ".pdf" match
files <- list.files(pattern = "\\.pdf$")

and I've also changed my PDF types, but no matter what I do I keep getting this error: PDF error (63): Illegal character <29> in hex string. Any thoughts?

Answer 1

It would be worthwhile to isolate the file that is causing the problem and inspect it further. Without a reproducible example or access to the original files, we can't help you much further with that.
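One way to isolate the offending file is to read the PDFs one at a time and note which ones complain. The following is only a sketch: find_problem_pdfs is a hypothetical helper name, and it assumes the problem reaches R as an error, warning, or message. If the PDF library prints straight to the console instead, nothing will be captured and you would have to watch the console output file by file.

```r
# Read each file separately and collect anything the reader reports.
# `reader` defaults to pdftools::pdf_text (assumed installed); any
# function taking a file path can be substituted.
find_problem_pdfs <- function(files, reader = pdftools::pdf_text) {
  notes <- vapply(files, function(f) {
    seen <- character(0)
    tryCatch(
      withCallingHandlers(
        reader(f),
        warning = function(w) {
          seen <<- c(seen, conditionMessage(w))
          invokeRestart("muffleWarning")
        },
        message = function(m) {
          seen <<- c(seen, conditionMessage(m))
          invokeRestart("muffleMessage")
        }
      ),
      # A hard error stops this file only; the loop carries on.
      error = function(e) seen <<- c(seen, conditionMessage(e))
    )
    paste(trimws(seen), collapse = "; ")
  }, character(1))
  # Keep only the files that reported something, named by file path.
  notes[nzchar(notes)]
}
```

For example, find_problem_pdfs(list.files(directory, pattern = "\\.pdf$", full.names = TRUE)) would return a named character vector with one entry per file that reported a problem.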

First, try the readtext() call without the encoding = "UTF-8" argument.

You could also try an alternative tool, which might convert your files to text more reliably. Since I see you are using Windows, try this:

1. Download the Xpdf suite of tools for your platform. It includes the part you need, pdftotext.
2. Open Windows PowerShell ISE (Integrated Scripting Environment), found under Programs/Accessories, and run the following script, adjusting the path as required for your system.

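# Convert every PDF in the folder to a UTF-8 text file next to it.
# Assumes pdftotext is on your PATH; otherwise give its full path.
cd "C:/Users/dklimkina/Desktop/Text Analysis Project/Corpus/Telehealth"
$FILES = ls *.pdf
foreach ($f in $FILES) {
    # FullName passes the complete path, whatever the current directory is
    pdftotext -enc UTF-8 $f.FullName
}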

If that script fails and you manage to isolate the problem PDF, try running pdftotext problemfile.pdf on just that file and see whether it works.
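Once pdftotext has written its .txt files (by default it puts each one next to the source PDF), they still need to be brought back into R. A minimal sketch in base R follows; read_converted_text is a hypothetical helper name, and the doc_id/text column names are chosen only because corpus tools commonly expect that layout.

```r
# Read every .txt file in `dir` into a data frame with one row per document.
read_converted_text <- function(dir) {
  paths <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  texts <- vapply(paths, function(p) {
    # Join the lines of each file back into a single string
    paste(readLines(p, encoding = "UTF-8", warn = FALSE), collapse = "\n")
  }, character(1))
  data.frame(doc_id = basename(paths), text = unname(texts),
             stringsAsFactors = FALSE)
}
```

Calling read_converted_text(directory) would then give a data frame you can inspect or hand on to your corpus-building step.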
