Reading PDF text as String in R

install.packages("pdftools")  # one-time install
library("pdftools")

pdf.file <- "https://eparlib.nic.in/bitstream/123456789/809853/1/pms_16_17_07-02-2019_eng.pdf"
setwd("D:/Assignment 1/")

# mode = "wb" writes the file in binary mode so the PDF is not corrupted on Windows
download.file(pdf.file, destfile = "speech1.pdf", mode = "wb")

pdf.text <- pdftools::pdf_text("speech1.pdf")
cat(pdf.text[[2]])  # print the text of page 2
typeof(pdf.text)    # reports "character"

I want to read the text as strings instead of characters. I could not find a way to do this; whatever I tried, the result was always of type character.

The return value of pdftools::pdf_text is not a list but a character vector, which is R's string type; each element holds the text of one page:

> library(pdftools)
> x <- pdf_text("bla.pdf")
> typeof(x)
[1] "character"

Each element of that vector is the text of one whole page, so the index operator [[i]] returns a page, not a single character. If you want to extract individual characters, you must use substr:

> substr(x, 3, 3)
[1] "l"
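
To make the point concrete, here is a minimal sketch of handling such a result as strings. It assumes a small hand-written vector named pages standing in for the real pdf_text() output (the sample text is made up, not taken from the PDF in the question), and it uses only base R functions:

```r
# Stand-in for the result of pdftools::pdf_text(): one string per page.
# (Hypothetical sample text, not the contents of the real PDF.)
pages <- c("First page\nline two", "Second page")

typeof(pages)        # "character" is R's type for strings
length(pages)        # number of pages
is.character(pages)  # TRUE

# One whole page as a single string
page2 <- pages[[2]]
nchar(page2)         # number of characters in that string

# Collapse all pages into one single string
full_text <- paste(pages, collapse = "\n")

# Split a page into lines, or into individual characters
page_lines <- strsplit(pages[[1]], "\n", fixed = TRUE)[[1]]
page_chars <- strsplit(page2, "")[[1]]

# A single character by position
third_char <- substr(page2, 3, 3)
```

In other words, nothing needs to be converted: a value of type "character" already is a vector of strings, and paste, strsplit and substr cover joining, splitting and slicing it.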
