
PDF scraping using R

I have been using the XML package successfully for extracting HTML tables, but want to extend this to PDFs. From previous questions it does not appear that there is a simple R solution, but I wondered whether there have been any recent developments.

Failing that, is there some way in Python (in which I am a complete novice) to obtain and manipulate PDFs so that I could finish the job off with the R XML package?

Extracting text from PDFs is hard, and nearly always requires lots of care.

I'd start with the command line tools such as pdftotext and see what they spit out. The problem is that PDFs can store the text in any order, can use awkward font encodings, and can do things like use ligature characters (the joined up 'ff' and 'ij' that you see in proper typesetting) to throw you.

pdftotext is installable on any Linux system...
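
Since the question asks about a Python route, here is a minimal sketch of that first step. It assumes pdftotext (from poppler-utils) is on your PATH; the function names pdf_to_text and expand_ligatures are made up for illustration. The -layout flag asks pdftotext to keep the physical layout, which tends to help with tables, and Unicode NFKC normalization turns ligature characters back into plain letters. Note that NFKC also rewrites other compatibility characters (superscripts, full-width forms), so check that this is acceptable for your data.

```python
import shutil
import subprocess
import unicodedata


def pdf_to_text(pdf_path, layout=True):
    """Run pdftotext on pdf_path and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep the physical layout of the page
    # "-" as the output file sends the text to stdout instead of a .txt file
    cmd += ["-enc", "UTF-8", pdf_path, "-"]
    result = subprocess.run(cmd, check=True, capture_output=True)
    return result.stdout.decode("utf-8", errors="replace")


def expand_ligatures(text):
    """Replace ligature characters (e.g. U+FB00 'ff') with their plain letters."""
    return unicodedata.normalize("NFKC", text)
```

Something like expand_ligatures(pdf_to_text("report.pdf")) would then give you text that is easier to search and split; "report.pdf" is just a placeholder file name.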

You might want to check out the text mining package tm. I recall that it implements so-called readers, and that there was also one for PDFs.

AFAIK there isn't an easy way of turning PDF tables into something useful for data analysis. You can use the Data Science Toolkit's File to Text utility (R interface via the RDSTK package), then parse the resulting text. Be warned: the parsing is often non-trivial.
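
To give a feel for what "parse the resulting text" can involve, here is a rough Python sketch. It assumes the extracted text keeps columns separated by runs of at least two spaces, which is not guaranteed for every PDF; the helper names split_columns, parse_table and rows_to_csv are invented for this example. The CSV output can then be loaded in R with read.csv.

```python
import csv
import io
import re


def split_columns(line):
    """Split one line of text into cells on runs of two or more whitespace characters."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def parse_table(text, min_columns=2):
    """Keep only the lines that look like table rows (at least min_columns cells)."""
    rows = []
    for line in text.splitlines():
        cells = split_columns(line)
        if len(cells) >= min_columns:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text, quoting cells that contain commas."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()
```

The weak spot is empty cells: a blank cell simply disappears, so the remaining values shift left. Real tables usually need extra checks, such as comparing character positions against the header row.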


EDIT: There's a useful discussion of converting PDFs to XML on discerning.com. The short answer is that you will probably need to buy a commercial tool.

The heart of the tabula application, which can extract tables from PDF documents, is available as a simple command-line Java application, tabula-extractor.

This Java app has been wrapped in R by the tabulizer package. Pass it the path to a PDF file and it will try to extract data tables for you and return them as R objects such as matrices or data frames.

For an example, see When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
