
Scraping multi-page PDF files from a URL

I want to scrape the information in this PDF with Python. I'm not sure where to start because it doesn't seem organized at all. I'm used to scraping HTML. I tried converting the PDF to HTML, but that didn't really help.

How would you try to scrape this PDF? Here is a link to the PDFs (any will work, they're all similar): https://portal.charitycommissioner.je/Public-Register/ https://www.gov.im/media/1371147/publicindex_latest-15121-v2.pdf

Thank you for any help :D

It is organized - it's in a "table" - pdfplumber works well for this.

pdfplumber example

Once you have table settings that correctly match your data, you can call .extract_table():

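import pdfplumber
import pandas as pd

# Open the PDF and extract the table from the first page.
with pdfplumber.open('file.pdf') as pdf:
    page = pdf.pages[0]
    # Infer column boundaries from text alignment rather than ruled lines.
    # Note: newer pdfplumber releases may expect text_keep_blank_chars instead.
    table = page.extract_table(
        dict(vertical_strategy="text", keep_blank_chars=True)
    )

# One DataFrame row per extracted table row.
df = pd.DataFrame(table)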
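
The snippet above reads only the first page of a local file. Since the question asks about multiple pages fetched from a URL, here is a rough sketch of one way to extend it. The names combine_tables and extract_rows_from_url are hypothetical helpers, not part of pdfplumber. The sketch also assumes that the first row on the first page is a header that may repeat on later pages, which may not hold for every PDF.

```python
import io
import urllib.request

# Same settings as in the answer above; they may need tuning for other PDFs.
# Assumption: newer pdfplumber releases may want "text_keep_blank_chars".
TABLE_SETTINGS = {"vertical_strategy": "text", "keep_blank_chars": True}


def combine_tables(tables):
    """Merge per-page tables (lists of rows) into a single list of rows.

    Pages where no table was found (None or empty) are skipped, and rows
    identical to the first row seen (assumed to be the header) are dropped.
    """
    rows = []
    header = None
    for table in tables:
        if not table:
            continue
        for row in table:
            if header is None:
                header = row  # first row seen is treated as the header
            elif row == header:
                continue  # skip a header repeated on a later page
            rows.append(row)
    return rows


def extract_rows_from_url(url, settings=TABLE_SETTINGS):
    """Download a PDF and return the combined table rows of all its pages."""
    # Third-party import kept local so combine_tables works without it.
    import pdfplumber

    with urllib.request.urlopen(url) as response:
        data = response.read()
    # pdfplumber.open accepts a file-like object as well as a path.
    with pdfplumber.open(io.BytesIO(data)) as pdf:
        tables = [page.extract_table(settings) for page in pdf.pages]
    return combine_tables(tables)
```

You would call extract_rows_from_url with one of the PDF links from the question, then build the frame with pd.DataFrame(rows[1:], columns=rows[0]). Some servers reject requests that lack a browser-like User-Agent header, so the download step may need adjusting.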
