Scraping multi-page PDF files from a URL

Question

I want to scrape the information in this PDF with Python. I'm not sure where to start because it doesn't look organized at all. I'm used to scraping HTML. I tried converting the PDF to HTML, and that didn't really help.

How would you try to scrape this PDF? Here is a link to the PDFs (any will work, they're all similar): https://portal.charitycommissioner.je/Public-Register/ https://www.gov.im/media/1371147/publicindex_latest-15121-v2.pdf

Thank you for any help :D

Answer 1

It is organized: the data is laid out as a "table", and pdfplumber works well for this.

Once you have table settings that correctly match your data, you can call .extract_table() on the page:

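import pdfplumber
import pandas as pd

# Open the PDF; the context manager closes the file when done
with pdfplumber.open('file.pdf') as pdf:
    page = pdf.pages[0]  # first page only
    # Infer column boundaries from text alignment instead of ruled lines
    table = page.extract_table(
        dict(vertical_strategy="text", keep_blank_chars=True)
    )

# One DataFrame row per extracted table row
df = pd.DataFrame(table)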
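
The snippet above reads only the first page of a local file. Since the question is about multi-page PDFs behind a URL, here is a rough sketch of how that could be extended. The function names `merge_page_tables` and `extract_rows_from_url` are made up for illustration, and the sketch assumes the same table settings work on every page and that the first row of the first table is a header that may repeat on later pages. Neither assumption has been checked against the linked PDFs.

```python
import io
import urllib.request

# Table settings copied from the answer above; tune them for your PDF.
TABLE_SETTINGS = {"vertical_strategy": "text", "keep_blank_chars": True}


def merge_page_tables(page_tables):
    """Join per-page tables into one list of rows.

    Pages where no table was found (None or empty) are skipped, and a
    first row that repeats the header of the first table is dropped.
    """
    rows = []
    header = None
    for table in page_tables:
        if not table:
            continue
        if header is None:
            # Assumption: the first row of the first table is the header.
            header = table[0]
            rows.extend(table)
        else:
            # Skip the first row only when it duplicates the header.
            start = 1 if table[0] == header else 0
            rows.extend(table[start:])
    return rows


def extract_rows_from_url(url, table_settings=TABLE_SETTINGS):
    """Download a PDF into memory and run extract_table() on every page."""
    # Third-party import kept local so merge_page_tables works without it.
    import pdfplumber

    with urllib.request.urlopen(url) as response:
        data = response.read()

    # pdfplumber.open() accepts a file-like object as well as a path.
    with pdfplumber.open(io.BytesIO(data)) as pdf:
        page_tables = [page.extract_table(table_settings) for page in pdf.pages]
    return merge_page_tables(page_tables)
```

Calling `extract_rows_from_url(...)` with one of the PDF links from the question should give a list of rows that can be passed to `pd.DataFrame(rows)` as before. If the server rejects the default urllib request, downloading the file manually and passing the path to `pdfplumber.open()` is a simple fallback.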
