
Scraping multiple pages of a PDF file from a URL

Original post: 2022-02-15 12:13:09. Tags: python, pandas

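I would like to scrape the information in this PDF with Python. I don't know where to start, because it isn't organised at all. I'm used to scraping HTML. I tried converting the PDF to HTML, but that didn't really help.

How would you go about scraping this PDF? Here is the link to the PDF (any of them will do, they are all similar): https://portal.charitycommissioner.je/Public-Register/ https://www.gov.im/media/1371147/publicindex_latest-15121 -v2.pdf

Thanks for your help :D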

1 Answer

It is organised - it's in a "table" - and pdfplumber is well suited to this.

Once your table settings correctly match the layout of your data, you can call .extract_table()

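import pdfplumber
import pandas as pd

# Open the locally saved PDF; the context manager closes the file afterwards.
with pdfplumber.open('file.pdf') as pdf:
    page = pdf.pages[0]  # first page only
    # Infer column boundaries from text alignment rather than ruled lines.
    # Assumption: newer pdfplumber releases may expect this setting to be
    # spelled text_keep_blank_chars instead of keep_blank_chars.
    table = page.extract_table(
        dict(vertical_strategy="text", keep_blank_chars=True)
    )

# Each extracted row becomes one DataFrame row.
df = pd.DataFrame(table)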
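
The code in this answer reads only the first page of a local file, while the question is about a multi-page PDF served from a URL. Here is a minimal sketch of one way to cover both steps. The helper names `download_pdf`, `merge_page_tables` and `extract_all_pages` are made up for illustration and are not part of pdfplumber. The rule that a page starting with the same row as the first page carries a repeated header is an assumption about this kind of register and may not hold for your file.

```python
import shutil
import urllib.request
from pathlib import Path


def download_pdf(url, dest):
    """Save the PDF at `url` to the local path `dest` and return that path."""
    dest = Path(dest)
    with urllib.request.urlopen(url) as response, dest.open("wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def merge_page_tables(page_tables):
    """Flatten a list of per-page tables into one list of rows.

    Pages where no table was found (None or empty) are skipped.
    Assumption: a later page whose first row equals the first row of the
    first table is repeating the header, so that row is dropped.
    """
    rows = []
    header = None
    for table in page_tables:
        if not table:
            continue
        if header is None:
            header = table[0]
            rows.extend(table)
        elif table[0] == header:
            rows.extend(table[1:])
        else:
            rows.extend(table)
    return rows


def extract_all_pages(pdf_path, settings):
    """Call extract_table() on every page of the PDF and merge the rows."""
    # Third-party import kept local so the helpers above work without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page_tables = [page.extract_table(settings) for page in pdf.pages]
    return merge_page_tables(page_tables)


# Example usage (PDF_URL is a placeholder for the real link):
#   import pandas as pd
#   path = download_pdf(PDF_URL, "file.pdf")
#   rows = extract_all_pages(path, {"vertical_strategy": "text"})
#   df = pd.DataFrame(rows)
```

It is probably worth tuning the settings on a single page first, as shown earlier, and only then running them over the whole document, since one settings dict is applied to every page here.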

