How do I extract several pages of a PDF?

Question

My code looks like this:

df = tabula.read_pdf('test.pdf', pages = ['all'])[0]

df.head()

df.to_excel('test.xlsx')

When I run it, my Excel file contains only the first page...

Answer 1

You read every page of the PDF, but then you keep only the first element of the list that comes back.

df = tabula.read_pdf('test.pdf', pages = ['all'])[0]
                                                 ^^^

I think you need to remove that index and concatenate the results to get all the pages into Excel. Something like this:

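import pandas as pd
import tabula
dfs = tabula.read_pdf('test.pdf', pages='all')  # list of DataFrames, one per table found
df = pd.concat(dfs, ignore_index=True)          # stack them into one DataFrame
df.to_excel("filename.xlsx")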
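To see why the `[0]` drops everything after the first page, here is a minimal sketch that needs only pandas. The two small DataFrames are made-up stand-ins for the list that `tabula.read_pdf(..., pages='all')` returns, assuming one table per page; the column names and values are purely illustrative.

```python
import pandas as pd

# Hypothetical stand-ins for the list returned by
# tabula.read_pdf('test.pdf', pages='all'): one DataFrame per detected table.
page1 = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})
page2 = pd.DataFrame({"name": ["c", "d"], "value": [3, 4]})
dfs = [page1, page2]

# Indexing with [0] keeps only the first table (the first page here).
first_only = dfs[0]

# Concatenating keeps the rows of every table; ignore_index renumbers the rows.
combined = pd.concat(dfs, ignore_index=True)

print(len(first_only))  # 2
print(len(combined))    # 4
```

If the tables on different pages do not share the same columns, `pd.concat` still works but fills the missing cells with NaN, so it is worth checking the result before writing it out. To read only some pages rather than all of them, tabula-py documents `pages` as also accepting a page number, a list of numbers, or a range string such as '1-3'.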

Here is a good article on how to work with PDFs.