
How can I scrape several pages of a PDF?

My code is like this:

df = tabula.read_pdf('test.pdf', pages = ['all'])[0]

df.head()

df.to_excel('test.xlsx')

When I run it, I get only the first page in my Excel file...

You read the whole PDF with all pages, but then you fetch only the first element of the result.

df = tabula.read_pdf('test.pdf', pages = ['all'])[0]
                                                 ^^^
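
To make the effect of that index concrete, here is a small sketch that uses hand-made DataFrames as stand-ins for the result of read_pdf (the column name and values are invented for illustration). It assumes, as recent tabula-py versions do by default, that read_pdf returns a plain list with one DataFrame per detected table:

```python
import pandas as pd

# Stand-ins for what tabula.read_pdf(..., pages='all') might return:
# a plain Python list with one DataFrame per table found.
tables = [
    pd.DataFrame({"item": ["a", "b"]}),  # e.g. a table from page 1
    pd.DataFrame({"item": ["c"]}),       # e.g. a table from page 2
]

first_only = tables[0]   # this is all that the [0] in the question keeps
print(len(tables))       # number of tables extracted
print(len(first_only))   # number of rows in the first table only
```

Everything after the first list element is silently dropped, which is why only the first page reaches the Excel file.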

I think you have to remove that index and concatenate the returned DataFrames to get all pages into Excel. Something like this:

import pandas as pd
import tabula

dfs = tabula.read_pdf('test.pdf', pages='all')  # list of DataFrames, one per table
df = pd.concat(dfs)
df.to_excel("filename.xlsx")
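
A slightly more defensive variant of the same idea is sketched below. The helper names combine_tables and pdf_to_excel are hypothetical, not part of tabula or pandas. The sketch assumes every table shares the same columns, skips empty tables, and renumbers the rows so the Excel index does not restart with each table:

```python
import pandas as pd


def combine_tables(tables):
    """Stack a list of DataFrames into one and renumber the rows.

    Hypothetical helper: assumes the tables have compatible columns.
    """
    non_empty = [t for t in tables if not t.empty]
    if not non_empty:
        # Nothing was extracted; return an empty frame instead of failing.
        return pd.DataFrame()
    return pd.concat(non_empty, ignore_index=True)


def pdf_to_excel(pdf_path, xlsx_path):
    """Read every table in the PDF and write them to one Excel sheet."""
    import tabula  # imported here because it needs tabula-py and Java

    tables = tabula.read_pdf(pdf_path, pages="all")
    combine_tables(tables).to_excel(xlsx_path, index=False)
```

Calling pdf_to_excel('test.pdf', 'test.xlsx') would then cover the case from the question. If the tables on different pages have different columns, pd.concat fills the gaps with NaN, so writing each table to its own sheet may be the better choice there.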

Here is a good article on how to handle PDFs.
