How can I scrape several pages of a pdf?

My code is like this:

df = tabula.read_pdf('test.pdf', pages = ['all'])[0]

df.head()

df.to_excel('test.xlsx')

When I run it, only the first page ends up in my Excel file.

You read every page of the PDF, but read_pdf returns a list of tables and you keep only its first element.

df = tabula.read_pdf('test.pdf', pages = ['all'])[0]
                                                 ^^^

I think you have to remove the [0] and concatenate the resulting DataFrames to get all pages into Excel. Something like this:

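import pandas as pd
import tabula

dfs = tabula.read_pdf('test.pdf', pages='all')  # list of DataFrames
df = pd.concat(dfs, ignore_index=True)  # stack the tables into one
df.to_excel("filename.xlsx")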
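
To make the difference between taking element 0 and concatenating everything easier to see, here is a minimal sketch. The helper names combine_tables and pdf_to_excel are hypothetical (they are not part of tabula-py or pandas), and the two small DataFrames are stand-in data that only mimic what read_pdf might return for a two-page PDF. It also assumes the tables on each page share the same columns.

```python
import pandas as pd


def combine_tables(tables):
    """Stack a list of DataFrames (one per extracted table) into one."""
    if not tables:
        # pd.concat raises ValueError on an empty list, so return an empty frame.
        return pd.DataFrame()
    # ignore_index=True renumbers rows 0..n-1 instead of repeating each table's index.
    return pd.concat(tables, ignore_index=True)


def pdf_to_excel(pdf_path, xlsx_path):
    """Hypothetical helper: extract every page of pdf_path into one Excel sheet."""
    # Imported here so combine_tables can be tried without tabula-py and Java.
    import tabula

    tables = tabula.read_pdf(pdf_path, pages="all")  # a list of DataFrames
    combine_tables(tables).to_excel(xlsx_path, index=False)


# Stand-in data that mimics two tables extracted from two pages.
page1 = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})
page2 = pd.DataFrame({"name": ["c"], "value": [3]})

first_only = [page1, page2][0]            # what the [0] in the question keeps
combined = combine_tables([page1, page2])  # rows from both pages
```

With a real file, the call would look like pdf_to_excel('test.pdf', 'test.xlsx'). If the pages hold tables with different columns, concatenation fills the gaps with NaN, so writing each table to its own sheet may be a better fit in that case.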

Here is a good article on how to handle PDFs.
