How to convert a webpage to PDF in Python, like the "Save as PDF" option in the print dialog

Question

I have a website that requires login (authentication). It has a messages page, and I want to convert every comment on it to a PDF. Until now I have been clicking on each comment, choosing Print in Firefox, and saving the comment stream as a PDF. There are too many to do by hand, so I decided to write a Python script, but I am having issues. Here is my code:

import mechanize
import pdfkit
import os

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [("User-agent","Firefox")]
sign_in = br.open("www.mysite.com")

br.select_form(nr = 0)
br["username"] = "username"
br["password"] = "password"
logged_in = br.submit()

br.open("comments_page")
all_comment_links = []

# Iterate the links
for link in br.links():
  if "comment" in link.url:
    all_comment_links.append(link)

for l in all_comment_links:
  ret = br.open("comments_page").read()
  pdfkit.from_url(l.url, l.text + ".pdf")
  # pdfkit.from_string(ret, l.text + ".pdf")

  file = open(l.text + ".html", "w")
  file.write(ret)
  file.close()

# try from file
#for f in glob.glob("*.html"):
#  pdfkit.from_file(f, f.replace(".html", ".pdf"))

I am trying to use the pdfkit library to convert each comment page to PDF, but have been unsuccessful. I have tried passing the URL (pdfkit.from_url), passing the HTML string (pdfkit.from_string), and saving the HTML to a file first (pdfkit.from_file), but I cannot figure out why none of these work. As far as I can tell, the mechanize part works, because my HTML files contain all the comments I want with the right content. I have looked around for other approaches, but this is the closest I have gotten to what I want.

The script doesn't throw any errors; it just hangs on the first PDF as if it cannot access the page or its content. I have left it running for a while, but only the first PDF file is created, and when I try to open it, the viewer says it is corrupt. Am I using pdfkit wrong, or should I be using something else to convert these pages to PDF? Thanks, any help is appreciated. Running on Mac OS X.

Answer 1

My initial guess is that pdfkit does not receive any session information from mechanize, so it requests pages that sit behind authentication without being logged in.
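If you do want pdfkit to fetch the URL itself, one option is to hand it the session cookies. This is only a sketch: it assumes you attached a cookie jar to the browser before logging in (for example cj = mechanize.CookieJar() followed by br.set_cookiejar(cj)) and that the site authenticates purely by cookie. The helper names cookie_options and url_to_pdf are made up for illustration.

```python
def cookie_options(cookiejar):
    """Build a pdfkit options dict that forwards every cookie in the jar."""
    # pdfkit passes each (name, value) pair to wkhtmltopdf as a --cookie argument.
    return {"cookie": [(c.name, c.value) for c in cookiejar]}


def url_to_pdf(url, pdf_path, cookiejar):
    """Render a login-protected URL to PDF using the cookies from the jar."""
    # Imported here so the helper above can be used without pdfkit installed.
    import pdfkit  # third-party; needs the wkhtmltopdf binary

    pdfkit.from_url(url, pdf_path, options=cookie_options(cookiejar))
```

Whether the site accepts the forwarded cookies depends on how it tracks sessions, so treat this as something to try rather than a guaranteed fix.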

You should probably download the HTML with mechanize first, then convert it locally.
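Roughly, that could look like the sketch below. It assumes br is your logged-in mechanize browser and links is your all_comment_links list; safe_filename and save_links_as_pdf are hypothetical helpers, not part of mechanize or pdfkit. Note that the loop in your question re-opens "comments_page" on every iteration instead of the comment link, so each saved file would hold the same page.

```python
import os
import re


def safe_filename(text, default="comment"):
    """Reduce link text to characters that are safe in a file name."""
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", text or "").strip("_")
    return name or default


def save_links_as_pdf(browser, links, out_dir):
    """Fetch each link with the logged-in browser, then convert the saved HTML."""
    import pdfkit  # third-party; needs the wkhtmltopdf binary

    out_dir = os.path.abspath(out_dir)
    os.makedirs(out_dir, exist_ok=True)
    for link in links:
        base = os.path.join(out_dir, safe_filename(link.text))
        # The browser carries the session, so this request is authenticated.
        html = browser.open(link.absolute_url).read()
        with open(base + ".html", "wb") as fh:
            fh.write(html)
        # Convert the local copy, so wkhtmltopdf never has to log in.
        pdfkit.from_file(base + ".html", base + ".pdf")
```

Two caveats: links with identical text would overwrite each other, and stylesheets or images referenced by relative URLs will probably not load from the local copy, so the PDF may look plainer than the browser's print output.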

However, since you say converting from a file does not work either, try applying pdfkit to a local file manually in an interactive Python shell and see what error you get.
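For that manual check, a tiny wrapper like this makes the failure visible instead of letting it pass silently. The name try_convert is made up, and the file names in the usage comment are placeholders for one of your saved HTML files.

```python
def try_convert(convert, html_path, pdf_path):
    """Run one conversion and return the error text, or None on success."""
    try:
        convert(html_path, pdf_path)
    except Exception as exc:  # broad on purpose: the goal is to see what failed
        return "{}: {}".format(type(exc).__name__, exc)
    return None


# In the interactive shell (requires pdfkit and wkhtmltopdf):
#   import pdfkit
#   print(try_convert(pdfkit.from_file, "first_comment.html", "first_comment.pdf"))
```

If this hangs as well, the problem is likely in the conversion step itself rather than in the login.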

Another possibility is that pdfkit's input or output files are not in the directory you expect, so try passing absolute paths as parameters.
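A small sketch of the absolute-path idea, using only the standard library. The function name absolute_io_paths and the out_dir parameter are illustrative assumptions.

```python
import os


def absolute_io_paths(html_name, out_dir="."):
    """Return absolute paths for an HTML input and its matching PDF output."""
    html_path = os.path.abspath(html_name)
    base = os.path.splitext(os.path.basename(html_path))[0]
    pdf_path = os.path.join(os.path.abspath(out_dir), base + ".pdf")
    return html_path, pdf_path
```

You would then pass both results to pdfkit.from_file, which removes any doubt about which working directory the script or wkhtmltopdf is using.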

solution1
0 2017-10-18 13:58:44
