How to convert a webpage to PDF in Python, like the "Save as PDF" option in the print dialog

I have a website that requires login (authentication). It has a messages page, and I want to convert all of its comments to PDFs. Until now I have been clicking on each comment, choosing Print in Firefox, and saving the comment stream as a PDF. There are too many comments to do this by hand, so I decided to write a Python script, but I am having issues. Here is my code:

import mechanize
import pdfkit
import os

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [("User-agent","Firefox")]
sign_in = br.open("www.mysite.com")

br.select_form(nr = 0)
br["username"] = "username"
br["password"] = "password"
logged_in = br.submit()

br.open("comments_page")
all_comment_links = []

# Iterate the links
for link in br.links():
  if "comment" in link.url:
    all_comment_links.append(link)

for l in all_comment_links:
  ret = br.open("comments_page").read()
  pdfkit.from_url(l.url, l.text + ".pdf")
  # pdfkit.from_string(ret, l.text + ".pdf")

  file = open(l.text + ".html", "w")
  file.write(ret)
  file.close()

# try from file
#for f in glob.glob("*.html"):
#  pdfkit.from_file(f, f.replace(".html", ".pdf"))

I am trying to use the pdfkit library to convert each comment page to PDF, but have been unsuccessful. I have tried using the URL (pdfkit.from_url), the HTML string (pdfkit.from_string), and the HTML saved to a file (pdfkit.from_file), but I cannot figure out why none of these work. As far as I know, the mechanize part works, because my HTML files contain all the comments I want with the right content. I have looked around for different approaches, but this is as close as I have gotten to what I want.

The script doesn't throw any errors. It just hangs on the first PDF, as if it cannot access the page or its content. I have left it running for a while, but only the first PDF file is created, and when I try to open it, it is reported as corrupt. Am I using pdfkit wrong, or should I be using something else to convert these pages to PDF? Thanks, any help is appreciated. I am running on Mac OS X.

My initial guess is that pdfkit does not receive any session information from mechanize, so it tries to fetch pages behind authentication without being logged in.
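If that guess is right, one possible workaround is to hand the logged-in session's cookies to wkhtmltopdf, which pdfkit wraps. This is only a sketch, assuming the site authenticates purely through cookies. The helper names cookie_options and pdf_from_logged_in_url are hypothetical; the cookie option uses the list-of-tuples form that the pdfkit README describes for repeatable options.

```python
def cookie_options(cookiejar):
    """Build a pdfkit options dict that forwards the cookies in the jar.

    `cookiejar` is any iterable of cookie objects exposing `.name` and
    `.value`, such as a mechanize.CookieJar. Cookies without a value are
    skipped, since an empty value cannot be passed on the command line.
    """
    return {"cookie": [(c.name, c.value) for c in cookiejar if c.value]}


def pdf_from_logged_in_url(url, output_path, cookiejar):
    """Render `url` to `output_path`, sending the session cookies along."""
    import pdfkit  # imported lazily so cookie_options() works without it

    return pdfkit.from_url(url, output_path, options=cookie_options(cookiejar))
```

With mechanize this would mean creating cj = mechanize.CookieJar(), calling br.set_cookiejar(cj) before logging in, and passing cj to pdf_from_logged_in_url afterwards. It will not help if the site also checks something other than cookies.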

You should probably download the HTML with mechanize first, then convert it locally.
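A rough sketch of that approach, assuming br is the already logged-in mechanize.Browser from the question and that the pages are UTF-8 encoded. The helper names safe_filename and save_comments_as_pdf are made up for illustration.

```python
import os
import re


def safe_filename(text, default="comment"):
    """Turn link text into a name that is safe to use as a file name."""
    name = re.sub(r"[^\w.-]+", "_", text or "").strip("_")
    return name or default


def save_comments_as_pdf(br, links, out_dir):
    """Fetch each link with the logged-in browser and convert the HTML locally."""
    import pdfkit  # lazy import: only needed when actually converting

    os.makedirs(out_dir, exist_ok=True)
    for link in links:
        # Download through mechanize so the request carries the login session.
        html = br.open(link.absolute_url).read()
        if isinstance(html, bytes):
            html = html.decode("utf-8", errors="replace")  # assumes UTF-8 pages
        target = os.path.join(out_dir, safe_filename(link.text) + ".pdf")
        # Convert the downloaded markup; pdfkit itself never contacts the site.
        pdfkit.from_string(html, target)
```

Because the HTML is converted from a string, stylesheets and images referenced by relative URLs may be missing from the resulting PDF.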

However, since you say you are also not getting results from a file, you should open an interactive Python shell, apply pdfkit to a local file manually, and see what error you get.
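One way to make such a manual check more informative is to call the wkhtmltopdf binary directly with a timeout, so a hang turns into a visible message instead of a stuck shell. This is a sketch, not what pdfkit does internally: the command is a simplified version of what pdfkit builds, and the helper names are hypothetical.

```python
import subprocess


def wkhtmltopdf_command(html_path, pdf_path):
    """Simplified command line for converting one local HTML file."""
    return ["wkhtmltopdf", html_path, pdf_path]


def try_convert(html_path, pdf_path, timeout=60):
    """Run the conversion once and return (exit_code, stderr) for inspection."""
    try:
        result = subprocess.run(
            wkhtmltopdf_command(html_path, pdf_path),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except FileNotFoundError:
        return None, "wkhtmltopdf is not installed or not on PATH"
    except subprocess.TimeoutExpired:
        return None, "wkhtmltopdf did not finish within %d seconds" % timeout
    return result.returncode, result.stderr
```

Calling try_convert on one of the saved .html files and printing the returned stderr should show whether the binary is missing, failing, or stalling on a resource it cannot load.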

Another possibility is that the pdfkit input or output files are not in the directory you expect, so try passing absolute paths as parameters.
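A small sketch of the absolute-path suggestion, assuming the PDF should be written next to the HTML file. The helper names pdf_path_for and convert_with_absolute_paths are illustrative only.

```python
import os
from pathlib import Path


def pdf_path_for(html_path):
    """Return the absolute path of the PDF that sits next to `html_path`."""
    return Path(os.path.abspath(html_path)).with_suffix(".pdf")


def convert_with_absolute_paths(html_path):
    """Convert a local HTML file, passing only absolute paths to pdfkit."""
    import pdfkit  # lazy import: pdf_path_for() works without pdfkit installed

    source = os.path.abspath(html_path)
    target = pdf_path_for(html_path)
    pdfkit.from_file(source, str(target))
    return target
```

Printing the returned path makes it obvious where the output was written, regardless of the directory the script was started from.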
