
How can I crawl an HTML5 website and convert its HTML content to PDF (using a Python or Ruby library)?

I'm looking for an engine/solution/framework/gem/egg/lib/whatever for either Ruby or Python to log into a website, crawl HTML5 content (mainly charts on a canvas), and be able to convert it into a PDF file (or image).

I'm able to write crawling scripts in mechanize so I can log onto the website and crawl the data, but mechanize does not understand complex JavaScript + HTML5.

So basically I'm looking for an HTML5/JavaScript interpreter.

This question is a bit confusing, so apologies in advance; you may want to re-read this answer after reading the question again.

Your question has two parts:

1. How can I crawl a website

Crawling can be done using Mechanize, but as you said, it doesn't handle JavaScript very well. One alternative is to use Capybara-webkit or Selenium (driving Firefox or Chrome), both of which run a real browser engine.

Usually these tools are used for testing, but you can drive them from Ruby code to log in and navigate the various pages.

2. How can I convert the output to PDF

If you need to convert the crawled content to PDF directly, I don't think there is a built-in way. You can, however, take a screenshot (also useful for testing) with Capybara-webkit or Selenium, and converting that screenshot to PDF is then just a matter of piping it through a command-line utility.

If you're looking for a true HTML-to-PDF converter (usually used to generate reports from views in a Rails app), then use PDFKit.

Basically it drives a headless WebKit browser (wkhtmltopdf) that can output to PDF, and it's really simple to work with.
