
How can I convert a web page with JavaScript to plain HTML?

I want to convert some web pages that use JavaScript to plain HTML, and I have found several ways to do it (please tell me if I'm wrong):

  1. Use Jython, for example: http://blog.databigbang.com/web-scraping-ajax-and-javascript-sites/
  2. Use Java together with HtmlUnit
  3. Use a proxy, for example: http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/
  4. Use Python together with Qt or PyV8

I only want to build a small tool for this. Python is my first choice, but installing V8 and Qt seemed somewhat complicated.

So I tried to make a proxy with Gecko, but it seems to need a DISPLAY, which I cannot provide on a remote Linux server.

Now I am trying Jython, but there seems to be no simple way to just convert a whole page to plain HTML.

What I really want to ask is: is there a way to convert a web page that contains JavaScript to plain HTML, just like the browser does? Can Node.js do this job?

I've recently built a server on top of PhantomJS that does this. I highly recommend this route.

http://phantomjs.org/

Basically, you write a quick script that has PhantomJS run the page, and configure a trigger method that lets you know the page is finished and sends the data off. My version used the built-in HTTP server, so PhantomJS easily served up the results on its own. This takes about 15 lines of code to do. (Sorry, can't paste it here... wrote it on work time. But, check out the example on their home page. It's almost complete!)
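Since I can't share the original, here is a rough sketch of the same idea, driven from Python because that was your first choice. It is not my server code: it skips the built-in HTTP server and just prints the rendered page to stdout. The names `RENDER_SCRIPT`, `build_command`, and `render_html` are made up for illustration, the fixed 2-second delay is an assumed stand-in for a real "page is finished" trigger, and it assumes a `phantomjs` binary is on your PATH. PhantomJS runs headless, so it should not need a DISPLAY.

```python
import os
import subprocess
import sys
import tempfile

# PhantomJS script (JavaScript), kept as a string so this file is self-contained.
RENDER_SCRIPT = r"""
var system = require('system');
var page = require('webpage').create();
var url = system.args[1];

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Failed to load ' + url);
        phantom.exit(1);
        return;
    }
    // Assumption: a fixed delay stands in for a real "page is finished" trigger.
    setTimeout(function () {
        console.log(page.content);  // the DOM serialized after scripts have run
        phantom.exit(0);
    }, 2000);
});
"""


def build_command(script_path, url, phantomjs="phantomjs"):
    """Return the argv list used to run the render script under PhantomJS."""
    return [phantomjs, script_path, url]


def render_html(url, phantomjs="phantomjs"):
    """Load `url` in PhantomJS and return the HTML after JavaScript has run."""
    # Write the script to a temporary file because PhantomJS takes a script path.
    with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as f:
        f.write(RENDER_SCRIPT)
        script_path = f.name
    try:
        result = subprocess.run(
            build_command(script_path, url, phantomjs),
            check=True,              # raise if PhantomJS exits with a non-zero code
            stdout=subprocess.PIPE,
            universal_newlines=True,
            timeout=60,
        )
        return result.stdout
    finally:
        os.remove(script_path)


if __name__ == "__main__" and len(sys.argv) > 1:
    print(render_html(sys.argv[1]))
```

Saved as, say, `render.py`, it would be run as `python render.py http://example.com/`. For pages that keep loading data after the initial load, the fixed delay would need to be replaced with a check for whatever element or variable signals that the page is done.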
