
How to scrape HTTPS JavaScript web pages

Original post: 2011-04-06 05:41:46. Tags: java, javascript, python, https, web-scraping

I am trying to monitor day-to-day prices from an online catalogue. The site uses HTTPS and generates the catalogue pages with JavaScript. How can I interface with the site and make it generate the pages I need?

I have done this with other sites where the HTML can easily be accessed, and I have no problem parsing the HTML once it has been generated.

I only know Python and Java.

Thanks in advance.

3 answers

Take a look at HtmlUnit, a headless Java browser that can be fully controlled by your code. A simple example can be seen here: http://htmlunit.sourceforge.net/gettingStarted.html

(Obligatory warning: by screen-scraping the site you may be breaking its ToS and possibly opening yourself to lawsuits; check whether you are allowed to do it before you start.)

If they've created a Web API that their JavaScript talks to, you might be able to scrape that directly rather than going the HTML route.
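As a rough sketch of that first option, assume the catalogue's JavaScript fetches its data as JSON from an endpoint you have spotted in the browser's network inspector. The URL and the `items`, `name`, and `price` field names below are hypothetical placeholders, not anything known about the actual site. Current Python versions verify HTTPS certificates by default in `urllib`, so HTTPS itself should need no special handling.

```python
import json
import urllib.request

# Hypothetical endpoint: replace it with whatever URL the site's JavaScript
# actually requests (visible in the browser's network inspector).
CATALOGUE_URL = "https://example.com/api/catalogue?page=1"


def fetch_json(url, timeout=10):
    """Download a URL over HTTPS and decode the response body as JSON."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_prices(payload):
    """Map item name -> price.

    Assumes a payload shaped like {"items": [{"name": ..., "price": ...}]},
    which is an invented layout used only for illustration.
    """
    return {item["name"]: float(item["price"]) for item in payload.get("items", [])}
```

Calling `extract_prices(fetch_json(CATALOGUE_URL))` once a day and storing the result would give the price history; if the site requires a login or session cookie, the request would need the same headers the browser sends.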

If they've obfuscated it, or that option isn't available for some other reason, you'll basically need a web browser to evaluate the JavaScript and then scrape the browser's DOM. Perhaps write a browser plugin?
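If a real browser is needed, one way to drive it from Python is Selenium WebDriver. This is swapped in here purely as an illustration: the answer does not mention it, and it has to be installed separately along with a browser. The sketch loads the page, lets the browser run the JavaScript, and then parses the rendered HTML with the standard library. The `price` class name is a hypothetical placeholder.

```python
from html.parser import HTMLParser


def render_page(url):
    """Return the page's HTML after the browser has run its JavaScript.

    Assumes the third-party `selenium` package and Firefox are installed.
    """
    from selenium import webdriver  # imported lazily: third-party dependency

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


class PriceParser(HTMLParser):
    """Collect the text of elements whose class list includes 'price'.

    'price' is a hypothetical class name, and the parser assumes those
    elements contain only text (no nested tags).
    """

    def __init__(self):
        super().__init__()
        self.prices = []
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._capturing = True

    def handle_data(self, data):
        if self._capturing and data.strip():
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        # Any closing tag ends the capture, per the text-only assumption.
        self._capturing = False


def extract_price_texts(html):
    """Return the price strings found in a rendered HTML document."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.prices
```

A typical call would be `extract_price_texts(render_page(url))`. Pages that load their data asynchronously may not be fully rendered when `page_source` is read, so an explicit wait for the relevant element may be needed first.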

I use WebKit through its Python bindings for scraping JavaScript content. See here for example.