
Autogenerate HTTP screen scraping Java code

Original post: 2009-01-08 01:37:15 | Tags: java, http, selenium, screen-scraping

I need to screen scrape some data from a website because it isn't available via their web service. When I've needed to do this previously, I've written the Java code myself, using Apache's HTTP client library to make the relevant HTTP calls to download the data. I worked out which calls I needed to make by clicking through the relevant screens in a browser while using the Charles web proxy to log the corresponding HTTP calls.
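For reference, the kind of hand-written download code described above might look roughly like the sketch below. It uses the JDK's built-in `java.net.http.HttpClient` (Java 11+) instead of Apache's HTTP client, purely so that it runs without extra dependencies. The class name, method names, and the Accept header are assumptions made for illustration, not anything taken from the original code.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class ScrapeSketch {
    // Builds a plain GET request for one of the URLs seen in the proxy log.
    // The Accept header value is an illustrative assumption.
    static HttpRequest buildGet(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .header("Accept", "text/html")
                .GET()
                .build();
    }

    // Sends the request and returns the response body as text,
    // failing loudly on anything other than HTTP 200.
    static String fetch(HttpClient client, HttpRequest request)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ScrapeSketch.java <url>");
            return;
        }
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        System.out.println(fetch(client, buildGet(args[0])));
    }
}
```

Each screen clicked through in the browser typically becomes one more `buildGet`/`fetch` pair (or a POST equivalent), which is where the tedium comes from.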

As you can imagine, this is a fairly tedious process, and I'm wondering if there's a tool that can actually generate the Java code that corresponds to a browser session. I expect the generated code wouldn't be as pretty as code written manually, but I could always tidy it up afterwards. Does anyone know if such a tool exists? Selenium is one possibility I'm aware of, though I'm not sure if it supports this exact use case.
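To make the idea concrete, here is a minimal sketch of what such a generator could do: take a list of logged calls and emit one Java statement per call. This is not an existing tool. `LoggedCall`, `generate`, and the shape of the emitted code (which targets the JDK's `java.net.http` API) are all hypothetical, and a real proxy log would also carry headers, form bodies, and cookies.

```java
import java.util.List;

class SessionCodegenSketch {
    // One logged call from a proxy session; only method and URL in this sketch.
    static final class LoggedCall {
        final String method;
        final String url;

        LoggedCall(String method, String url) {
            this.method = method;
            this.url = url;
        }
    }

    // Escapes backslashes and double quotes so the text is a valid Java string literal.
    static String literal(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Emits one HttpRequest declaration per logged call, in session order.
    static String generate(List<LoggedCall> calls) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < calls.size(); i++) {
            LoggedCall c = calls.get(i);
            out.append("HttpRequest request").append(i)
               .append(" = HttpRequest.newBuilder(URI.create(")
               .append(literal(c.url)).append("))")
               .append(".method(").append(literal(c.method))
               .append(", HttpRequest.BodyPublishers.noBody()).build();\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<LoggedCall> session = List.of(
                new LoggedCall("GET", "http://example.com/login"),
                new LoggedCall("GET", "http://example.com/report?id=1"));
        System.out.print(generate(session));
    }
}
```

The output is exactly the "not as pretty as hand-written" code the question anticipates: flat, repetitive, and meant to be tidied up afterwards.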

Thanks, Don

5 answers

I would also add +1 for HtmlUnit, since its functionality is very powerful: if you need behaviour 'as though a real browser was scraping and using the page', it's definitely the best option available. HtmlUnit executes (if you want it to) the Javascript in the page.

It currently has full-featured support for all the main Javascript libraries and will execute JS code using them. Correspondingly, you can get handles to the Javascript objects in the page programmatically within your test.

If, however, the scope of what you are trying to do is narrower, more along the lines of reading some of the HTML elements, and you don't much care about Javascript, then using NekoHTML should suffice. It's similar to JDom in giving programmatic, rather than XPath, access to the tree. You would probably need to use Apache's HttpClient to retrieve pages.
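As a rough illustration of what "programmatic access to the tree" means, the sketch below walks a parsed document and collects link targets. It uses the JDK's own XML parser as a stand-in and therefore assumes the markup is well formed; the point of NekoHTML is that it tolerates real-world, messy HTML, which this stand-in does not. The class and method names are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomWalkSketch {
    // Returns the href of every <a> element that has one, in document order.
    // Assumes well-formed markup, because the JDK parser is strict.
    static List<String> linkTargets(String wellFormedHtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new InputSource(new StringReader(wellFormedHtml)));
            NodeList anchors = doc.getElementsByTagName("a");
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < anchors.getLength(); i++) {
                Element anchor = (Element) anchors.item(i);
                if (anchor.hasAttribute("href")) {
                    hrefs.add(anchor.getAttribute("href"));
                }
            }
            return hrefs;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<a href=\"/report?id=1\">Report</a>"
                + "<a name=\"top\">Top</a>"
                + "</body></html>";
        System.out.println(linkTargets(page)); // prints [/report?id=1]
    }
}
```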

The manageability.org blog has an entry which lists a whole bunch of web page scraping tools for Java. I do not seem to be able to reach it right now, but I did find a text-only representation in Google's cache here.

You should take a look at HtmlUnit - it was designed for testing websites but works great for screen scraping and navigating through multiple pages. It takes care of cookies and other session-related stuff.
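To show what "taking care of cookies" saves you from, here is a small sketch of the bookkeeping done by hand with the JDK's `java.net.CookieManager`: store the cookies from one response, then ask which Cookie header values belong on the next request. This is not HtmlUnit's API, and the class name, method name, URLs, and session cookie are invented for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.List;
import java.util.Map;

class CookieSessionSketch {
    // Records the Set-Cookie headers received from `from`, then returns the
    // Cookie header values that should accompany a follow-up request to `next`.
    static List<String> cookiesForNextRequest(CookieManager jar, URI from,
            Map<String, List<String>> responseHeaders, URI next) {
        try {
            jar.put(from, responseHeaders);
            Map<String, List<String>> outgoing = jar.get(next, Map.of());
            return outgoing.getOrDefault("Cookie", List.of());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In-memory cookie store; only accept cookies from the server that set them.
        CookieManager jar = new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER);
        Map<String, List<String>> loginResponse =
                Map.of("Set-Cookie", List.of("JSESSIONID=abc123; Path=/"));
        List<String> cookies = cookiesForNextRequest(
                jar,
                URI.create("http://example.com/login"),
                loginResponse,
                URI.create("http://example.com/report"));
        System.out.println(cookies);
    }
}
```

A browser-like library does this on every request and response without being asked, which is why multi-page navigation needs so little code there.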

I would say that I personally like to use HtmlUnit and Selenium as my favourite screen scraping tools.

A tool called The Grinder allows you to script a session to a site by going through its proxy. The output is Python (runnable in Jython).