How to get the HTML of a fully loaded page (with JavaScript executed) as input in Java?
I need to parse a page. Everything works, except that some elements on the page are loaded dynamically. I used jsoup for the static elements; when I realized that I really need the dynamic elements too, I tried JavaFX. I read a lot of answers on Stack Overflow, and many of them recommended the JavaFX WebEngine. So I ended up with this code:
@Override
public void start(Stage primaryStage) {
    WebView webview = new WebView();
    final WebEngine webengine = webview.getEngine();
    // Run once the load worker reports that the page has finished loading
    webengine.getLoadWorker().stateProperty().addListener(
        new ChangeListener<State>() {
            @Override
            public void changed(ObservableValue<? extends State> ov, State oldState, State newState) {
                if (newState == Worker.State.SUCCEEDED) {
                    Document doc = webengine.getDocument();
                    // Serialize the DOM
                    OutputFormat format = new OutputFormat(doc);
                    // as a String
                    StringWriter stringOut = new StringWriter();
                    XMLSerializer serial = new XMLSerializer(stringOut, format);
                    try {
                        serial.serialize(doc);
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                    // Display the XML
                    System.out.println(stringOut.toString());
                }
            }
        });
    webengine.load("http://detail.tmall.com/item.htm?spm=a220o.1000855.0.0.PZSbaQ&id=19378327658");
    primaryStage.setScene(new Scene(webview, 800, 800));
    primaryStage.show();
}
I built a string from the org.w3c.dom.Document and printed it, but that did not help either. primaryStage.show() displayed the fully loaded page, with the element I need rendered on it, but that element was missing from the HTML in the output.
This is the third day I have been working on this issue. Lack of experience is of course my main problem, but I have to admit that I'm stuck. This is my first Java project after reading the Java Complete Reference. I'm doing it to get some real-world experience (and for fun). I want to write a parser for the Chinese "eBay".
Here is the problem and my test cases:
http://detail.tmall.com/item.htm?spm=a220o.1000855.0.0.PZSbaQ&id=19378327658 (I need to get the dynamically loaded discount "129.00")
http://item.taobao.com/item.htm?spm=a230r.1.14.67.MNq30d&id=22794120348 (I need "15.20")
As you can see, if you view these pages in a browser, you first see the original price and then, after a second or so, the discounted one.
Is it even possible to get these dynamic discounts from the HTML page? The other elements I need to parse are static. What should I try next: another library that renders HTML with JavaScript, or maybe something else? I really need some advice and don't want to give up.
The DOM model returned after Worker.State.SUCCEEDED should already have been processed by JavaScript.
Your code worked for me when tested with FX 7u40 and 8.0 dev. I see the following output in the log:
<DIV id="J_PromoBox"><EM class="tb-promo-price-type">夏季新品</EM><EM class="tm-yen">¥</EM>
<STRONG class="J_CurPrice">129.00</STRONG></DIV>
which is the dynamically loaded box with the data (129.00) you were looking for.
You may want to upgrade your JDK to 7u40 or revisit your log parsing algorithm.
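If the parsing step is what fails, one option is to read the price from the serialized DOM with the JDK's XML parser rather than searching the log text by hand. Below is a minimal sketch, assuming the serializer output is well-formed XML as in the log above. The class name PriceExtractor and the method extractByClass are hypothetical names chosen for this illustration, and the sample string is a shortened version of the J_PromoBox snippet.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; not part of JavaFX or jsoup.
class PriceExtractor {

    /**
     * Returns the trimmed text of the first element whose class attribute
     * contains className as a whole token, or empty if no element matches.
     * Assumes the input is well-formed XML.
     */
    static Optional<String> extractByClass(String xml, String className) {
        Document doc;
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }

        // "*" matches every element, in document order
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            // getAttribute returns "" when the attribute is absent
            for (String token : el.getAttribute("class").trim().split("\\s+")) {
                if (token.equals(className)) {
                    return Optional.of(el.getTextContent().trim());
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Shortened version of the serialized promo box shown in the log
        String xml = "<DIV id=\"J_PromoBox\"><EM class=\"tm-yen\">\uFFE5</EM>"
                + "<STRONG class=\"J_CurPrice\">129.00</STRONG></DIV>";
        System.out.println(extractByClass(xml, "J_CurPrice").orElse("not found")); // prints 129.00
    }
}
```

In the code from the question, the string to pass in would be stringOut.toString(). An empty result then means the element really was absent from the DOM at the moment the listener fired, rather than being missed by the text search.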
It sounds like you want the rendered DOM from a dynamic page after the JavaScript on the page has finished modifying the original HTML. This would not be easy to do in Java, as you would need to implement browser-like functionality with an embedded JavaScript engine. If you only care about reading a web page from Java, you might want to look into Selenium, since it takes control of a browser and allows you to pull the rendered HTML into Java.
This answer might also help:
Render JavaScript and HTML in (any) Java Program (Access rendered DOM Tree)?