
How to get the HTML of a fully loaded page (with JavaScript) as input in Java?

I need to parse a page. Everything works except that some elements on the page are loaded dynamically. I used jsoup for the static elements; when I realized that I really need the dynamic ones, I tried JavaFX. I read a lot of answers on Stack Overflow, and many of them recommended the JavaFX WebEngine. So I ended up with this code:

// Uses javafx.scene.web.WebView/WebEngine, javafx.concurrent.Worker, org.w3c.dom.Document,
// java.io.StringWriter/IOException, and OutputFormat/XMLSerializer from the Xerces
// serializer package (e.g. org.apache.xml.serialize).
@Override
public void start(Stage primaryStage) {
    WebView webview = new WebView();
    final WebEngine webengine = webview.getEngine();
    // React once the load worker reports that the page has finished loading
    webengine.getLoadWorker().stateProperty().addListener(
            new ChangeListener<State>() {
                @Override
                public void changed(ObservableValue<? extends State> ov, State oldState, State newState) {
                    if (newState == Worker.State.SUCCEEDED) {
                        Document doc = webengine.getDocument();
                        // Serialize the DOM as a String
                        OutputFormat format = new OutputFormat(doc);
                        StringWriter stringOut = new StringWriter();
                        XMLSerializer serial = new XMLSerializer(stringOut, format);
                        try {
                            serial.serialize(doc);
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                        // Display the XML
                        System.out.println(stringOut.toString());
                    }
                }
            });
    webengine.load("http://detail.tmall.com/item.htm?spm=a220o.1000855.0.0.PZSbaQ&id=19378327658");
    primaryStage.setScene(new Scene(webview, 800, 800));
    primaryStage.show();
}

I built a string from the org.w3c.dom.Document and printed it, but that did not help either. primaryStage.show() showed me the fully loaded page (with the element I need rendered on it), but that element was missing from the HTML in the output.
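
The DOM-to-string step can also be done with the standard javax.xml.transform API instead of OutputFormat and XMLSerializer. The following is a minimal sketch, not the code from the question: the class name DomToString and the stand-in document built in main are assumptions for illustration. In the JavaFX listener you would pass the result of webengine.getDocument() instead.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, not part of the original question.
public class DomToString {

    /** Serializes a DOM document to a String with the standard JAXP Transformer. */
    public static String serialize(Document doc) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Skip the <?xml ...?> header so only the markup itself is returned
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in document (assumption); a real run would use webengine.getDocument()
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element strong = doc.createElement("strong");
        strong.setAttribute("class", "J_CurPrice");
        strong.setTextContent("129.00");
        doc.appendChild(strong);
        System.out.println(serialize(doc));
    }
}
```

This avoids depending on the Xerces serializer classes, which are not part of the public Java API.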

This is the third day I have been working on this issue. Lack of experience is of course my main problem, but I have to say it: I'm stuck. This is my first Java project after reading the Java complete reference. I'm doing it to get some real-world experience (and for fun). I want to write a parser for the Chinese "eBay".

Here is the problem and my test cases:

http://detail.tmall.com/item.htm?spm=a220o.1000855.0.0.PZSbaQ&id=19378327658 (I need to get the dynamically loaded discount "129.00")

http://item.taobao.com/item.htm?spm=a230r.1.14.67.MNq30d&id=22794120348 (I need "15.20")

As you can see, if you open these pages in a browser, you first see the original price and, after a second or so, the discount.

Is it even possible to get these dynamic discounts from the HTML page? The other elements I need to parse are static. What should I try next: another library that renders HTML with JavaScript, or maybe something else? I really need some advice and don't want to give up.

The DOM model returned after Worker.State.SUCCEEDED should already have been processed by JavaScript.

Your code worked for me when tested with FX 7u40 and 8.0 dev. I see the following output in the log:

<DIV id="J_PromoBox"><EM class="tb-promo-price-type">夏季新品</EM><EM class="tm-yen">¥</EM>    
<STRONG class="J_CurPrice">129.00</STRONG></DIV>

This is the dynamically loaded box with the data (129.00) you were looking for.

You may want to upgrade your JDK to 7u40 or revisit your log parsing algorithm.
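
Once the DOM is available, the price can be read from it directly rather than by searching the printed log. Here is a minimal sketch, assuming the price sits in an element whose class is J_CurPrice, as in the output above. PriceExtractor, parse and textByClass are hypothetical names, and the sample string in main is a reduced copy of that output, parsed as XML only to stand in for webengine.getDocument().

```java
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the original answer.
public class PriceExtractor {

    /** Parses a well-formed markup snippet into a DOM document (stand-in for the WebEngine DOM). */
    public static Document parse(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }

    /** Returns the trimmed text of the first element that carries the given class token. */
    public static Optional<String> textByClass(Document doc, String cssClass) {
        NodeList all = doc.getElementsByTagName("*"); // "*" matches every element
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            // The class attribute may hold several whitespace-separated tokens
            for (String token : el.getAttribute("class").split("\\s+")) {
                if (token.equals(cssClass)) {
                    return Optional.of(el.getTextContent().trim());
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws Exception {
        // Reduced copy of the logged output (assumption: the real DOM has the same shape)
        String sample = "<DIV id=\"J_PromoBox\"><EM class=\"tm-yen\">\u00a5</EM>"
                + "<STRONG class=\"J_CurPrice\">129.00</STRONG></DIV>";
        Document doc = parse(sample);
        System.out.println(textByClass(doc, "J_CurPrice").orElse("not found")); // prints 129.00
    }
}
```

If the element is still missing when the state becomes SUCCEEDED, the lookup returns an empty Optional, which makes it easy to tell "not loaded yet" apart from a parsing mistake.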

It sounds like you want the rendered DOM of a dynamic page after the JavaScript on the page has finished modifying the original HTML. This is not easy to do in Java, as you would need to implement browser-like functionality with an embedded JavaScript engine. If you only care about reading a web page from Java, you might want to look into Selenium, since it takes control of a browser and lets you pull the rendered HTML into Java.

This answer might also help:

Render JavaScript and HTML in (any) Java Program (Access rendered DOM Tree)?

