
How to get the HTML of a fully loaded page (after JavaScript has run) as input in Java?

I need to parse a page. Everything works except that some elements on the page are loaded dynamically. I used jsoup for the static elements, and when I realized that I really need the dynamic ones too, I tried JavaFX. I read a lot of answers on Stack Overflow, and many of them recommended the JavaFX WebEngine, so I ended up with this code:

@Override
public void start(Stage primaryStage) {
    WebView webview = new WebView();
    final WebEngine webengine = webview.getEngine();
    webengine.getLoadWorker().stateProperty().addListener(
            new ChangeListener<State>() {
                @Override
                public void changed(ObservableValue<? extends State> ov, State oldState, State newState) {
                    if (newState == Worker.State.SUCCEEDED) {
                        // The DOM as it stands once the load worker reports success
                        Document doc = webengine.getDocument();
                        // Serialize the DOM into a String
                        OutputFormat format = new OutputFormat(doc);
                        StringWriter stringOut = new StringWriter();
                        XMLSerializer serial = new XMLSerializer(stringOut, format);
                        try {
                            serial.serialize(doc);
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                        // Print the serialized markup
                        System.out.println(stringOut.toString());
                    }
                }
            });
    webengine.load("http://detail.tmall.com/item.htm?spm=a220o.1000855.0.0.PZSbaQ&id=19378327658");
    primaryStage.setScene(new Scene(webview, 800, 800));
    primaryStage.show();
} 

I serialized the org.w3c.dom.Document to a string and printed it, but that did not help either. primaryStage.show() displayed the fully loaded page, with the element I need rendered on it, yet that element was missing from the HTML in the output.

This is the third day I have been working on this issue. Lack of experience is of course my main problem, but I have to admit that I'm stuck. This is my first Java project after reading Java: The Complete Reference. I am doing it to get some real-world experience (and for fun). I want to write a parser for the Chinese "eBay".

Here is the problem and my test cases:

http://detail.tmall.com/item.htm?spm=a220o.1000855.0.0.PZSbaQ&id=19378327658 : I need to get the dynamically loaded discount "129.00"

http://item.taobao.com/item.htm?spm=a230r.1.14.67.MNq30d&id=22794120348 : I need "15.20"

As you can see, if you open these pages in a browser, you first see the original price and, after a second or so, the discount.

Is it even possible to get these dynamic discounts from the HTML page? The other elements I need to parse are static. What should I try next: another library that renders HTML with JavaScript, or maybe something else? I really need some advice and don't want to give up.

The DOM returned after Worker.State.SUCCEEDED should already have been processed by JavaScript.

Your code worked for me when tested with FX 7u40 and 8.0 dev. I see the following output in the log:

<DIV id="J_PromoBox"><EM class="tb-promo-price-type">夏季新品</EM><EM class="tm-yen">¥</EM>    
<STRONG class="J_CurPrice">129.00</STRONG></DIV>

which is the dynamically loaded box with the data (129.00) you were looking for.
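As a side note, output like this can also be produced without the OutputFormat and XMLSerializer classes. The following is a minimal sketch, assuming the standard javax.xml.transform API is acceptable for the job; the class name DomToString, the helper names, and the small stand-in document are hypothetical and only mimic the box shown above. In the real listener the Document would come from webengine.getDocument().

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomToString {

    /** Serializes a DOM tree to a String with the JDK's identity transformer. */
    public static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Leave out the <?xml ...?> header so only the markup is returned
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize the document", e);
        }
    }

    /** Builds a hypothetical stand-in for the document returned by the WebEngine. */
    public static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element box = doc.createElement("DIV");
            box.setAttribute("id", "J_PromoBox");
            Element price = doc.createElement("STRONG");
            price.setAttribute("class", "J_CurPrice");
            price.setTextContent("129.00");
            box.appendChild(price);
            doc.appendChild(box);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build the sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(sampleDocument()));
    }
}
```

Inside the listener, the call would be serialize(webengine.getDocument()) in place of the serializer block.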

You may want to upgrade your JDK to 7u40 or revisit your log parsing algorithm.
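If the log parsing is the weak spot, one option is to skip the printed text and read the value from the DOM itself. The sketch below is only an illustration: PriceLookup, findText, and the stand-in document are hypothetical names, and the tag and class names are taken from the log output above. It assumes the class attribute holds exactly one class and that the tag name matches the case used in the document.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class PriceLookup {

    /** Returns the text of the first element with the given tag and class, or null if none matches. */
    public static String findText(Document doc, String tag, String cssClass) {
        NodeList nodes = doc.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            Element el = (Element) nodes.item(i);
            // Exact match only; an element carrying several classes would need a token check
            if (cssClass.equals(el.getAttribute("class"))) {
                return el.getTextContent().trim();
            }
        }
        return null;
    }

    /** Builds a hypothetical stand-in for the document returned by the WebEngine. */
    public static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element box = doc.createElement("DIV");
            box.setAttribute("id", "J_PromoBox");
            Element price = doc.createElement("STRONG");
            price.setAttribute("class", "J_CurPrice");
            price.setTextContent("129.00");
            box.appendChild(price);
            doc.appendChild(box);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not build the sample document", e);
        }
    }

    public static void main(String[] args) {
        // prints 129.00
        System.out.println(findText(sampleDocument(), "STRONG", "J_CurPrice"));
    }
}
```

In the listener, the equivalent call would be findText(webengine.getDocument(), "STRONG", "J_CurPrice"), with a null result meaning the box is not in the DOM at that moment.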

It sounds like you want the rendered DOM of a dynamic page after the JavaScript on the page has finished modifying the original HTML. This would not be easy to do in plain Java, as you would need to implement browser-like functionality with an embedded JavaScript engine. If you only care about reading a web page from Java, you might want to look into Selenium, since it takes control of a browser and lets you pull the rendered HTML into Java.

This answer might also help:

Render JavaScript and HTML in (any) Java Program (Access rendered DOM Tree)?
