
How to dissect an HTML page in Java, to pick out certain elements?

For reasons that I don't want to go into for the purpose of this question, I have a Java class that posts an HTML form, and reads in the response.

A small snippet of the response I'm getting is :

<div class="main_box">

  <table width="100%" border="0" cellspacing="4" cellpadding="4" class='results'>
    <tr>
        <td colspan="3" class="title">Free Car ID Check Results</td>
    </tr>
    <tr>
        <td class='title' width='34%'>Vehicle Registration Number</td>
        <td width="43%">ABC123</td>
        <td width="23%" rowspan="4" valign="top"><p align="center"><img src="/media/FORD.jpg" alt="FORD" /></p>
        <p>      </p></td>
    </tr>
    <tr>
        <td  class='title'>Make</td>
        <td>FORD</td>
    </tr>
    <tr>
        <td class='title'>Model</td>
        <td>ESCORT</td>
    </tr>
    <tr>
        <td class='title'>Colour</td>
        <td>BLUE</td>
    </tr>
  </table>

</div>

What would be the easiest, most robust way of picking out the make, model and colour from this? This is only a small portion of the input stream I'm reading, and I can't guarantee that the HTML elements outside of this will remain the same, since the web page may change.

Thanks

Use an HTML parser like JSoup. It allows you to easily read the document and select elements.

E.g.:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("http://url").get();
Elements elements = doc.select("div.main_box td.title");
for (Element anElement : elements) {
    Element value = anElement.nextElementSibling(); // null for the full-width title cell
    if (value != null) {
        System.out.println(anElement.text() + ": " + value.text());
    }
}

The "easy" way, which I have used to grab data from web sites, is to carefully analyze their HTML, then just search for something distinctive like ">Make<" , then search for the next "<td>" , then the next "</td>" , and grab what's in between.

This is obviously highly non-robust if they have any escape characters, if there is more than one instance of ">Make<", etc. Or when they change their output in the future.

However, the "robust" methods with fancy XHTML parsers etc. typically assume that the web site is serving back well formed HTML or XHTML . In my experience, nobody serves back well formed HTML . :-( Well, not many... Arguably, my quick and dirty way is more robust than using a real parser.

PS: For those SO experts who will offer up real answers with real parsers, please describe how they handle poorly formed HTML, as I had real problems with that...

In a comment, I promised @his that I would try JSoup and compare it to my more hacky "just search for >Make<" style code (which is in a small class I wrote called HTMLGrabber).

First, I found JSoup easy to use; it handled at least one of the lousy HTML files I tested (there are three more to be tested). The resulting code was similar in length to the HTMLGrabber code: a bit longer, but not bad. And HTMLGrabber isn't nearly as simple as I remember, as I've added some unescape/escape code, minor support for attributes, etc...
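
For example, JSoup is built to cope with the kind of tag soup that trips up strict XML parsers; it recovers structure much the way a browser would. A quick check (assuming a jsoup dependency on the classpath):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Unquoted attributes, no </td>, no </tr>, no </table> -- jsoup still
// recovers a sensible DOM rather than rejecting the input.
Document doc = Jsoup.parse("<table><tr><td class=title>Make<td>FORD");
System.out.println(doc.select("td").last().text()); // prints "FORD"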

Any "scraping" approach is ultimately non-robust if the web site changes dramatically.

The "advantages" of the HTMLGrabber style code is that you are searching based directly on the content. In the car code example, you'd probably jump first to "Free Car ID Check Results", then look for ">Make<", then "<td>" and grab the text before the next "</td>" , yielding "FORD". Then similar for ">Model<" and ">Color<". Why is this possibly an "advantage"? If the structure of the HTML changes, eg it's not in a table anymore, or more rows are added, this approach might still work. ie, it is "more robust" (but still far from perfect) in the face of structural changes in the HTML.

The advantage of the JSoup/"real parser" approach is that it handles goofy escape characters, plus, normally (at least the way I would code it, YMMV), you would be following the structure of the HTML, at least in part, to find the things you want. In the car example, you'd look for the div element with class "main_box", then find the table, then the rows, etc. This approach is more robust in the face of content changes. For example, when your website gets bought out, and "Free Car ID Check Results" changes to "Facebook Car ID Check Results", this would still work. (Note that nothing is perfect; if "main_box" changed to "primary_box" you'd be in trouble.)
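
A rough sketch of that structure-following style with JSoup (the row-walking and the map are my own framing, not a fixed recipe; doc is parsed as in the first answer):

import java.util.HashMap;
import java.util.Map;
import org.jsoup.nodes.Element;

// Walk the rows of the results table and pair each label cell with the
// cell that follows it, keyed by the label text.
Map<String, String> fields = new HashMap<>();
for (Element row : doc.select("div.main_box table.results tr")) {
    Element label = row.selectFirst("td.title");
    if (label != null && label.nextElementSibling() != null) {
        fields.put(label.text(), label.nextElementSibling().text());
    }
}
String make = fields.get("Make"); // "FORD"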

I have no idea if content or structural changes are more frequent in the random websites people are scraping. Anybody have any stats or experiences there?

In summary, I found JSoup "easy enough" that I will use it most of the time in the future, since I suspect that it is, in general, more robust. But, for many web sites, the "just grab it" approach may be superior.

ADDENDUM: For two of my web pages, the HTML was so jumbled that, even though JSoup managed to parse it, using JSoup to navigate down through the DOM proved so difficult that I stuck with the quick-and-dirty approach instead.

try this "http://developer.yahoo.com/dotnet/howto-xml_vb.html" it is in microsoft language but it can be usefull if you'r willing to translate from one language to an other. Good Luck!
