Get all nodes with HTMLParser in Java

Question

I need to get all the elements of an HTML file, because I have to represent them in a tree. The problem is that I can only obtain the first node, the html node.

I am programming in Java with the HTMLParser library.

My code is:

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

class Principal
{
    public static void main (String[] args) {
        try {
            Parser parser = new Parser("http://www.marca.com");
            NodeList list = parser.parse(null);
            for (int i = 0; i < list.size(); i++) {
                Node node = list.elementAt(i);
                System.out.println(node.getText());
            }
        } catch (ParserException pe) {
            pe.printStackTrace ();
        }
    }
}

I tried with an iterator, but the result was the same.

The execution of the code produces the following result:

!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"


html xmlns="http://www.w3.org/1999/xhtml"

Does anyone know how I can get all the elements of the HTML file?

Answer 1

A tree has several levels. With your approach you are only selecting the nodes at the top level. To print all nodes, you need to descend into the child nodes of every node as well.
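To illustrate the idea of descending into the children, here is a minimal sketch of a recursive depth-first walk. It deliberately uses a small hypothetical `SimpleNode` class instead of the HTMLParser types so that it runs on its own; `TreeWalk`, `SimpleNode`, `collect` and `collectAll` are names made up for this example. With HTMLParser the same pattern should apply by calling `getChildren()` on each `Node` and, as far as I know, checking the returned `NodeList` for `null` before looping over it.

```java
import java.util.ArrayList;
import java.util.List;

class TreeWalk {
    // Hypothetical stand-in for a parsed HTML node: a label plus its children.
    static class SimpleNode {
        final String text;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String text) {
            this.text = text;
        }

        // Returns this node so a sample tree can be built by chaining calls.
        SimpleNode add(SimpleNode child) {
            children.add(child);
            return this;
        }
    }

    // Pre-order walk: record the current node, then recurse into each child.
    static void collect(SimpleNode node, int depth, List<String> out) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  "); // two spaces of indentation per tree level
        }
        out.add(line.append(node.text).toString());
        for (SimpleNode child : node.children) {
            collect(child, depth + 1, out);
        }
    }

    // Convenience wrapper that starts the walk at the root (depth 0).
    static List<String> collectAll(SimpleNode root) {
        List<String> out = new ArrayList<>();
        collect(root, 0, out);
        return out;
    }

    public static void main(String[] args) {
        SimpleNode root = new SimpleNode("html")
            .add(new SimpleNode("head").add(new SimpleNode("title")))
            .add(new SimpleNode("body").add(new SimpleNode("p")));
        for (String line : collectAll(root)) {
            System.out.println(line);
        }
    }
}
```

The loop in the question corresponds to visiting only the root level; the recursive call is what reaches the nested nodes. The indentation per level is just one way to make the tree structure visible when printing.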

Answer 2

I think you should use jsoup. Example:

Document doc = Jsoup.connect("http://www.marca.com").get();
Elements allNodes = doc.getAllElements();

You can find more information here: http://jsoup.org/

Answer 3

After trying different methods, I solved the problem with a recursive call that iterates over the children of each node in the tree.

Thanks for your help

3 answers

solution1
1 2013-10-27 22:18:56

solution2
0 2013-10-28 15:39:18

solution3
0 ACCEPTED 2013-10-28 16:55:32