
Get all nodes with HTMLParser in Java

I need to get all the elements of an HTML file, because I have to represent them as a tree. The problem is that I can only obtain the top-level nodes: the doctype and the html node.

I am programming in Java with the HTMLParser library.

My code is:

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

class Principal
{
    public static void main (String[] args) {
        try {
            Parser parser = new Parser("http://www.marca.com");
            NodeList list = parser.parse(null);
            for (int i = 0; i < list.size(); i++) {
                Node node = list.elementAt(i);
                System.out.println(node.getText());
            }
        } catch (ParserException pe) {
            pe.printStackTrace ();
        }
    }
}

I tried with an iterator, but the result was the same.

The execution of the code produces the following result:

!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"


html xmlns="http://www.w3.org/1999/xhtml"

Does anyone know how I can get all the elements of the HTML file?

A tree has several levels. With your approach you are only selecting the nodes at the top level. To print all the nodes, you also need to descend into the child nodes of each node.
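The idea of visiting a node and then descending into its children can be sketched as follows. This is only an illustration under stated assumptions: it uses the JDK's built-in DOM parser on a small, well-formed XHTML string as a stand-in, so it runs without extra libraries, and the names `TreeWalk`, `collect`, and `elementsOf` are made up for the example. With HTMLParser the same pattern should apply by recursing over `node.getChildren()`, which can be null for nodes without children.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TreeWalk {
    // Record the current node if it is an element, then recurse into every child.
    // The indent grows by two spaces per level so the output mirrors the tree shape.
    static void collect(Node node, String indent, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.add(indent + node.getNodeName());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), indent + "  ", out);
        }
    }

    // Parse a well-formed XHTML string and return one indented line per element.
    // Assumption: the input is valid XML; real-world HTML usually is not.
    static List<String> elementsOf(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            List<String> out = new ArrayList<>();
            collect(doc.getDocumentElement(), "", out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>t</title></head>"
                + "<body><p>a</p><p>b</p></body></html>";
        for (String line : elementsOf(xhtml)) {
            System.out.println(line);
        }
    }
}
```

Collecting the lines into a list instead of printing them directly makes it easier to feed the result into whatever tree widget or model is used for display.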

I think you should use jsoup. Example:

Document doc = Jsoup.connect("http://www.marca.com").get();
Elements allNodes = doc.getAllElements();

For reference, see http://jsoup.org/

After trying different methods, I solved the problem with a recursive call that iterates over the children of each node in the tree.

Thanks for your help
