Extracting tags from an HTML file using Jsoup

Question

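I am doing structural analysis of web documents. For this I only need to extract the structure of a web document (the tags only). I found a Java HTML parser called Jsoup, but I don't know how to use it to extract the tags.

Example: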

<html>
 <head>
    this is head
 </head>
 <body>
    this is body
 </body>
</html>

Output:

html,head,head,body,body,html

Answer 1

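This sounds like a depth-first traversal:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupDepthFirst {

    // Returns every opening and closing tag name in document order, comma-separated.
    private static String htmlTags(Document doc) {
        StringBuilder sb = new StringBuilder();
        htmlTags(doc.children(), sb);
        return sb.toString();
    }

    // Appends each element's name on entry, recurses into its children,
    // then appends the name again on exit to mark the closing tag.
    private static void htmlTags(Elements elements, StringBuilder sb) {
        for (Element el : elements) {
            if (sb.length() > 0) {
                sb.append(",");
            }
            sb.append(el.nodeName());
            htmlTags(el.children(), sb);
            sb.append(",").append(el.nodeName());
        }
    }

    public static void main(String... args) {
        String s = "<html><head>this is head </head><body>this is body</body></html>";
        Document doc = Jsoup.parse(s);
        System.out.println(htmlTags(doc));
    }
}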

Another solution is to use jsoup's NodeVisitor, as follows:

   SecondSolution ss = new SecondSolution();
   doc.traverse(ss);
   System.out.println(ss.sb.toString());

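The class:

    // Needs: import org.jsoup.nodes.Document; import org.jsoup.nodes.Element;
    //        import org.jsoup.nodes.Node; import org.jsoup.select.NodeVisitor;
    public static class SecondSolution implements NodeVisitor {

        StringBuilder sb = new StringBuilder();

        // Called when a node is first reached, i.e. at its opening tag.
        @Override
        public void head(Node node, int depth) {
            // Document is itself an Element, so it is excluded explicitly.
            if (node instanceof Element && !(node instanceof Document)) {
                if (sb.length() > 0) {
                    sb.append(",");
                }
                sb.append(node.nodeName());
            }
        }

        // Called after all of the node's children have been visited, i.e. at its closing tag.
        @Override
        public void tail(Node node, int depth) {
            if (node instanceof Element && !(node instanceof Document)) {
                sb.append(",").append(node.nodeName());
            }
        }
    }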
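
To see why the head/tail callbacks give the requested order, here is a minimal sketch that needs no jsoup at all. The names `VisitorSketch`, `Tag`, `Visitor`, `traverse`, and `tagSequence` are all hypothetical stand-ins invented for illustration; they only mimic the shape of a visitor-based traversal and are not jsoup's implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class VisitorSketch {

    // Hypothetical stand-in for a parsed element: a tag name plus child elements.
    static final class Tag {
        final String name;
        final List<Tag> children = new ArrayList<>();

        Tag(String name, Tag... kids) {
            this.name = name;
            for (Tag kid : kids) {
                children.add(kid);
            }
        }
    }

    // Hypothetical visitor with the same two-callback shape as the class above.
    interface Visitor {
        void head(Tag tag, int depth); // called when a tag is first reached
        void tail(Tag tag, int depth); // called after all of its children were visited
    }

    // Depth-first walk: head, then the children in order, then tail.
    static void traverse(Tag tag, int depth, Visitor visitor) {
        visitor.head(tag, depth);
        for (Tag child : tag.children) {
            traverse(child, depth + 1, visitor);
        }
        visitor.tail(tag, depth);
    }

    // Collects opening and closing tag names as a comma-separated string.
    static String tagSequence(Tag root) {
        List<String> names = new ArrayList<>();
        traverse(root, 0, new Visitor() {
            @Override public void head(Tag tag, int depth) { names.add(tag.name); }
            @Override public void tail(Tag tag, int depth) { names.add(tag.name); }
        });
        return String.join(",", names);
    }

    public static void main(String[] args) {
        Tag root = new Tag("html", new Tag("head"), new Tag("body"));
        System.out.println(tagSequence(root)); // html,head,head,body,body,html
    }
}
```

Because head fires before a tag's children and tail fires after them, each name is recorded once on the way in and once on the way out, which is the same sequence the recursive version builds.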

Answer 1
2, accepted, 2014-09-19 08:42:15
