Extract HTML from a <!-- --> comment to a closing tag using jsoup (Java)

I have some HTML that looks like this:

<!-- start content -->
<p>Blah...</p>
<dl><dd>blah</dd></dl>

I need to extract the HTML from the comment up to a closing dl tag. That closing dl is the first one after the comment (I'm not sure whether more can follow, but there is never one before it). The HTML between the two varies in length and content and doesn't have any good identifiers.

I see that comments themselves can be selected using #comment nodes, but how would I get the HTML starting from a comment and ending with an HTML close tag as I've described?

Here's what I've come up with. It works, but it's obviously not the most efficient approach.

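    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.jsoup.Jsoup;

    String myDirectoryPath = "D:\\Path";
    File dir = new File(myDirectoryPath);
    // Capture everything between the start comment and the first closing </dl>
    Pattern p = Pattern.compile("<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>");
    for (File child : dir.listFiles()) {
        System.out.println(child.getAbsolutePath());
        String charSet = "UTF-8";
        String innerHtml = Jsoup.parse(child, charSet).select("body").html();
        Matcher m = p.matcher(innerHtml);
        if (m.find()) {
            // Re-parse the captured fragment to strip the tags and keep only the text
            String myText = Jsoup.parse(m.group(1)).text();
            // Append to the combined file; try-with-resources closes the writer
            try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("D:\\Path\\combined.txt", true)))) {
                out.println(myText);
            } catch (IOException e) {
                e.printStackTrace(); // report the failure instead of swallowing it
            }
        }
    }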

If you want to use a regex, something simple like this might do.

As a Java string:

    "<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>"

Expanded (the extra whitespace is for readability only):

    <!-- \s* start \s* content \s* -->
    ([\S\s]*?)
    </ \s* dl \s* >
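
To see what the pattern captures, here is a minimal sketch that runs it against a sample string using only java.util.regex. The class name CommentToDlRegex and the helper extract are made up for this illustration. Note that group 1 stops just before the closing dl tag, so the tag itself is not part of the captured text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only
class CommentToDlRegex {
    // Start comment, lazily captured body, then the first closing </dl>
    private static final Pattern START_TO_DL = Pattern.compile(
            "<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>");

    /** Returns the HTML between the comment and the first closing dl tag, or null if there is no match. */
    static String extract(String html) {
        Matcher m = START_TO_DL.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<p>abc</p>"
                + "<!-- start content -->\n"
                + "<p>Blah...</p>\n"
                + "<dl><dd>blah</dd></dl>"
                + "<p>def</p>";
        // Prints the captured fragment (without the closing </dl>)
        System.out.println(extract(html));
    }
}
```

Because the quantifier is lazy, a second dl later in the document does not extend the match. If you need the closing tag as well, use m.group() and strip the comment, or move the closing tag inside the capture group.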

Here's some example code; it may need further adjustment depending on what you want to do.

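import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

final String html = "<p>abc</p>" // Additional tag before the comment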
        + "<!-- start content -->\n"
        + "<p>Blah...</p>\n"
        + "<dl><dd>blah</dd></dl>"
        + "<p>def</p>"; // Additional tag after the comment

// Since it's not a full HTML document (head / body), you can use the XML parser,
// which does not wrap the fragment in html / body elements
Document doc = Jsoup.parse(html, "", Parser.xmlParser());


for( Node node : doc.childNodes() ) // Iterate over the top-level nodes of the document
{
    if( node.nodeName().equals("#comment") ) // if it's a comment we do something
    {
        // Some output for testing ...
        System.out.println("=== Comment =======");
        System.out.println(node.toString().trim()); // trim() only tidies up the output
        System.out.println("=== Childs ========");


        // Get the siblings of the comment; siblingNodes() does not include the comment itself
        final List<Node> childNodes = node.siblingNodes();

        // Start and end index for the sublist; this skips the nodes before the comment.
        // Because the comment is missing from the list, its own index points at the node right after it.
        final int startIdx = node.siblingIndex();   // Start index: first node after the comment
        final int endIdx = childNodes.size();       // End index: end of the sibling list

        // Iterate over all nodes that follow the comment
        for( Node child : childNodes.subList(startIdx, endIdx) )
        {
            /*
             * Do whatever you have to do with the nodes here ...
             * In this example only Elements (HTML tags) are handled;
             * other nodes, such as text, are skipped.
             */
            if( child instanceof Element )
            {
                Element element = (Element) child;

                /*
                 * Do something with your elements / nodes here ...
                 *
                 * You can skip tags, e.g. 'p', by checking the tag name.
                 */
                System.out.println(element);

                // Stop after processing 'dl'-tag (= closing 'dl'-tag)
                if( element.tagName().equals("dl") )
                {
                    System.out.println("=== END ===========");
                    break;
                }
            }
        }
    }
}

The code is deliberately verbose to make it easier to follow; you can shorten it in places.

And finally, here's the output of this example:

=== Comment =======
<!-- start content -->
=== Childs ========
<p>Blah...</p>
<dl>
 <dd>
  blah
 </dd>
</dl>
=== END ===========

By the way, to get the text of the comment, just cast the node to Comment:

String commentText = ((Comment) node).getData();
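
As noted above, the example can be shortened. One possible condensed version is sketched below. It walks forward with nextSibling() instead of building a sublist, and uses the Comment cast to match the marker text. The class name CommentToDl, the method extract and the marker parameter are made up for this illustration. Like the longer example, it assumes the comment is a top-level node (XML parser, no body wrapper), and it turns pretty-printing off so the markup comes back as written.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

// Hypothetical helper class, for illustration only
class CommentToDl {
    /**
     * Returns the outer HTML of every node after the marker comment,
     * up to and including the first dl element, or null if the comment is not found.
     */
    static String extract(String html, String marker) {
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        doc.outputSettings().prettyPrint(false); // keep the markup as written

        for (Node node : doc.childNodes()) { // top-level nodes only
            // Only a comment whose text matches the marker starts the extraction
            if (node instanceof Comment && ((Comment) node).getData().trim().equals(marker)) {
                StringBuilder out = new StringBuilder();
                for (Node next = node.nextSibling(); next != null; next = next.nextSibling()) {
                    out.append(next.outerHtml());
                    if (next.nodeName().equals("dl")) {
                        break; // stop once the first dl has been included
                    }
                }
                return out.toString();
            }
        }
        return null; // marker comment not found among the top-level nodes
    }

    public static void main(String[] args) {
        String html = "<p>abc</p>"
                + "<!-- start content -->\n"
                + "<p>Blah...</p>\n"
                + "<dl><dd>blah</dd></dl>"
                + "<p>def</p>";
        System.out.println(extract(html, "start content"));
    }
}
```

If the comment sits deeper in a full HTML document, the outer loop would have to search the whole tree rather than only the top-level nodes; the sibling walk itself stays the same.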
