
Extract HTML from <!-- --> comment to a closing tag using jsoup java

I have some HTML that looks like this:

<!-- start content -->
<p>Blah...</p>
<dl><dd>blah</dd></dl>

I need to extract the HTML from the comment to a closing dl tag. The closing dl is the first one after the comment (there may be more after it, but there is never one before it). The HTML between the two varies in length and content and has no good identifiers.

I see that comments themselves can be selected as #comment nodes, but how would I get the HTML that starts at a comment and ends at a closing tag, as described above?

Here's what I've come up with. It works, but it is obviously not the most efficient approach.

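    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    String myDirectoryPath = "D:\\Path";
    File dir = new File(myDirectoryPath);
    // Lazily capture everything between the start comment and the first closing dl tag
    Pattern p = Pattern.compile("<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>");
    String charSet = "UTF-8";
    for (File child : dir.listFiles()) {
        System.out.println(child.getAbsolutePath());
        // Jsoup.parse(File, String) throws IOException, so the enclosing method must declare or handle it
        String innerHtml = Jsoup.parse(child, charSet).select("body").html();
        Matcher m = p.matcher(innerHtml);
        if (m.find()) {
            // Re-parse the captured fragment and reduce it to plain text
            Document doc = Jsoup.parse(m.group(1));
            String myText = doc.text();
            // Append the text to the combined output file; try-with-resources closes the writer
            try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("D:\\Path\\combined.txt", true)))) {
                out.println(myText);
            } catch (IOException e) {
                // error
            }
        }
    }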

If you want to use a regex, something simple like this may do:

 #  "<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>"

 <!-- \s* start \s* content \s* -->
 ([\S\s]*?) 
 </ \s* dl \s* >
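
As a minimal sketch of that pattern in use, the following applies it to a sample string with plain java.util.regex. The class name CommentToDlRegex, the method extract and the sample HTML are made up for illustration. Note that group 1 stops just before the closing dl tag, so the tag itself is not part of the captured text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentToDlRegex {
    // Same pattern as above: lazily capture everything between the
    // "start content" comment and the first closing dl tag.
    private static final Pattern START_TO_DL = Pattern.compile(
            "<!--\\s*start\\s*content\\s*-->([\\S\\s]*?)</\\s*dl\\s*>");

    /** Returns the captured HTML (without the closing dl tag), or null if nothing matches. */
    public static String extract(String html) {
        Matcher m = START_TO_DL.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample input with extra tags before and after the wanted span
        String html = "<p>abc</p><!-- start content -->\n"
                + "<p>Blah...</p>\n"
                + "<dl><dd>blah</dd></dl><p>def</p>";
        System.out.println(extract(html));
    }
}
```

If the closing tag should be part of the result, use m.group() (the whole match, comment included) or move the closing parenthesis of the group behind the dl part of the pattern.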

Here's some example code. It may need further improvements, depending on what you want to do.

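import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

final String html = "<p>abc</p>" // Additional tag before the comment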
        + "<!-- start content -->\n"
        + "<p>Blah...</p>\n"
        + "<dl><dd>blah</dd></dl>"
        + "<p>def</p>"; // Additional tag after the comment

// Since this is a fragment and not a full HTML document (head / body), you may use the XML parser
Document doc = Jsoup.parse(html, "", Parser.xmlParser());


for( Node node : doc.childNodes() ) // Iterate over all top-level nodes of the document
{
    if( node.nodeName().equals("#comment") ) // if it's a comment we do something
    {
        // Some output for testing ...
        System.out.println("=== Comment =======");
        System.out.println(node.toString().trim()); // 'toString().trim()' only tidies the output
        System.out.println("=== Childs ========");


        // Get the siblings of the comment; siblingNodes() does not include the comment itself
        final List<Node> childNodes = node.siblingNodes();

        // Start and end index for the sublist - this is used to skip tags before the actual comment node
        // Because the comment is missing from the list, its own index points at the node that follows it
        final int startIdx = node.siblingIndex();   // Start index - start after (!) the comment node
        final int endIdx = childNodes.size();       // End index - the last following node

        // Iterate over all nodes, following after the comment
        for( Node child : childNodes.subList(startIdx, endIdx) )
        {
            /*
             * Do whatever you have to do with the nodes here ...
             * In this example, they are only used as Element's (Html Tags)
             */
            if( child instanceof Element )
            {
                Element element = (Element) child;

                /*
                 * Do something with your elements / nodes here ...
                 * 
                 * You can skip e.g. 'p'-tag by checking tagnames.
                 */
                System.out.println(element);

                // Stop after processing 'dl'-tag (= closing 'dl'-tag)
                if( element.tagName().equals("dl") )
                {
                    System.out.println("=== END ===========");
                    break;
                }
            }
        }
    }
}

The code is deliberately verbose to make it easier to follow; you can shorten it in places.
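
One possible shortened form, as a sketch: it walks forward from the first top-level comment with nextSibling() and collects the outer HTML of each node up to and including the first dl. The class name CommentToDl and the method extract are hypothetical, and it assumes, like the example above, that the comment and the dl are siblings at the top level of the fragment.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;

public class CommentToDl {
    /** Returns the HTML after the first top-level comment, up to and including the first dl. */
    public static String extract(String html) {
        Document doc = Jsoup.parse(html, "", Parser.xmlParser());
        doc.outputSettings().prettyPrint(false); // keep the markup as written

        // Find the first comment among the top-level nodes
        Node comment = null;
        for (Node node : doc.childNodes()) {
            if (node instanceof Comment) {
                comment = node;
                break;
            }
        }

        StringBuilder sb = new StringBuilder();
        if (comment == null) {
            return sb.toString(); // no comment found, nothing to extract
        }

        // Walk the following siblings and stop after the first dl element
        for (Node next = comment.nextSibling(); next != null; next = next.nextSibling()) {
            sb.append(next.outerHtml());
            if (next instanceof Element && ((Element) next).tagName().equals("dl")) {
                break;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p>abc</p><!-- start content -->\n"
                + "<p>Blah...</p>\n"
                + "<dl><dd>blah</dd></dl><p>def</p>";
        System.out.println(extract(html));
    }
}
```

If plain text is wanted instead of markup, the collected string can be passed through Jsoup.parse(...).text(), as in the question's code.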

And finally, here's the output of this example:

=== Comment =======
<!-- start content -->
=== Childs ========
<p>Blah...</p>
<dl>
 <dd>
  blah
 </dd>
</dl>
=== END ===========

By the way, to get the text of the comment, just cast the node to Comment:

String commentText = ((Comment) node).getData();
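
That cast can also be used to pick out one particular comment when a page contains several. The following is only a sketch: the class name FindComment, the method find and the sample HTML are assumptions for illustration. It searches the node tree recursively and compares the trimmed comment text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Comment;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;

public class FindComment {
    /** Depth-first search for the first comment whose trimmed text equals the given text. */
    public static Comment find(Node root, String text) {
        for (Node child : root.childNodes()) {
            if (child instanceof Comment
                    && ((Comment) child).getData().trim().equals(text)) {
                return (Comment) child;
            }
            Comment nested = find(child, text); // look inside this child as well
            if (nested != null) {
                return nested;
            }
        }
        return null; // no matching comment below this node
    }

    public static void main(String[] args) {
        // Hypothetical sample with an unrelated comment before the wanted one
        Document doc = Jsoup.parse(
                "<div><!-- other --><p>x</p><!-- start content --><p>Blah...</p></div>");
        Comment start = find(doc, "start content");
        System.out.println(start != null ? start.getData().trim() : "not found");
    }
}
```

getData() returns the text between the comment delimiters including the surrounding spaces, which is why the sketch trims it before comparing.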
