
Presence of HTML tags using Jsoup

With Jsoup it is easy to count how many times a particular tag appears in a text. For example, here I count how many times the anchor tag appears in the given text:

    String content = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>. <p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
    Document doc = Jsoup.parse(content);
    Elements links = doc.select("a[href]"); // a with href
    System.out.println(links.size());

This gives me a count of 4. If I have a sentence and want to know whether it contains any HTML tags at all, is that possible with Jsoup? Thank you.

You are possibly better off with a regular expression, but if you really want to use Jsoup, you can match all elements and then subtract 4, because Jsoup automatically adds four elements: the root (document) element, plus an <html>, a <head> and a <body> element.

This might loosely look like:

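    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.select.Elements;

    // attempt to count html elements in string - incorrect code, see below
    public static int countHtmlElements(String content) {
        Document doc = Jsoup.parse(content);
        // "*" matches every element, including the four that Jsoup adds itself
        Elements elements = doc.select("*");
        return elements.size() - 4;
    }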

However, this gives a wrong result if the text itself contains an <html>, <head> or <body> element; compare the results of:

    // gives a correct count of 2 html elements
    System.out.println(countHtmlElements("some <b>text</b> with <i>markup</i>"));
    // incorrectly counts 0 elements, as the body is subtracted
    System.out.println(countHtmlElements("<body>this gives a wrong result</body>"));

So to make this work, you would have to check for the "magic" tags separately; that is why I feel a regular expression might be simpler.

More failed attempts to make this work: using parseBodyFragment instead of parse does not help, as the input gets sanitized in the same way by Jsoup. Likewise, counting with doc.select("body *") saves you the trouble of subtracting 4, but it still yields the wrong count if a <body> is involved. The approach only works in an application where you are sure that no <html>, <head> or <body> elements are present in the strings to be checked.
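For comparison, the regular-expression route mentioned above could look roughly like the sketch below. It uses only java.util.regex. The class name HtmlTagSniffer, its two method names and the pattern itself are my own assumptions for illustration, not anything from Jsoup. The pattern is deliberately simple: it treats anything shaped like <name ...> as a tag, so it does not understand comments, CDATA or a ">" inside an attribute value, and text such as "a<b and c>d" would be reported as containing a tag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Jsoup): a rough, regex-based tag check.
final class HtmlTagSniffer {

    // Opening or self-closing tag: '<', a name starting with a letter,
    // optional attributes, an optional '/', then '>'.
    private static final Pattern OPENING_TAG =
            Pattern.compile("<[A-Za-z][A-Za-z0-9]*(?:\\s[^<>]*)?/?>");

    // Same shape, but also accepts closing tags such as </b>.
    private static final Pattern ANY_TAG =
            Pattern.compile("</?[A-Za-z][A-Za-z0-9]*(?:\\s[^<>]*)?/?>");

    // True if the text contains at least one thing that looks like a tag.
    static boolean containsHtmlTag(String text) {
        return ANY_TAG.matcher(text).find();
    }

    // Counts opening (and self-closing) tags; closing tags are not counted,
    // so the result is comparable to an element count.
    static int countOpeningTags(String text) {
        Matcher matcher = OPENING_TAG.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(containsHtmlTag("just a plain sentence"));                  // false
        System.out.println(containsHtmlTag("1 < 2 and 3 > 2"));                        // false
        System.out.println(countOpeningTags("some <b>text</b> with <i>markup</i>"));   // 2
        System.out.println(countOpeningTags("<body>this gives a wrong result</body>")); // 1
    }
}
```

Unlike the Jsoup version, nothing is added to the input here, so a literal <body> in the string is counted like any other tag and no "magic" tags need special handling.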


 