I have a tag like this in my HTML:
<p class="outter">
<strong class="inner">not needed message</strong>
NEEDED MESSAGE
</p>
I'm trying to extract "NEEDED MESSAGE",
but if I do something like this:
String results = document.select("p.outter").text();
System.out.println(results);
it prints:
not needed messageNEEDED MESSAGE
So the question is:
How can I get the text for a specific tag without the text from its inner tags?
One solution is to select only the TextNode
children of the element. A small snippet follows.
String html = "<p class=\"outter\">\n"
+ " <strong class=\"inner\">not needed message</strong>\n"
+ " NEEDED MESSAGE\n"
+ "</p>";
Document doc = Jsoup.parse(html);
Elements elements = doc.select("p.outter");
for (Element element : elements) {
// as mentioned by luksch
System.out.println("ownText = " + element.ownText());
// or manually based on the node type
for (Node node : element.childNodes()) {
if (node instanceof TextNode) {
System.out.println("node = " + node);
}
}
}
Output (from the manual TextNode loop):
node =
node = NEEDED MESSAGE
So you need to filter the output according to your requirements, e.g. skip the blank text nodes.
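The filtering mentioned above could look roughly like the sketch below. It assumes jsoup is on the classpath; the class name `DirectTextSketch` and the helper `directText` are made up for illustration and are not part of jsoup. It relies on `Element.textNodes()` and `TextNode.isBlank()` to drop whitespace-only nodes.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class DirectTextSketch {

    // Hypothetical helper (not a jsoup API): returns the trimmed text of every
    // non-blank text node that is a direct child of the given element.
    static List<String> directText(Element element) {
        List<String> parts = new ArrayList<>();
        for (TextNode node : element.textNodes()) {
            // Skip nodes that contain only whitespace (e.g. the newline after <p>).
            if (!node.isBlank()) {
                parts.add(node.text().trim());
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String html = "<p class=\"outter\">\n"
                + " <strong class=\"inner\">not needed message</strong>\n"
                + " NEEDED MESSAGE\n"
                + "</p>";
        Document doc = Jsoup.parse(html);
        Element paragraph = doc.select("p.outter").first();
        System.out.println(directText(paragraph));
    }
}
```

Returning a list rather than one joined string keeps the pieces separate when an element has text both before and after an inner tag.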
You can use ownText()
after selecting the paragraph. Example:
package com.stackoverflow.answer;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.jsoup.nodes.Element;
public class HtmlParserExample {
public static void main(String[] args) {
String html = "<p class=\"outter\"><strong class=\"inner\">not needed message</strong>NEEDED MESSAGE</p>";
Document doc = Jsoup.parse(html);
Elements paragraphs = doc.select("p");
for (Element p : paragraphs)
System.out.println(p.ownText());
}
}
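Use jsoup's ownText() method. Note that it is defined on Element, not on Elements, so pick a single element first:
// first() returns the first matching element (or null if nothing matched)
String results = document.select("p.outter").first().ownText();
System.out.println(results);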
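If the selector might match nothing, a small guard avoids a NullPointerException. The following is only a sketch under that assumption: `SafeOwnTextSketch` and `ownTextOrEmpty` are hypothetical names, and returning an empty string for "no match" is one possible convention, not something jsoup prescribes.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SafeOwnTextSketch {

    // Hypothetical helper (not a jsoup API): own text of the first element
    // matching cssQuery, or an empty string when the query matches nothing.
    static String ownTextOrEmpty(Document doc, String cssQuery) {
        Element match = doc.select(cssQuery).first(); // null when there is no match
        return match == null ? "" : match.ownText();
    }

    public static void main(String[] args) {
        String html = "<p class=\"outter\">"
                + "<strong class=\"inner\">not needed message</strong>"
                + "NEEDED MESSAGE</p>";
        Document doc = Jsoup.parse(html);
        System.out.println(ownTextOrEmpty(doc, "p.outter"));
        System.out.println(ownTextOrEmpty(doc, "div.missing").isEmpty());
    }
}
```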