Java Jsoup: Extract all the text
I have the following code. The doc.body().text() statement doesn't output the text within the style and script tags. I read the source of the .text() function, and it looks for all instances of TextNode.
What is a TextNode in Jsoup, and why is the script text not included in the .text() output?
String contex = "<html><body><style>style</style><div>div</div><script>script</script><p>paragraph</p>body</body></html>";
Document doc = Jsoup.parse(contex, "UTF-8");
String text = doc.body().text();
System.out.println("Test text : " + text);
Output: paragraphbody
For this, select the <script> tags into an org.jsoup.select.Elements collection and read their data nodes.
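import org.jsoup.Jsoup;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String contex = "<html><body><style>style</style><div>div</div><script>scripts</script><p>paragraph</p><p>body</p><script>787878</script></body></html>";
// The second String argument of Jsoup.parse is a base URI, not a charset, so it is omitted here.
Document doc = Jsoup.parse(contex);
Elements scriptElements = doc.getElementsByTag("script");
for (Element el : scriptElements) {
    // The contents of a script element are stored as DataNode children.
    for (DataNode dn : el.dataNodes()) {
        System.out.println(dn.getWholeData());
    }
}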
Output:
scripts
787878
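As a shorter variant of the loop above, Element.data() returns the combined contents of an element's data nodes, so the inner loop can be dropped. This is only a sketch: the class name ScriptContents and the helper scriptData are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class ScriptContents {
    // Hypothetical helper: returns the raw contents of every <script> element, in document order.
    static List<String> scriptData(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element script : doc.getElementsByTag("script")) {
            // data() concatenates the element's data nodes, which is where script contents live.
            result.add(script.data());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<html><body><div>div</div><script>scripts</script>"
                + "<p>paragraph</p><script>787878</script></body></html>";
        for (String contents : scriptData(html)) {
            System.out.println(contents);
        }
    }
}
```

The same call works for style elements if "style" is passed to getElementsByTag instead.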
And why is the script text not included in the .text() output?
Because script and style elements hold data, not text.
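To make the difference between the two node kinds visible, the sketch below counts the TextNode and DataNode children of an element and prints what text() and data() return for it. The class NodeKinds and the helper describe are hypothetical names used only for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class NodeKinds {
    // Hypothetical helper: summarizes the first element with the given tag name.
    static String describe(String html, String tagName) {
        Element el = Jsoup.parse(html).getElementsByTag(tagName).first();
        if (el == null) {
            return tagName + ": not found";
        }
        // textNodes() and dataNodes() list the direct children of each kind;
        // text() only collects text nodes, data() only collects data nodes.
        return tagName + ": textNodes=" + el.textNodes().size()
                + ", dataNodes=" + el.dataNodes().size()
                + ", text()=\"" + el.text() + "\""
                + ", data()=\"" + el.data() + "\"";
    }

    public static void main(String[] args) {
        String html = "<p>paragraph</p><script>script</script>";
        System.out.println(describe(html, "p"));
        System.out.println(describe(html, "script"));
    }
}
```

The paragraph has one text node and an empty data(), while the script has one data node and an empty text(), which is why doc.body().text() skips it.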
To get the data out of the script elements, select them with getElementsByTag
Elements scriptElements = doc.getElementsByTag("script");
and read each node's contents with getWholeData
for (Element element : scriptElements) {
    for (DataNode node : element.dataNodes()) {
        System.out.println(node.getWholeData());
    }
    System.out.println("-------------------");
}
As per the source code, the characters inside a style or script tag are inserted as a DataNode instead of a TextNode:
void insert(Token.Character characterToken) {
    Node node;
    // characters in script and style go in as datanodes, not text nodes
    final String tagName = currentElement().tagName();
    final String data = characterToken.getData();

    if (characterToken.isCData())
        node = new CDataNode(data);
    else if (tagName.equals("script") || tagName.equals("style"))
        node = new DataNode(data);
    else
        node = new TextNode(data);
    currentElement().appendChild(node); // doesn't use insertNode, because we don't foster these; and will always have a stack.
}
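Given that split, one way to really extract all the text, including script and style contents, is to walk the tree and collect both node kinds. This is a minimal sketch rather than the library's own method: the class AllTextExtractor and the helper extractAll are hypothetical, and the visitor implements both head and tail so that it also compiles on jsoup versions where tail has no default implementation.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

import java.util.ArrayList;
import java.util.List;

class AllTextExtractor {
    // Hypothetical helper: collects the contents of every TextNode and DataNode under <body>.
    static List<String> extractAll(String html) {
        Document doc = Jsoup.parse(html);
        final List<String> parts = new ArrayList<>();
        doc.body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // Ordinary text (CDataNode is a TextNode subclass, so it is covered too).
                    parts.add(((TextNode) node).text());
                } else if (node instanceof DataNode) {
                    // Contents of script and style elements.
                    parts.add(((DataNode) node).getWholeData());
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return parts;
    }

    public static void main(String[] args) {
        String html = "<html><body><style>style</style><div>div</div>"
                + "<script>script</script><p>paragraph</p>body</body></html>";
        System.out.println(String.join(" ", extractAll(html)));
    }
}
```

Returning a list instead of one string leaves the choice of separator to the caller, since text() normally decides spacing based on block elements and that logic is not reproduced here.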