How to filter noise in nested tags in JSoup? (Java)
How to filter noise in nested tags? For example, I have this input:
[in:]
<html>
<source>
<noise>something something, many many things</noise>foo bar bar<noise>more something something noise</noise>baring foo
</source>
</html>
and I need to get this output:
[out:]
foo bar bar
baring foo
I have tried this, but I am still getting the noise from the nested tags:
import java.io.*;
import java.util.List;
import org.apache.commons.io.IOUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
public class HelloJsoup {
    public static void main(String[] args) throws IOException {
        String br = "<html><source><noise>something something, many many things</noise>foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
        Document doc = Jsoup.parse(br, "", Parser.xmlParser());
        //System.out.println(doc);
        for (Element sentence : doc.getElementsByTag("source"))
            System.out.print(sentence.text());
    }
}
[out:]
something something, many many thingsfoo bar barmore something something noisebaring foo
By removing the noise tags first, you are left with <source>foo bar barbaring foo</source>. To get the output you specified, though, you can iterate through the remaining child nodes and print each TextNode on a new line. For example:
// Additional imports needed: org.jsoup.nodes.Node, org.jsoup.select.Elements
String br = "<html><source><noise>something something, many many things</noise>foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
Document doc = Jsoup.parse(br, "", Parser.xmlParser());
Element source = doc.select("source").first(); // select the source element
Elements noise = doc.select("noise"); // select the noise elements
for (Element e : noise) { // loop through and remove each one from the document
    e.remove();
}
for (Node node : source.childNodes()) {
    System.out.println(node); // print each remaining text node on a new line
}
Outputs:
foo bar bar
baring foo
Update
I found this to be an even simpler method:
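// Additional import needed: org.jsoup.nodes.TextNode
Element source = doc.select("source").first(); // select the source element
for (TextNode node : source.textNodes()) {
    System.out.println(node); // print each directly owned text node on its own line
}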
It iterates through the text nodes owned directly by the <source> element and prints each one on a new line. Output is:
foo bar bar
baring foo
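For reference, the textNodes() approach above can be wrapped into a complete class. This is only a minimal sketch: the class name DirectTextDemo and the helper directTextLines are hypothetical names chosen for illustration, and it assumes jsoup is on the classpath. It calls node.text() rather than printing the node itself, on the assumption that plain text (not the node's HTML form) is what is wanted, and it skips whitespace-only nodes so that input formatted across several lines behaves the same way.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

class DirectTextDemo {

    // Hypothetical helper: collects the text of every text node that is a
    // direct child of the first element matching tagName.
    static List<String> directTextLines(String xml, String tagName) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        List<String> lines = new ArrayList<>();
        Element element = doc.select(tagName).first();
        if (element == null) {
            return lines; // no matching element, nothing to collect
        }
        for (TextNode node : element.textNodes()) {
            if (!node.isBlank()) { // skip nodes that contain only whitespace
                lines.add(node.text().trim());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String xml = "<html><source><noise>something something, many many things</noise>"
                + "foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
        for (String line : directTextLines(xml, "source")) {
            System.out.println(line);
        }
    }
}
```

Because the text inside the <noise> elements belongs to those elements and not to <source>, it never shows up in textNodes(), so nothing has to be removed from the document first.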
Try:
System.out.println(sentence.ownText());
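To see how ownText() differs from the text() call in the question, here is a small sketch that prints both for the same element. The class name OwnTextDemo and the helper ownTextOf are hypothetical, and jsoup is assumed to be on the classpath. Note that ownText() joins the element's own text nodes into one string with no separator between them, so it drops the noise but does not put each fragment on its own line.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

class OwnTextDemo {

    // Hypothetical helper: returns the own text of the first element matching
    // tagName, or an empty string if there is no such element.
    static String ownTextOf(String xml, String tagName) {
        Element element = Jsoup.parse(xml, "", Parser.xmlParser()).select(tagName).first();
        return element == null ? "" : element.ownText();
    }

    public static void main(String[] args) {
        String xml = "<html><source><noise>something something, many many things</noise>"
                + "foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
        Element source = Jsoup.parse(xml, "", Parser.xmlParser()).select("source").first();
        System.out.println(source.text());            // includes the text of the nested <noise> elements
        System.out.println(ownTextOf(xml, "source")); // only the text owned directly by <source>
    }
}
```

If each fragment needs to be on its own line, iterating over textNodes() is the better fit; ownText() is enough when a single noise-free string will do.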