How to filter noise in nested tags in JSoup? (Java)

Question

How do I filter out noise in nested tags? For example, I have this input:

[in:]

<html>
  <source>
     <noise>something something, many many things</noise>foo bar bar<noise>more something something noise</noise>baring foo
  </source>
</html>

and I need to get this output:

[out:]

foo bar bar
baring foo

I have tried this, but I am still getting the noise from the nested tags:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class HelloJsoup {
    public static void main(String[] args) {

        String br = "<html><source><noise>something something, many many things</noise>foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
        Document doc = Jsoup.parse(br, "", Parser.xmlParser());
        //System.out.println(doc);
        for (Element sentence : doc.getElementsByTag("source"))
            System.out.print(sentence.text()); // text() also includes the text of nested elements

    }
}

[out:]

something something, many many thingsfoo bar barmore something something noisebaring foo

Answer 1

If you remove the noise tags first, you are left with <source>foo bar barbaring foo</source>. To get the output you specified, though, you can iterate over the remaining child nodes and print each TextNode on a new line. For example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

String br = "<html><source><noise>something something, many many things</noise>foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
Document doc = Jsoup.parse(br, "", Parser.xmlParser());

Element source = doc.select("source").first(); // select the source element

Elements noise = doc.select("noise");          // select the noise elements
for (Element e : noise) {                      // loop through and remove each from the doc
    e.remove();
}

for (Node node : source.childNodes()) {
    System.out.println(node);                  // print each remaining text node on a new line
}

Output:

foo bar bar
baring foo

Update

I found this to be an even simpler method:

// also needs: import org.jsoup.nodes.TextNode;
Element source = doc.select("source").first(); // select the source element

for (TextNode node : source.textNodes()) {     // only text nodes that are direct children
    System.out.println(node);
}

It iterates through the text nodes owned directly by the <source> element and prints each one on a new line. Output:

foo bar bar
baring foo
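
To make "owned directly" concrete, here is a minimal sketch of the same filtering done by hand with the JDK's built-in DOM parser instead of jsoup. This is a swapped-in technique for illustration only, not what the answer above uses; the class name DirectTextDemo and the helper directText are hypothetical. It keeps only the text nodes that are direct children of the first matching element, which is roughly what textNodes() does for you in jsoup.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DirectTextDemo {

    // Hypothetical helper (not part of jsoup): returns the trimmed, non-empty
    // text of every text node that is a DIRECT child of the first <tagName>
    // element. Text inside nested elements such as <noise> is skipped.
    static List<String> directText(String xml, String tagName) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        List<String> result = new ArrayList<>();
        NodeList matches = doc.getElementsByTagName(tagName);
        if (matches.getLength() == 0) {
            return result; // no such element: nothing to report
        }

        NodeList children = matches.item(0).getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue().trim();
                if (!text.isEmpty()) {
                    result.add(text);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><source><noise>something something, many many things</noise>"
                + "foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
        for (String line : directText(xml, "source")) {
            System.out.println(line); // one direct text node per line
        }
    }
}
```

Unlike jsoup, the JDK parser requires well-formed XML, so this sketch only suits input like the example in the question.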

Answer 2

Try:

System.out.println(sentence.ownText());

2 answers

Answer 1
4 votes, accepted, 2014-02-10 14:00:13

Answer 2
0 votes, 2014-02-10 13:33:12
