
How to filter noise in nested tags in JSoup? (Java)

How can I filter out the noise in nested tags? For example, I have this input:

[in:]

<html>
  <source>
     <noise>something something, many many things</noise>foo bar bar<noise>more something something noise</noise>baring foo
  </source>
</html>

I need to get the following output:

[out:]

foo bar bar
baring foo

I have tried this, but I still get the noise from the nested tags:

import java.io.*;
import java.util.List;

import org.apache.commons.io.IOUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class HelloJsoup {
    public static void main(String[] args) throws IOException {

        String br = "<html><source><noise>something something, many many things</noise>foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
        Document doc = Jsoup.parse(br, "", Parser.xmlParser());
        //System.out.println(doc);
        for (Element sentence : doc.getElementsByTag("source"))
            System.out.print(sentence.text());

    }
}

[out:]

something something, many many thingsfoo bar barmore something something noisebaring foo

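First remove the noise tags, which leaves <source>foo bar barbaring foo</source>. To get the specified output, though, you can iterate over the remaining child nodes and print each TextNode on its own line. For example: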

// Additional imports needed: org.jsoup.nodes.Node, org.jsoup.select.Elements
String br = "<html><source><noise>something something, many many things</noise>foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
Document doc = Jsoup.parse(br, "", Parser.xmlParser());

Element source = doc.select("source").first(); // select the source element

Elements noise = doc.select("noise");          // select the noise elements
for (Element e : noise) {                      // loop through and remove each one from the doc
    e.remove();
}

for (Node node : source.childNodes()) {
    System.out.println(node);                  // print each remaining text node on a new line
}

Output:

foo bar bar
baring foo

Update

I found that this is a simpler way:

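// Additional import needed: org.jsoup.nodes.TextNode
Element source = doc.select("source").first(); // select the source element

for (TextNode node : source.textNodes()) {     // only the text nodes directly under <source>
    System.out.println(node);
}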

It iterates over the text nodes that the <source> element owns directly and prints each one on a new line. The output is:

foo bar bar
baring foo
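
For reference, the textNodes() approach can be put together as a complete program. This is only a sketch under a few assumptions: the class name OwnTextLines and the helper ownTextLines are made up for illustration, and the helper calls TextNode.text() and trims the result so that whitespace-only text nodes (which appear when the input is indented, as in the question) are skipped rather than printed as blank lines.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;

class OwnTextLines {

    // Hypothetical helper (not part of jsoup): returns the text of each text node
    // that sits directly under the first <source> element, ignoring nested tags.
    static List<String> ownTextLines(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element source = doc.select("source").first();

        List<String> lines = new ArrayList<>();
        if (source == null) {
            return lines; // no <source> element in the input
        }
        for (TextNode node : source.textNodes()) {
            String text = node.text().trim(); // whitespace-normalised, then trimmed
            if (!text.isEmpty()) {            // skip indentation-only text nodes
                lines.add(text);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Indented input, similar to the layout shown in the question.
        String xml = "<html>\n"
                + "  <source>\n"
                + "     <noise>something something, many many things</noise>foo bar bar"
                + "<noise>more something something noise</noise>baring foo\n"
                + "  </source>\n"
                + "</html>";
        for (String line : ownTextLines(xml)) {
            System.out.println(line);
        }
    }
}
```

Collecting the lines into a list instead of printing the nodes directly also avoids relying on how a TextNode renders itself when passed to println.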

Try:

System.out.println(sentence.ownText());
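
A note on this suggestion: ownText() does drop the text inside the nested noise tags, but it returns the element's own text as a single string, so the two fragments are not separated by a line break. The sketch below illustrates this; the class name OwnTextDemo and the helper sourceOwnText are hypothetical names used only for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

class OwnTextDemo {

    // Hypothetical helper (not part of jsoup): the own text of the first
    // <source> element, or an empty string if there is no such element.
    static String sourceOwnText(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element source = doc.select("source").first();
        return source == null ? "" : source.ownText();
    }

    public static void main(String[] args) {
        String xml = "<html><source><noise>something something, many many things</noise>"
                + "foo bar bar<noise>more something something noise</noise>baring foo</source></html>";
        // The noise is gone, but the two fragments come back joined in one string.
        System.out.println(sourceOwnText(xml));
    }
}
```

If each fragment is needed on its own line, iterating over textNodes() is the better fit; ownText() is enough when one combined string is acceptable.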

