Jsoup parser removes words with '<' and '>'

I'm using Jsoup.parse() to remove HTML tags from a String. But my string also contains a word like <name>.

The problem is that Jsoup.parse() removes that too, because the text contains < and >. I can't simply remove < and > from the text either. How can I do this?

String s1 = Jsoup.parse("<p>Hello World</p>").text();
// s1 is "Hello World". Correct.

String s2 = Jsoup.parse("<name>").text();
// s2 is "". But it should be "<name>", because <name> is not an HTML tag.

"I'm using the Jsoup.parse() to remove html tags from a String."

You want to use the Jsoup#clean method. You'll also need a little manual work afterwards, because Jsoup will still treat <name> as an HTML tag.

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist; // newer jsoup releases rename this class to Safelist

// Define the list of words to preserve...
String[] myExceptions = { "name" };

// Build a whitelist for Jsoup that lets those words through as tags...
Whitelist myWhiteList = Whitelist.simpleText().addTags(myExceptions);

// Let Jsoup remove every other HTML tag...
String s2 = Jsoup.clean("<name>", myWhiteList);

// Jsoup closes each preserved tag (<name></name>), so drop the closing tags it added...
for (String word : myExceptions) {
    s2 = s2.replace("</" + word + ">", "");
}

System.out.println(">>" + s2);

Output

>><name>
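
If listing every word to preserve is impractical, the inverse idea can be sketched without Jsoup: strip only tags whose names appear in a list of known HTML tags, and leave everything else (such as <name>) untouched. This is a rough sketch under stated assumptions, not a replacement for a real HTML parser. The class name KnownTagStripper, the method stripKnownTags, and the short KNOWN_TAGS list are hypothetical and would need to be extended for real input; the regex also does not handle comments, scripts, or a '>' inside attribute values.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KnownTagStripper {
    // Hypothetical, deliberately short list of names treated as real HTML tags.
    private static final Set<String> KNOWN_TAGS = new HashSet<>(Arrays.asList(
            "p", "b", "i", "em", "strong", "br", "div", "span", "a"));

    // Matches an opening or closing tag and captures its name in group 1.
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)\\b[^>]*>");

    static String stripKnownTags(String input) {
        Matcher m = TAG.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1).toLowerCase();
            // Drop tags from the known list; re-emit anything else unchanged.
            String replacement = KNOWN_TAGS.contains(name) ? "" : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripKnownTags("<p>Hello World</p>"));
        System.out.println(stripKnownTags("<name>"));
    }
}
```

Unlike the Jsoup#clean approach, this does not repair malformed markup or decode entities; it only deletes the listed tags, so it suits input that is mostly plain text with a few simple tags.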
