
How can I remove HTML tags in Java?

I need to remove the HTML tags from the following string in Java:

String text = "<html><head></head><body>hi x>a and y<b and z>c</body></html>";

I can do this with regular expressions, but that also removes the "b and z" in the string, because it is considered a tag.

Of course it will remove "b and z". It is supposed to remove that text, because in HTML attributes do not have to be quoted and they do not need values. So b is an element, and "and" and "z" are attributes (without values). That is what an HTML parser would recognize.

Of course, "and" and "z" are not really acceptable attributes for the b element, but in terms of syntactic well-formedness you should recognize the b as an element.
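To make this concrete, here is a minimal sketch. The question does not show the regular expression that was used, so the pattern `<[^>]*>` and the names `TagStripDemo` and `stripTags` are assumptions chosen for illustration. The sketch shows that the pattern treats `<b and z>` like any other tag, while the stray `>` after x is left alone.

```java
public class TagStripDemo {

    // Assumed pattern: a '<', then any run of characters that are not '>', then '>'.
    // This is a naive tag matcher, not a full HTML parser.
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String text = "<html><head></head><body>hi x>a and y<b and z>c</body></html>";
        // "<b and z>" matches the pattern, so it is removed along with the real tags.
        System.out.println(stripTags(text)); // prints: hi x>a and yc
    }
}
```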

If you do not want that removed, you need to write your < as &lt;. That is how to write correct HTML anyway. :)
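As a sketch of that suggestion: if the input writes the literal less-than sign as `&lt;`, the same naive pattern no longer sees a tag there. The class and method names below are hypothetical, the pattern `<[^>]*>` is again an assumption, and the final unescaping step only handles three common entities, so it is an illustration rather than a complete entity decoder.

```java
public class EscapedLtDemo {

    // Strip tags with the assumed naive pattern, then turn a few common
    // entities back into characters. "&amp;" is decoded last so that text
    // such as "&amp;lt;" is not decoded twice.
    public static String stripAndUnescape(String html) {
        String noTags = html.replaceAll("<[^>]*>", "");
        return noTags
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // Same string as in the question, but with the literal '<' written as "&lt;".
        String text = "<html><head></head><body>hi x>a and y&lt;b and z>c</body></html>";
        System.out.println(stripAndUnescape(text)); // prints: hi x>a and y<b and z>c
    }
}
```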

ADDENDUM

(Yes, I am aware of the famous "can't parse HTML with a regex" answer cited above in the comment, but the < vs &lt; in the question was worth pointing out in an answer, IMHO.)
