Java find html tag

Hi, I am trying to delete an HTML tag from a string. The tag I am trying to delete is:

<td class="gutter"> text text </td>

I tried the following, but nothing worked:

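import java.util.regex.Matcher;
import java.util.regex.Pattern;

String regex = "<td class=\"gutter\">([^<]*)</td>";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(htmlstring); // htmlstring holds the HTML to search
boolean found = m.find();          // m.matches() was tried as well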

But I can't seem to find it at all... What am I doing wrong?

You can't use regular expressions to work with HTML (or XML). It is impossible to do it right (not "hard", but technically impossible). Use an HTML parser like Jsoup. Then it is easy; just follow the docs.
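To see how fragile the pattern from the question is, here is a minimal sketch using only `java.util.regex`. The class name `GutterRegexDemo`, the helper methods, and the three sample strings are made up for illustration; only the regex itself comes from the question. It suggests two likely reasons for "nothing worked": `matches()` succeeds only when the whole input is the tag, and `find()` fails as soon as the cell contains another tag or the attributes differ from the literal text in the pattern.

```java
import java.util.regex.Pattern;

public class GutterRegexDemo {
    // The pattern from the question, unchanged.
    static final Pattern GUTTER =
            Pattern.compile("<td class=\"gutter\">([^<]*)</td>");

    // find() looks for the pattern anywhere inside the input.
    static boolean foundAnywhere(String html) {
        return GUTTER.matcher(html).find();
    }

    // matches() succeeds only if the entire input is the pattern.
    static boolean matchesWhole(String html) {
        return GUTTER.matcher(html).matches();
    }

    public static void main(String[] args) {
        // Hypothetical inputs, not taken from the original question.
        String plain = "<tr><td class=\"gutter\"> text text </td></tr>";
        String nested = "<tr><td class=\"gutter\"> <b>text</b> </td></tr>";
        String reordered = "<tr><td id=\"x\" class=\"gutter\"> text </td></tr>";

        System.out.println(foundAnywhere(plain));     // true
        System.out.println(matchesWhole(plain));      // false: input has more than the tag
        System.out.println(foundAnywhere(nested));    // false: [^<]* stops at <b>
        System.out.println(foundAnywhere(reordered)); // false: attributes differ
    }
}
```

Each failure could be patched with a more elaborate pattern, but real-world HTML keeps producing new variations, which is the argument for using a parser instead.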

If you want to strip tags from HTML, use a library that does that. Don't roll your own HTML parser.

<plug shameless="true">

http://code.google.com/p/owasp-java-html-sanitizer/

A fast and easy to configure HTML Sanitizer written in Java which lets you include HTML authored by third parties in your web application while protecting against XSS.
