How to remove all HTML attributes from the HTML tags in a string

Question

I am trying to take a string that contains HTML, strip out some tags (img, object) entirely, and strip the attributes from all other HTML tags. For example:

<div id="someId" style="color: #000000">
   <p class="someClass">Some Text</p>
   <img src="images/someimage.jpg" alt="" />
   <a href="somelink.html">Some Link Text</a>
</div>

Would become:

<div>
   <p>Some Text</p>
   Some Link Text
</div>

I am trying:

string.replaceAll("<\/?[img|object](\s\w+(\=\".*\")?)*\>", ""); //REMOVE img/object

I am not sure how to strip all attributes inside a tag, though.

Any help would be appreciated.

Thanks.

Answer 1

I would not recommend regex for this if you want to filter specific tags. It is going to be a hell of a job and will never be fully reliable. Use a normal HTML parser like Jsoup. It offers the Whitelist API to clean up HTML. See also its cookbook document.

Here's a kickoff example using Jsoup. It allows only the <div> and <p> tags in addition to the standard set of tags of the chosen Whitelist, which is Whitelist#simpleText() in the example below.

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist; // renamed to Safelist in newer jsoup releases

String html = "<div id='someId' style='color: #000000'><p class='someClass'>Some Text</p><img src='images/someimage.jpg' alt='' /><a href='somelink.html'>Some Link Text</a></div>";
// Whitelist.simpleText() allows b, em, i, strong, u. Use Whitelist.none() instead if you want to start clean.
Whitelist whitelist = Whitelist.simpleText();
whitelist.addTags("div", "p");
String clean = Jsoup.clean(html, whitelist);
System.out.println(clean);

This results in:

<div>
   <p>Some Text</p>Some Link Text
</div>

See also:

How to implement a possibility for user to post some html-formatted data in a safe way?

Answer 2

You can remove all attributes like this:

string.replaceAll("(<\\w+)[^>]*(>)", "$1$2");

This expression matches an opening tag, but captures only its header (for example <div) and the closing > as groups 1 and 2. replaceAll uses references to these groups to join them back together in the output as $1$2. This cuts out the attributes in the middle of the tag.
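
As a rough sketch of how this replacement could be combined with the tag removal from the question, the snippet below first drops img and object tags and then strips the attributes from whatever opening tags remain. The class name AttributeStripper and its method names are made up for illustration. The question's pattern used [img|object], which is a character class rather than an alternation, so a non-capturing group is used here instead. Both patterns assume that no attribute value contains a literal > character, which real-world HTML does not guarantee.

```java
public class AttributeStripper {

    // Removes <img ...>, <object ...> and </object> tags.
    // Only the tags themselves are removed, not any content between them.
    static String removeImgAndObject(String html) {
        return html.replaceAll("(?i)</?(?:img|object)\\b[^>]*>", "");
    }

    // Keeps "<tagname" (group 1) and ">" (group 2), dropping everything in between.
    static String stripAttributes(String html) {
        return html.replaceAll("(<\\w+)[^>]*(>)", "$1$2");
    }

    public static void main(String[] args) {
        String html = "<div id=\"someId\" style=\"color: #000000\">"
                + "<p class=\"someClass\">Some Text</p>"
                + "<img src=\"images/someimage.jpg\" alt=\"\" />"
                + "<a href=\"somelink.html\">Some Link Text</a></div>";
        // prints <div><p>Some Text</p><a>Some Link Text</a></div>
        System.out.println(stripAttributes(removeImgAndObject(html)));
    }
}
```

Note that stripAttributes also drops the trailing slash of a self-closing tag, so <br /> becomes <br>.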

Answer 3

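/<(/?\\w+) .*?>/<\\1>/ might work: it takes the tag name (as a matching group), reads any attributes up to the closing bracket, and replaces the whole match with just the brackets and the tag name.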
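
In Java, the sed-style expression above would have to be split into a pattern and a replacement string, and the back-reference in the replacement is written $1 rather than \1. The following is a minimal sketch; the class name TagRewriter is hypothetical. Because the pattern requires a space after the tag name, tags without attributes are left untouched, and a self-closing tag such as <img ... /> loses its trailing slash.

```java
public class TagRewriter {

    // Matches "<tag ...>" or "</tag ...>" and keeps only the tag name
    // (with its leading slash, if any) between the angle brackets.
    static String stripAttributes(String html) {
        return html.replaceAll("<(/?\\w+) .*?>", "<$1>");
    }

    public static void main(String[] args) {
        String html = "<p class=\"someClass\">Some Text</p>"
                + "<a href=\"somelink.html\">Some Link Text</a>";
        // prints <p>Some Text</p><a>Some Link Text</a>
        System.out.println(stripAttributes(html));
    }
}
```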

Answer 4

It would probably be much easier if you used SAX or DOM, took the node names and values, and dropped all the attributes.
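
One way the DOM suggestion might look is sketched below, using only the JDK's built-in XML parser. It assumes the input is well-formed XML (XHTML with a single root element); ordinary HTML frequently is not, and would need an HTML parser instead. The class name DomAttributeStripper and its methods are hypothetical, and untrusted input would additionally need the parser hardened against external entities.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomAttributeStripper {

    // Recursively removes every attribute from the element and its descendants.
    static void stripAttributes(Element element) {
        NamedNodeMap attrs = element.getAttributes();
        // Iterate backwards because the map shrinks as attributes are removed.
        for (int i = attrs.getLength() - 1; i >= 0; i--) {
            element.removeAttribute(attrs.item(i).getNodeName());
        }
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Element) {
                stripAttributes((Element) child);
            }
        }
    }

    // Parses well-formed XHTML, strips all attributes, and serializes the result.
    static String clean(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            stripAttributes(doc.getDocumentElement());
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(clean(
                "<div id=\"someId\"><p class=\"someClass\">Some Text</p></div>"));
    }
}
```

Removing unwanted elements such as img or object would follow the same pattern: walk the tree and call removeChild on the parent of each element to drop.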

4 answers

Answer 1: score 8, 2012-02-23 18:09:00
Answer 2: score 7, accepted, 2012-02-23 15:28:35
Answer 3: score 1, 2012-02-23 15:26:23
Answer 4: score -1, 2012-02-23 15:26:07
