How to remove object tags from my HTML using Java

Question

Hi, I am trying to remove the object tag from my HTML content using Java, so that I can render the HTML on devices that do not support Flash.

<object classid="clsid:F08DF954-8592-11D1-B16A-00C0F0283628" id="Slider1" width="100" height="50">
  <param name="BorderStyle" value="1" />
  <param name="MousePointer" value="0" />
  <param name="Enabled" value="1" />
  <param name="Min" value="0" />
  <param name="Max" value="10" />
</object>

Answer 1

This regex should do the job:

<\/?object(\s\w+(\=\".*\")?)*\>
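For reference, here is a minimal sketch of how that pattern could be applied in Java. The class and method names (`TagOnlyRegexDemo`, `stripObjectTags`) are made up for illustration. Note that the pattern only matches the `<object ...>` and `</object>` tags themselves, so nested `<param>` elements and any fallback content are left in place; it is also case-sensitive, and the greedy `.*` can over-match when other quoted text follows on the same line.

```java
class TagOnlyRegexDemo {
    // The regex from the answer, escaped for a Java string literal.
    static final String OBJECT_TAG = "<\\/?object(\\s\\w+(\\=\\\".*\\\")?)*\\>";

    // Removes the opening and closing object tags only.
    // Child elements such as <param> are NOT removed by this pattern.
    static String stripObjectTags(String html) {
        return html.replaceAll(OBJECT_TAG, "");
    }
}
```

If the `<param>` children must go as well, a second pass or one of the other answers' approaches would be needed.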

Answer 2

You could just use TagSoup (http://ccil.org/~cowan/XML/tagsoup/), a parser that reads HTML and exposes it as XML, even if the HTML is badly formatted (it doesn't need to be XHTML or even conform to a standard).

You can then remove all object tags using XPath.

This is much safer than a regex, which is difficult to maintain if you want to handle all the edge cases.
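A rough sketch of the XPath removal step, using only JDK classes, might look like the following. It assumes the markup is already well-formed XML; with real-world HTML, the `Document` would instead come from TagSoup (or a similar tolerant parser), and that part is not shown here. The names `XPathObjectRemover` and `removeObjects` are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathObjectRemover {
    // Assumption: the input is well-formed XML (e.g. XHTML).
    static String removeObjects(String wellFormedMarkup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));

            // local-name() matches with or without an XHTML namespace.
            // The comparison is case-sensitive: only lowercase "object" is matched.
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//*[local-name()='object']", doc, XPathConstants.NODESET);

            // Detach every matched element, together with its children.
            for (int i = 0; i < hits.getLength(); i++) {
                Node node = hits.item(i);
                if (node.getParentNode() != null) {
                    node.getParentNode().removeChild(node);
                }
            }

            // Serialize the modified tree back to a string.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not process markup", e);
        }
    }
}
```

Because whole elements are removed from the tree, nested object elements and their `<param>` children disappear along with the outer element.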

Answer 3

The OBJECT HTML element may be nested. Since Java's regex engine does not support recursive expressions, you cannot directly match an outermost OBJECT element with a single regex. You can, however, craft a regex to match an innermost OBJECT element, and iterate, replacing them from the "inside out" until there are none left. Here is a tested Java snippet which does precisely that:

String regex = "<object\\b[^>]*>[^<]*(?:(?!</?object\\b)<[^<]*)*</object\\s*>";
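String resultString = subjectString; // Start from the input, so text without any OBJECT element is returned unchanged instead of null.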
java.util.regex.Pattern p = java.util.regex.Pattern.compile(
            regex,
            java.util.regex.Pattern.CASE_INSENSITIVE |
            java.util.regex.Pattern.UNICODE_CASE);
java.util.regex.Matcher m = p.matcher(subjectString);
while (m.find())
{ // Iterate until there are no OBJECT elements.
    resultString = m.replaceAll("");
    m = p.matcher(resultString);
}
System.out.println(resultString);
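To try the inside-out loop on a nested example, the snippet could be wrapped in a small helper along these lines. This is only a sketch: the class and method names (`InsideOutDemo`, `stripObjects`) are invented here, and the regex is the one from the snippet above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InsideOutDemo {
    // Matches an innermost OBJECT element, i.e. one with no OBJECT tags inside it.
    private static final Pattern INNERMOST_OBJECT = Pattern.compile(
            "<object\\b[^>]*>[^<]*(?:(?!</?object\\b)<[^<]*)*</object\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    static String stripObjects(String html) {
        String result = html;
        Matcher m = INNERMOST_OBJECT.matcher(result);
        // Each pass removes the current innermost elements;
        // repeat until no OBJECT element is left.
        while (m.find()) {
            result = m.replaceAll("");
            m = INNERMOST_OBJECT.matcher(result);
        }
        return result;
    }
}
```

Given an outer `<OBJECT id="outer">` that wraps an inner `<object id="inner">`, the first pass removes the inner element and the second pass removes the outer one, leaving only the surrounding markup.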

CAVEATS: As many will undoubtedly point out: "You can't parse HTML with regex!" And they are correct (if your solution must work reliably 100% of the time). Although the solution above will work for a lot of cases, be aware that it has some limitations and there are certain things which can trip it up, namely:

An "<OBJECT...>" start tag or "</OBJECT>" end tag must not appear in any CDATA string (such as the content of a SCRIPT or STYLE element), within any tag attribute, or within any HTML comment, e.g. <p title="evil <OBJECT> attribute"> or <SCRIPT>alert("Bad <OBJECT> script here!");</SCRIPT>.
The <OBJECT> start tag must not contain any angle brackets in its attributes.

These special cases should be pretty rare, and the code above should work just fine for most (if not all) HTML files you have lying around.
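To make the first caveat concrete, here is a small sketch (hypothetical names `CaveatDemo` and `strip`, same regex and loop as above) that can be fed an input where `</object>` appears inside an attribute value. The match then ends at the fake end tag, so part of the element is left behind.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaveatDemo {
    private static final Pattern INNERMOST_OBJECT = Pattern.compile(
            "<object\\b[^>]*>[^<]*(?:(?!</?object\\b)<[^<]*)*</object\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Same inside-out loop as in the answer's snippet.
    static String strip(String html) {
        String result = html;
        Matcher m = INNERMOST_OBJECT.matcher(result);
        while (m.find()) {
            result = m.replaceAll("");
            m = INNERMOST_OBJECT.matcher(result);
        }
        return result;
    }
}
```

For the input `<object id="x"><p title="a </object> b">text</p></object>`, the regex stops at the `</object>` inside the `title` attribute, and the leftover ` b">text</p></object>` remains in the output.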

3 answers

Answer 1
0 accepted 2011-03-22 09:45:48

Answer 2
0 2011-03-22 20:58:26

Answer 3
0 2011-03-23 01:07:41
