
How to remove unnecessary MS Word HTML tags using Java regex

I have a WYSIWYG editor, and users sometimes cut and paste into it from MS Word. In my server-side Java I am trying to remove unnecessary HTML from the pasted content, such as:

<o:p>

Should be:

<p>

The patterns I am trying to remove are:

  //Remove:
  // unnecessary tag spans (comments and title)
  //   <!--(w|W)+?-->
  //   <title>(w|W)+?</title>
  //classes and styles
  //    s?class=w+
  //    s+style='[^']+'
  //unnecessary tags
  //    <(meta|link|/?o:|/?style|/?div|/?std|/?head|/?html|body|/?body|/?span|![)[^>]*?>
  //empty paragraph tags
  //    (<[^>]+>)+&nbsp;(</w+>)+
  //bizarre v: element attached to <img> tag
  //    s+v:w+=""[^""]+""

My code is:

  Pattern p = Pattern.compile("<!--(w|W)+?-->?|<title>(w|W)+?</title>?|s+style='[^']+'?|"
        + "<(meta|link|/?o:|/?style|/?div|/?std|/?head|/?html|body|/?body|/?span|![)[^>]*?>?|"
        + "(<[^>]+>)+&nbsp;(</w+>)+?", Pattern.CASE_INSENSITIVE);
  Matcher m = p.matcher(html);
  String result = m.replaceAll("");

I get the error:

java.util.regex.PatternSyntaxException: Unclosed character class near index 163
<!--(w|W)+?-->?|<title>(w|W)+?</title>?|s+style='[^']+'?|<(meta|link|/?o:|/?style|/?div|/?std|/?head|/?html|body|/?body|/?span|![)[^>]*?>?|(<[^>]+>)+&nbsp;(</w+>)+?

Can someone please advise me on the correct syntax?

Wiktor has provided an excellent answer; however the colour style is removed and I would like to keep that if possible.

Before clean:

notClean: <p class="MsoNormal"><b><span lang="EN-AU" style="font-size:11.0pt;font-family:&quot;Verdana&quot;,sans-serif;color:#006600">Special
Interest Area badges youth members can achieve, supported by Queensland
Environmental Education Team:<o:p></o:p></span></b></p><p class="MsoNormal"><b><span lang="EN-AU">&nbsp;</span></b></p><p class="MsoNormal"><b><span lang="EN-AU">&nbsp;</span></b></p><p>

</p><p class="MsoNormal"><b><i><span lang="EN-AU" style="font-size:11.0pt;font-family:&quot;Verdana&quot;,sans-serif">Joey Scout SIA Badges
(2 hours each badge)</span></i></b><b><span lang="EN-AU" style="font-size:11.0pt;font-family:&quot;Verdana&quot;,sans-serif"><o:p></o:p></span></b></p>

After clean:

cleaned: <p class="MsoNormal"><b>Special
Interest Area badges youth members can achieve, supported by Queensland
Environmental Education Team:<p>

</p><p class="MsoNormal"><b><i>Joey Scout SIA Badges
(2 hours each badge)</i></b><b></b></p>

I tried:

Pattern p = Pattern.compile("<!--.*?-->|<title>.*?</title>|"
        + "<(meta|link|/?o:|/?div|/?std|/?head|/?html|/?body|/?span|!\\[)[^>]*>|"
        + "(<[^>]+>)+&nbsp;(</\\w+>)+", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);

However, the style is still removed.

I had to leave the "span" in as well.
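
A likely reason the colour disappears: in Word's markup the colour lives in the style attribute of the span, and the /?span alternative deletes the whole span tag, attribute included. One possible workaround, shown here only as a rough sketch, is to rewrite each opening span so that it keeps nothing but its color declaration, and to leave the closing tags alone so the markup stays balanced. The class name KeepColourDemo and the helper keepOnlyColour are made up for this illustration, and the sketch assumes attribute values never contain a > character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeepColourDemo {
    // Opening <span ...> tag; group 1 captures its raw attribute text.
    private static final Pattern SPAN_OPEN =
            Pattern.compile("<span\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // A "color: value" declaration; the lookbehind skips background-color and bgcolor.
    private static final Pattern COLOUR =
            Pattern.compile("(?<![-\\w])color\\s*:\\s*([^;\"']+)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: reduces every opening span to its colour, or to a bare <span>.
    static String keepOnlyColour(String html) {
        Matcher m = SPAN_OPEN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Matcher c = COLOUR.matcher(m.group(1));
            String tag = c.find()
                    ? "<span style=\"color:" + c.group(1).trim() + "\">"
                    : "<span>";
            // quoteReplacement guards against $ or \ in the rebuilt tag.
            m.appendReplacement(out, Matcher.quoteReplacement(tag));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String in = "<b><span lang=\"EN-AU\" style=\"font-size:11.0pt;color:#006600\">Special</span></b>";
        System.out.println(keepOnlyColour(in));
        // prints <b><span style="color:#006600">Special</span></b>
    }
}
```

With this approach /?span would be dropped from the removal pattern and the rewrite run as a separate pass. Spans without a colour end up as bare <span> wrappers, which are harmless but could be cleaned up by a proper HTML parser if one is available.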

You can use

String html = "Cleaned!<!-- \nsome comment --><title> my title</title> style='OUR_STYLE'<meta ...>";
Pattern p = Pattern.compile("<!--.*?-->|<title>.*?</title>|\\s+style='[^']+'|"
        + "<(meta|link|/?o:|/?style|/?div|/?std|/?head|/?html|/?body|/?span|!\\[)[^>]*>|"
        + "(<[^>]+>)+&nbsp;(</\\w+>)+", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
Matcher m = p.matcher(html);
String result = m.replaceAll("");
System.out.println(result);
// => Cleaned!

See the Java demo.

NOTES:

  • Pattern.DOTALL makes . match any char, including line break chars (so there is no need for a workaround like [\w\W]).
  • Do not forget to escape backslashes in regex escapes like \s or \w (in a Java string literal, "\\s" or "\\w").
  • Do not forget to escape special regex metacharacters, like [ or (; see What special characters must be escaped in regular expressions?
  • If a char must be present in the string, do not put ? after it (as is the case with > in your pattern); it makes the char optional.
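
The notes above can be checked with a small, self-contained sketch. The class RegexNotesDemo and its helpers compiles and strip are invented for this illustration; only Pattern, Matcher.replaceAll and PatternSyntaxException come from java.util.regex.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexNotesDemo {
    // Returns true if the regex compiles, false if Java rejects its syntax.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Deletes every match of regex (compiled with the given flags) from input.
    static String strip(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // An unescaped [ opens a character class that is never closed.
        System.out.println(compiles("<(meta|![)[^>]*>"));    // false
        System.out.println(compiles("<(meta|!\\[)[^>]*>"));  // true

        // Without DOTALL, . stops at the line break and the comment survives.
        String html = "a<!-- one\ntwo -->b";
        System.out.println(strip("<!--.*?-->", 0, html).equals(html)); // true
        System.out.println(strip("<!--.*?-->", Pattern.DOTALL, html)); // ab

        // A trailing ? makes > optional, so "<b>?" also eats the start of <body>.
        System.out.println(strip("<b>?", 0, "<body>")); // ody>
        System.out.println(strip("<b>", 0, "<body>"));  // <body>
    }
}
```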

I believe you need to escape the special characters <([{\^-=$!|]})?*+.>

Here is a link with more info.
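
Which of those characters actually needs a backslash depends on where it appears in the pattern. When the whole piece of text is meant literally, one option is to let Pattern.quote do the escaping instead of doing it by hand. A minimal sketch, where the class QuoteDemo and the helper removeLiteral are hypothetical names used only for illustration:

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Hypothetical helper: removes every literal occurrence of marker, ignoring case.
    static String removeLiteral(String input, String marker) {
        // Pattern.quote turns the marker into a literal pattern, so [ ! ] < > need no manual escaping.
        return Pattern.compile(Pattern.quote(marker), Pattern.CASE_INSENSITIVE)
                .matcher(input)
                .replaceAll("");
    }

    public static void main(String[] args) {
        String html = "a<![if !supportLists]>b<![endif]>c";
        System.out.println(removeLiteral(html, "<![endif]>"));
        // prints a<![if !supportLists]>bc
    }
}
```

This only helps for fixed strings; as soon as part of the pattern has to stay a real regex (for example [^>]*), the literal parts still need individual escaping as in the answer above.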
