I have a WYSIWYG editor, and users sometimes cut and paste into it from MS Word. In my server-side Java I am trying to remove unnecessary HTML from the pasted content, such as:
<o:p>
Should be:
<p>
The patterns I am trying to remove are:
//Remove:
// unnecessary tag spans (comments and title)
// <!--(w|W)+?-->
// <title>(w|W)+?</title>
//classes and styles
// s?class=w+
// s+style='[^']+'
//unnecessary tags
// <(meta|link|/?o:|/?style|/?div|/?std|/?head|/?html|body|/?body|/?span|![)[^>]*?>
//empty paragraph tags
// (<[^>]+>)+ (</w+>)+
//bizarre v: element attached to <img> tag
// s+v:w+=""[^""]+""
My code is:
Pattern p = Pattern.compile("<!--(w|W)+?-->?|<title>(w|W)+?</title>?|s+style='[^']+'?|"
+ "<(meta|link|/?o:|/?style|/?div|/?std|/?head|/?html|body|/?body|/?span|![)[^>]*?>?|"
+ "(<[^>]+>)+ (</w+>)+?", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(html);
String result = m.replaceAll("");
I get the error:
java.util.regex.PatternSyntaxException: Unclosed character class near index 163
<!--(w|W)+?-->?|<title>(w|W)+?</title>?|s+style='[^']+'?|<(meta|link|/?o:|/?style|/?div|/?std|/?head|/?html|body|/?body|/?span|![)[^>]*?>?|(<[^>]+>)+ (</w+>)+?
Can someone please advise me on the correct syntax?
Wiktor has provided an excellent answer; however the colour style is removed and I would like to keep that if possible.
Before clean:
notClean: <p class="MsoNormal"><b><span lang="EN-AU" style="font-size:11.0pt;font-family:"Verdana",sans-serif;color:#006600">Special
Interest Area badges youth members can achieve, supported by Queensland
Environmental Education Team:<o:p></o:p></span></b></p><p class="MsoNormal"><b><span lang="EN-AU"> </span></b></p><p class="MsoNormal"><b><span lang="EN-AU"> </span></b></p><p>
</p><p class="MsoNormal"><b><i><span lang="EN-AU" style="font-size:11.0pt;font-family:"Verdana",sans-serif">Joey Scout SIA Badges
(2 hours each badge)</span></i></b><b><span lang="EN-AU" style="font-size:11.0pt;font-family:"Verdana",sans-serif"><o:p></o:p></span></b></p>
After clean:
cleaned: <p class="MsoNormal"><b>Special
Interest Area badges youth members can achieve, supported by Queensland
Environmental Education Team:<p>
</p><p class="MsoNormal"><b><i>Joey Scout SIA Badges
(2 hours each badge)</i></b><b></b></p>
I tried:
Pattern p = Pattern.compile("<!--.*?-->|<title>.*?</title>|"
+ "<(meta|link|/?o:|/?div|/?std|/?head|/?html|/?body|/?span|!\\[)[^>]*>|"
+ "(<[^>]+>)+ (</\\w+>)+", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
However, the style is still removed.
I had to leave the "span" in as well.
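In the sample above, the style attribute sits on the span tags, so any pattern that still contains /?span strips the colour along with the tag. A minimal sketch, assuming it is acceptable to keep the span tags in the output (the class name KeepStyleCleaner and the shortened input are made up for illustration):

```java
import java.util.regex.Pattern;

class KeepStyleCleaner {
    // Same alternatives as the attempt above, minus "/?span", so span tags
    // (and the style attributes on them) are left untouched.
    private static final Pattern WORD_JUNK = Pattern.compile(
            "<!--.*?-->|<title>.*?</title>|"
            + "<(meta|link|/?o:|/?div|/?std|/?head|/?html|/?body|!\\[)[^>]*>|"
            + "(<[^>]+>)+ (</\\w+>)+",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String clean(String html) {
        return WORD_JUNK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Shortened, hypothetical input modelled on the pasted Word markup.
        String html = "<p class=\"MsoNormal\"><span style=\"color:#006600\">"
                + "Special<o:p></o:p></span></p>";
        System.out.println(clean(html));
        // <p class="MsoNormal"><span style="color:#006600">Special</span></p>
    }
}
```

The trade-off is that spans without any useful style survive too, so the output is less clean than with /?span in the pattern.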
You can use
String html = "Cleaned!<!-- \nsome comment --><title> my title</title> style='OUR_STYLE'<meta ...>";
Pattern p = Pattern.compile("<!--.*?-->|<title>.*?</title>|\\s+style='[^']+'|"
+ "<(meta|link|/?o:|/?style|/?div|/?std|/?head|/?html|/?body|/?span|!\\[)[^>]*>|"
+ "(<[^>]+>)+ (</\\w+>)+", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
Matcher m = p.matcher(html);
String result = m.replaceAll("");
System.out.println(result);
// => Cleaned!
See the Java demo.
NOTES:
Pattern.DOTALL makes . match any chars, including line break chars (so there is no need for a workaround like [\w\W]).
To match whitespace or word chars, use \s or \w (in a Java string literal, "\\s" or "\\w").
A literal [ or ( must be escaped; see "What special characters must be escaped in regular expressions?".
When ? follows a char (as is the case with > in your pattern), it makes that char optional.

I believe you need to escape the special characters <([{\^-=$!|]})?*+.>
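The points about escaping, DOTALL and the optional ? can be checked with a short sketch. The class and method names here (RegexNotesDemo, compiles, stripComments) are hypothetical helpers; only java.util.regex calls from the standard library are used:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexNotesDemo {
    // Returns true if the regex compiles, false on a PatternSyntaxException.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // Removes HTML comments using the given Pattern flags.
    static String stripComments(String html, int flags) {
        return Pattern.compile("<!--.*?-->", flags).matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // An unescaped '[' opens a character class that is never closed.
        System.out.println(compiles("<(meta|![)[^>]*>"));       // false
        System.out.println(compiles("<(meta|!\\[)[^>]*>"));     // true
        // Pattern.quote escapes a whole literal for you.
        System.out.println(compiles(Pattern.quote("<![if")));   // true

        // Without DOTALL, '.' stops at the line break and nothing is removed.
        String html = "a<!-- x\ny -->b";
        System.out.println(stripComments(html, 0).equals(html));    // true
        System.out.println(stripComments(html, Pattern.DOTALL));    // ab

        // "-->?" makes the '>' optional, so a bare "--" is matched as well.
        System.out.println("a--b".replaceAll("-->?", ""));          // ab
    }
}
```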