Suppose a user submits text that contains HTML and possibly bare URLs. I want to turn the bare URLs into links while leaving URLs that already sit inside tags intact.
(I know many regex URL pattern questions have been asked, but I could not find one that covers this case.)
For example:
// '-' is escaped in the character class; unescaped, "+-=" is a range that also matches '<' and ','
String urlRegex = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+\\-=\\\\\\.&]*)";
Pattern urlPattern = Pattern.compile(urlRegex, Pattern.CASE_INSENSITIVE);
String s = ...
String linked = urlPattern.matcher(s).replaceAll("<a href='$0' target='_blank'>$0</a>");
It translates "google https://google.com" to
google <a href='https://google.com' target='_blank'>https://google.com</a>
which is good.
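As a minimal, self-contained sketch of the naive replacement described above (the class name NaiveLinkify and the method linkify are made up for illustration, and the hyphen in the character class is escaped here, which is an assumption about the intended pattern):

```java
import java.util.regex.Pattern;

class NaiveLinkify {
    // URL pattern from the question. Assumption: '-' is escaped so that
    // "+-=" inside the character class is not read as a range.
    static final Pattern URL_PATTERN = Pattern.compile(
        "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+\\-=\\\\\\.&]*)",
        Pattern.CASE_INSENSITIVE);

    // Wraps every match in an anchor tag, wherever the match occurs.
    static String linkify(String s) {
        return URL_PATTERN.matcher(s).replaceAll("<a href='$0' target='_blank'>$0</a>");
    }

    public static void main(String[] args) {
        System.out.println(linkify("google https://google.com"));
    }
}
```

Because the pattern knows nothing about the surrounding markup, it rewrites a URL inside an attribute value just as readily as one in plain text.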
But if the string is
"<a href=\"http://www.google.com/\">google</a> " +
" http://www.google.com/ " +
" <a href=\"https://facebook.com/\">facebook</a> " +
" https://facebook.com ";
it becomes
<a href="<a href='http://www.google.com/' target='_blank'>http://www.google.com/</a>">google</a> <a href='http://www.google.com/' target='_blank'>http://www.google.com/</a> <a href="<a href='https://facebook.com/' target='_blank'>https://facebook.com/</a>">facebook</a> <a href='https://facebook.com' target='_blank'>https://facebook.com</a>
The values inside href should not be touched, so I changed the regex to:
String urlRegexExceptAnchor = "(?<!\\<a\\ href=\")(http|https):\\/\\/[^ ]*";
That handles text mixed with anchor tags. But if the text includes an iframe, it fails again:
<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2Fnytimes%2Fposts%2F10151112309519999&width=500" width="500" height="525" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>
becomes
<iframe src="<a href='https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2Fnytimes%2Fposts%2F10151112309519999&width=500"' target='_blank'>https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2Fnytimes%2Fposts%2F10151112309519999&width=500"</a> width="500" height="525" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>
It's invalid again.
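The behaviour of the lookbehind pattern can be reproduced with a small sketch. The class and method names are hypothetical, and the example.com URL is a shortened stand-in for the Facebook embed URL above:

```java
import java.util.regex.Pattern;

class LookbehindLinkify {
    // Second pattern from the question: a URL is skipped only when it is
    // immediately preceded by the literal text <a href="
    static final Pattern URL_EXCEPT_ANCHOR = Pattern.compile(
        "(?<!\\<a\\ href=\")(http|https):\\/\\/[^ ]*");

    static String linkify(String s) {
        return URL_EXCEPT_ANCHOR.matcher(s).replaceAll("<a href='$0' target='_blank'>$0</a>");
    }

    public static void main(String[] args) {
        // The href value is left alone, the bare URL is linked.
        System.out.println(linkify(
            "<a href=\"http://www.google.com/\">google</a> http://www.google.com/"));
        // The src value is not protected by the lookbehind, so it is rewritten.
        System.out.println(linkify(
            "<iframe src=\"https://example.com/embed\" width=\"500\"></iframe>"));
    }
}
```

The lookbehind covers exactly one attribute of one tag, and [^ ]* runs up to the next space, so the closing quote of the src value ends up inside the generated link as well.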
I expect to run into more and more such cases, because many tags accept URLs. I cannot just special-case the a and iframe tags...
The text is entered by users. I can certainly filter out some disallowed tags, such as form, head, input, ..., but there are still many tags to handle (or even inline CSS background URLs).
The only approach I can think of is to use something like JSoup to parse the whole text into an HTML document and process the text nodes one by one. But that seems like overkill (every page display would invoke JSoup...).
Is there an easier way to achieve this?
For anyone facing a similar problem, this is my JSoup solution:
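// Needs: java.util.ArrayList, java.util.regex.Matcher, org.apache.commons.lang3.StringUtils,
// org.jsoup.nodes.Element, org.jsoup.nodes.Node, org.jsoup.nodes.TextNode
private static void processNode(Node node) {
    if (node instanceof TextNode) {
        Node parent = node.parent();
        if (parent != null && StringUtils.equalsAnyIgnoreCase(parent.nodeName(),
                "a", "iframe", "embed", "img", "object", "script", "video", "applet")) {
            // Text inside these tags is left untouched.
            logger.debug("parent = {} , skipped", parent.nodeName());
        } else {
            String text = ((TextNode) node).getWholeText();
            Matcher m = urlPattern.matcher(text);
            int last = 0;
            // Insert real <a> elements; markup stored in a TextNode would be escaped on output.
            while (m.find()) {
                node.before(new TextNode(text.substring(last, m.start())));
                node.before(new Element("a").attr("href", m.group())
                        .attr("target", "_blank").text(m.group()));
                last = m.end();
            }
            if (last > 0) {
                // At least one URL was found: add the remaining text and drop the original node.
                node.before(new TextNode(text.substring(last)));
                node.remove();
            }
        }
    } else if (node instanceof Element) {
        // Iterate over a copy, because linkifying a text node changes the parent's child list.
        for (Node childNode : new ArrayList<>(node.childNodes())) {
            processNode(childNode);
        }
    }
}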
It works fine ... (for now)
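For comparison, the "easier way" asked about above can be approximated without a parser. This is only a sketch, not a replacement for real HTML parsing. It lets the regex match whole tags as alternatives and copies them through unchanged, and it linkifies only URLs matched outside any tag. The class name, method name, and simplified URL pattern are all assumptions made for illustration. It does not handle '>' inside attribute values, comments, or the bodies of script and style elements.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkifyOutsideTags {
    // Alternative 1: a complete <a ...>...</a> element (skipped as a whole).
    // Alternative 2: any other single tag, including its attributes (skipped).
    // Alternative 3 (group 1): a bare URL in plain text (linkified).
    static final Pattern TOKEN = Pattern.compile(
        "<a\\b[^>]*>.*?</a>"
        + "|<[^>]*>"
        + "|(https?://[^\\s<>\"']+)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String linkify(String html) {
        Matcher m = TOKEN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String url = m.group(1);
            // group(1) is null when a tag alternative matched: copy it as-is.
            String replacement = (url == null)
                ? m.group()
                : "<a href='" + url + "' target='_blank'>" + url + "</a>";
            // quoteReplacement keeps '$' and '\' in the text from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(linkify(
            "<a href=\"http://www.google.com/\">google</a> http://www.google.com/ "
            + "<iframe src=\"https://example.com/embed?a=1&b=2\" width=\"500\"></iframe>"));
    }
}
```

Since every tag is consumed before the URL alternative gets a chance to match, no list of URL-carrying tags or attributes is needed. The trade-off is that the tag matching is purely lexical, so malformed or unusual markup can still slip through.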