Java regex to enable links inside text, except those already enclosed in tags

Question

Suppose a user enters text that contains HTML and possibly links. I want to turn bare URLs into clickable links while leaving URLs that are already inside a tag intact.

(I know many regex URL pattern questions have been asked, but I could not find a solution to this one.)

For example:

String urlRegex = "((https?|ftp|gopher|telnet|file):((//)|(\\\\))+[\\w\\d:#@%/;$()~_?\\+-=\\\\\\.&]*)";
Pattern urlPattern = Pattern.compile(urlRegex, Pattern.CASE_INSENSITIVE);

String s = ...
String linked = urlPattern.matcher(s).replaceAll("<a href='$0' target='_blank'>$0</a>");

It converts "google https://google.com" to google <a href='https://google.com' target='_blank'>https://google.com</a>, which is what I want.

But if the string is

"<a href=\"http://www.google.com/\">google</a> " +
" http://www.google.com/  " +
" <a href=\"https://facebook.com/\">facebook</a> " +
" https://facebook.com ";

it becomes

<a href="<a href='http://www.google.com/' target='_blank'>http://www.google.com/</a>">google</a>  <a href='http://www.google.com/' target='_blank'>http://www.google.com/</a>   <a href="<a href='https://facebook.com/' target='_blank'>https://facebook.com/</a>">facebook</a>  <a href='https://facebook.com' target='_blank'>https://facebook.com</a>

It should not touch the values inside href, so I changed urlRegex to:

urlRegexExceptAnchor = "(?<!\\<a\\ href=\")(http|https):\\/\\/[^ ]*";

This handles text mixed with anchor tags.

But if the text includes an iframe, it fails again:

<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2Fnytimes%2Fposts%2F10151112309519999&width=500" width="500" height="525" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>

becomes

<iframe src="<a href='https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2Fnytimes%2Fposts%2F10151112309519999&width=500"' target='_blank'>https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2Fnytimes%2Fposts%2F10151112309519999&width=500"</a> width="500" height="525" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true"></iframe>

The result is invalid HTML again.

I expect to run into more and more cases like this, because many tags accept URLs. I cannot just special-case the a and iframe tags...

The text is entered by users. Sure, I can filter out some disallowed tags such as form, head, input..., but there are still a lot of tags to process (or even inline CSS background URLs).

What I can think of now is to use something like JSoup to parse the whole text into an HTML document and process the text nodes one by one. But I think that's overkill. (Each page display would invoke JSoup...)

Is there an easier way to achieve this?

Answer 1

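To anyone facing similar problems, this is my JSoup solution:

import org.apache.commons.lang3.StringUtils;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

  // urlPattern is the compiled pattern from the question; logger is presumably an SLF4J-style logger.
  private static void processNode(Node node) {
    if (node instanceof TextNode) {

      // Leave text alone when it sits directly inside a tag that already carries a URL.
      Node parent = node.parent();
      if (parent != null && (StringUtils.equalsAnyIgnoreCase(parent.nodeName(),
        "a", "iframe", "embed" , "img" , "object" , "script" , "video" , "applet"))) {
        logger.debug("parent = {} , skipped", parent.nodeName());
      }
      else {
        TextNode textNode = (TextNode) node;

        // Wrap every URL found in this text node in an anchor tag.
        String text = textNode.text();
        text = urlPattern.matcher(text).replaceAll("<a href='$0' target='_blank'>$0</a>");

        // Note: jsoup escapes TextNode content when serializing, so the caller
        // may need to unescape the inserted markup when rendering the result.
        TextNode r = new TextNode(text , null);
        node.replaceWith(r);
      }
    } else if (node instanceof Element) {
      // Recurse into child nodes; attribute values are never visited.
      Element ele = (Element) node;
      for (Node childNode : ele.childNodes()) {
        processNode(childNode);
      }
    }
  }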

It works fine... (for now)
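
For readers who want to avoid a full parse on every page view, as the question asks, a regex-only alternative might look like the sketch below. It is not the solution above but a different technique: an alternation that matches whole tags first and copies them through unchanged, so only URLs outside any tag reach the linkifying branch. The names Linkifier, linkify and TOKEN are made up for illustration, and the URL sub-pattern is a simplified assumption (http and https only), not the urlRegex from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Linkifier {
    // Hypothetical pattern: alternatives are tried left to right at each position.
    private static final Pattern TOKEN = Pattern.compile(
        "<a\\b[^>]*>.*?</a>"            // an existing anchor, including its link text
        + "|<[^>]*>"                    // any other tag, with all of its attributes
        + "|(https?://[^\\s<>\"']+)",   // group 1: a URL that is outside every tag
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String linkify(String html) {
        Matcher m = TOKEN.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String url = m.group(1);
            // Group 1 is null when a tag or an existing anchor matched: copy it through as-is.
            String replacement = (url == null)
                ? m.group()
                : "<a href='" + url + "' target='_blank'>" + url + "</a>";
            // quoteReplacement keeps '$' and '\' in the URL from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "<a href=\"http://www.google.com/\">google</a> http://www.google.com/ "
            + "<iframe src=\"https://www.facebook.com/plugins/post.php?width=500\"></iframe>";
        System.out.println(linkify(input));
    }
}
```

Because the tag alternatives consume href, src and inline style attributes as part of the tag, this sketch does not need a list of URL-bearing tags. It is still only a heuristic: attribute values containing ">", HTML comments, and text inside script or style elements are not handled, and trailing punctuation after a URL is included in the link. A real parser such as JSoup remains the more robust choice.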
