
RegEx: look for URLs that do not start with ="

I am trying to build a function that finds URLs in strings and turns them into links. However, I do not want to match URLs that are already inside an HTML tag (such as <a> and <img>).

In other words, the RegEx should find these and replace each with a link:

http://www.stackoverflow.com
www.stackoverflow.com
www.stackoverflow.com/logo.gif

But not these URLs (since they are already formatted):

<a href="http://www.stackoverflow.com">http://www.stackoverflow.com</a>
<img src="http://www.stackoverflow.com/logo.gif">

I am using a RegEx that was already developed for this, but it does not check whether the URL is already inside an HTML element: http://blog.mattheworiordan.com/post/13174566389/url-regular-expression-for-links-with-or-without

This is the original RegEx:

/((([A-Za-z]{3,9}:(?:\/\/)?)(?:[\-;:&=\+\$,\w]+@)?[A-Za-z0-9\.\-]+|(?:www\.|[\-;:&=\+\$,\w]+@)[A-Za-z0-9\.\-]+)((?:\/[\+~%\/\.\w\-_]*)?\??(?:[\-\+=&;%@\.\w_]*)#?(?:[\.\!\/\\\w]*))?)/

This is the same RegEx with explanations:

(
  ( // brackets covering match for protocol (optional) and domain
    ([A-Za-z]{3,9}:(?:\/\/)?) // match protocol, allow in format http:// or mailto:
    (?:[\-;:&=\+\$,\w]+@)? // allow something@ for email addresses
    [A-Za-z0-9\.\-]+ // anything looking at all like a domain, non-unicode domains
    | // or instead of above
    (?:www\.|[\-;:&=\+\$,\w]+@) // starting with something@ or www.
    [A-Za-z0-9\.\-]+   // anything looking at all like a domain
  )
  ( // brackets covering match for path, query string and anchor
    (?:\/[\+~%\/\.\w\-]*)? // allow optional /path
    \??(?:[\-\+=&;%@\.\w]*) // allow optional query string starting with ?
    #?(?:[\.\!\/\\\w]*) // allow optional anchor #anchor
  )? // make URL suffix optional
)

What I am trying to do is change this so that it checks whether the URL is immediately preceded by exactly =" or >, and if it is, the URL should not be matched. The URLs inside <a> and <img> elements should have one of these combinations right before them.

I am not the greatest at RegEx, but I have tried, and this is my best attempt so far. It does not do the trick:

/(((^[^\="|\>])([A-Za-z]{3,9}:(?:\/\/)?)(?:[\-;:&=\+\$,\w]+@)?[A-Za-z0-9\.\-]+|(?:www\.|[\-;:&=\+\$,\w]+@)[A-Za-z0-9\.\-]+)((?:\/[\+~%\/\.\w\-]*)?\??(?:[\-\+=&;%@\.\w]*)#?(?:[\.\!\/\\\w]*))?)/g;

This is the part I have added:

(^[^\="|\>])
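For reference, here is a small sketch of what the added group does on its own, next to a negative lookbehind, which is one way to express "not preceded by =" or >". The short pattern `www\.\w+\.\w+` is only a stand-in for the full URL RegEx, and lookbehind assumes a JavaScript engine with ES2018 support.

```javascript
// Inside [...] the characters are alternatives, so [^\="|\>] means
// "one character that is not =, ", | or >". The ^ outside the brackets
// anchors the match to the start of the string.
var added = /(^[^\="|\>])/;

// Stand-in for the full URL RegEx (an assumption, for brevity).
var simpleUrl = /www\.\w+\.\w+/g;

// Negative lookbehind: the URL must not be preceded by =" or >.
// It checks the preceding text without consuming it.
var notInTag = /(?<!="|>)\bwww\.\w+\.\w+/g;

var text = 'see www.a.com and <a href="www.b.com">www.c.com</a>';

console.log(added.test(' www.a.com')); // true: the group consumes the leading space
console.log(text.match(simpleUrl));    // all three URLs
console.log(text.match(notInTag));     // only 'www.a.com'
```

The lookbehind version leaves the rest of the pattern untouched, which matters because the character-class version would make the preceding character part of the match.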

This is my fiddle:

http://jsfiddle.net/0w1g4mm9/2/

You could try something like this:

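string.replace(
  // Group 1 captures an existing <a>...</a> element; note the escaped slash in <\/a>.
  /(<a[^>]*>[^<]*<\/a>)|YOUR_REGEX_HERE/g,
  function (match, link, YOUR_CAPTURE_GROUP_1, etc) {
    if (link) {
      return link; // already a link: leave it untouched
    }
    return YOUR_DESIRED_REPLACEMENT;
  }
);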

The above matches either already valid <a> tags or the URL-looking strings you are looking for, whichever comes first. A capturing group is used to detect which of the two was matched. If a valid link was matched, simply return it unmodified. Otherwise, return your desired replacement.
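To make the idea concrete, here is a minimal runnable sketch. The `linkify` name and the short URL pattern are assumptions for illustration, and the full RegEx from the question could be substituted for the second alternative. The first alternative is also widened to cover any single tag, so that URLs inside `<img src="...">` are skipped as well.

```javascript
// Hypothetical helper; the URL pattern is a simplified stand-in.
function linkify(html) {
  // Alternative 1 (captured as `skip`): a whole <a>...</a> element or any single tag.
  // Alternative 2 (captured as `url`): a bare URL-looking string.
  var pattern = /(<a\b[^>]*>[\s\S]*?<\/a>|<[^>]+>)|((?:https?:\/\/|www\.)[^\s<>"']+)/gi;
  return html.replace(pattern, function (match, skip, url) {
    if (skip) {
      return skip; // already markup: return it unchanged
    }
    // Assumption: URLs without a protocol get http:// prepended in the href.
    var href = /^https?:\/\//i.test(url) ? url : 'http://' + url;
    return '<a href="' + href + '">' + url + '</a>';
  });
}

console.log(linkify('Visit www.stackoverflow.com or <a href="http://x.com">http://x.com</a>'));
// Visit <a href="http://www.stackoverflow.com">www.stackoverflow.com</a> or <a href="http://x.com">http://x.com</a>
```

Because the markup alternative comes first and the scan moves left to right, any URL inside a tag is consumed as part of that tag before the URL alternative gets a chance to match it.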

A different approach, which got kind of ugly: I iterate through all matches, rebuild the source HTML from the non-matching parts, and for each match I check the character at matchIndex - 1 and add the link tag or not.

This has the advantage that the already crazily complicated RegEx does not get any more complicated, and you can use full JavaScript to check whether the current string is part of an HTML element.

If you factor out the iteration code, it might even end up looking nice.

var urlRegEx = /((([A-Za-z]{3,9}:(?:\/\/)?)(?:[\-;:&=\+\$,\w]+@)?[A-Za-z0-9\.\-]+|(?:www\.|[\-;:&=\+\$,\w]+@)[A-Za-z0-9\.\-]+)((?:\/[\+~%\/\.\w\-]*)?\??(?:[\-\+=&;%@\.\w]*)#?(?:[\.\!\/\\\w]*))?)/g;

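var source = $('#source').html();
var dest = "";
var lastMatchEnd = 0;
var match; // declared explicitly so it does not leak as a global
while ((match = urlRegEx.exec(source)) !== null) {
  // copy the text between the previous match and this one unchanged
  dest += source.substring(lastMatchEnd, match.index);
  var end = match.index + match[0].length;
  var lastChar = source.charAt(match.index - 1);
  if (lastChar === '"' || lastChar === '>') { // inside a tag or an existing link
    dest += match[0];
  } else {
    // the original left href empty; using the matched URL here is an assumption
    dest += "<a href='" + match[0] + "'>" + match[0] + "</a>";
  }
  lastMatchEnd = end;
}
dest += source.substring(lastMatchEnd); // copy whatever follows the last match
$('#target').html(dest);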
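Following the remark about factoring out the iteration, here is one possible shape for it. The `replaceMatches` and `linkifyBare` names are hypothetical, the URL pattern is a simplified stand-in for the one above, and the sketch works on a plain string so that it runs without jQuery.

```javascript
// Hypothetical helper: walks over all matches of a global RegEx and lets
// a callback decide what each match is replaced with.
function replaceMatches(source, regex, decide) {
  var dest = '';
  var lastMatchEnd = 0;
  var match;
  regex.lastIndex = 0; // start from the beginning; regex must have the g flag
  while ((match = regex.exec(source)) !== null) {
    if (match[0] === '') { regex.lastIndex++; continue; } // guard against empty matches
    dest += source.substring(lastMatchEnd, match.index);
    dest += decide(match[0], source.charAt(match.index - 1));
    lastMatchEnd = match.index + match[0].length;
  }
  return dest + source.substring(lastMatchEnd);
}

// Simplified stand-in for the URL RegEx from the question (an assumption).
var simpleUrlRegEx = /(?:https?:\/\/|www\.)[^\s<>"']+/g;

function linkifyBare(source) {
  return replaceMatches(source, simpleUrlRegEx, function (url, previousChar) {
    if (previousChar === '"' || previousChar === '>') {
      return url; // looks like it is already inside a tag
    }
    return '<a href="' + url + '">' + url + '</a>';
  });
}

console.log(linkifyBare('go to www.stackoverflow.com now'));
// go to <a href="www.stackoverflow.com">www.stackoverflow.com</a> now
```

Keeping the "is this already inside a tag?" decision in the callback means that check can grow (for example, looking further back than one character) without touching the loop or the RegEx.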
