
Match a given regex except if a given word exists (lookahead or lookbehind)

I am using JavaScript regex to parse a series of URLs. I need to match a number in a URL (it's actually more complicated, but I'm simplifying), but only where a given word is not in the URL.

Namely, I want to exclude lines with the word 'changelogs' in them, and would therefore capture '1047', '1048', '1245' and '1049' from the following list:

http://www.opera.com/docs/changelogs/unified/1215/
http://www.whatever.com/docs/changelogs/anythingelse/anything/1215/
http://www.blabblah/security/advisory/1047
http://booger/security/advisory/1048/
ftp://msn.global.whatever/somethingelse/1245
whatever/it/doesnt/matter/could/be/anything/i/still/want/this/number/1049/

I know I need some kind of lookaround (lookahead or lookbehind), but I'm striking out. Here is the last pattern I've tried:

(?!changelogs)(\d+)

Here is the regex101 sandbox I'm using.

Also, it's important that the only match is the actual number. I don't want anything else to match.


Here is what my .NET code looks like (note that "BulletinOrAdvisoryPattern" is the regex in question):

Regex bulletinPattern = new Regex(@matchingDomain.Vendor.BulletinOrAdvisoryPattern, RegexOptions.IgnoreCase);
Match bulletinMatch = bulletinPattern.Match(referenceTitle);

if (bulletinMatch.Success)
{
    // Found the bulletin ID in the NVD Reference Title
    return bulletinMatch.Value;
}

The "ugly" regex you need is

(?<=http://www\.opera\.com\b(?!.*/changelogs(?:/|$))\S*)\d+

See the .NET regex demo.

However, all you need is

var result = input.Contains("/changelogs/") ? "" : input.Trim('/').Split('/').LastOrDefault();

See the IDEONE C# demo:

var lst = new List<string>() {"http://w...content-available-to-author-only...a.com/docs/changelogs/unified/1215/",
    "http://w...content-available-to-author-only...a.com/docs/changelogs/anythingelse/anything/1215/",
    "http://w...content-available-to-author-only...a.com/security/advisory/1047",
    "http://w...content-available-to-author-only...a.com/security/advisory/1048/",
    "http://w...content-available-to-author-only...a.com/doesnt/matter/could/be/anything/1049/"};
lst.ForEach(m => Console.WriteLine(
        m.Contains("/changelogs/") ? "" : m.Trim('/').Split('/').LastOrDefault()
    ));
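
For readers who need the same split-based idea in JavaScript rather than C#, a minimal sketch might look like the following. The function name lastSegmentUnlessChangelog is hypothetical, and the sample URLs are taken from the question. Like the C# one-liner, it returns the last path segment whether or not it is numeric.

```javascript
// Hypothetical helper mirroring the C# one-liner: no regex lookarounds needed.
function lastSegmentUnlessChangelog(url) {
  // Skip any URL that contains a "changelogs" path segment.
  if (url.includes('/changelogs/')) {
    return '';
  }
  // Trim leading/trailing slashes (like C# Trim('/')), then take the last segment.
  var parts = url.replace(/^\/+|\/+$/g, '').split('/');
  return parts[parts.length - 1];
}

var urls = [
  'http://www.opera.com/docs/changelogs/unified/1215/',
  'http://booger/security/advisory/1048/',
  'ftp://msn.global.whatever/somethingelse/1245'
];
var ids = urls.map(lastSegmentUnlessChangelog);
```

Here ids holds one entry per input URL, with an empty string for the excluded changelogs URL.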

UPDATE

You switched the language from C# to JavaScript, which changes the situation drastically, since the JS regex engine does not support lookbehind (engines implementing ES2018 or later do, but older ones reject the syntax).

Thus, you have to work around it: there are ways to mimic a lookbehind, or you can just use a capturing group.

If you can use capturing, try

/^(?!.*\/changelogs(?:\/|$)).*\/(\d+)/

See the regex demo.

var re = /^(?!.*\/changelogs(?:\/|$)).*\/(\d+)/gmi;
var str = 'http://www.opera.com/docs/changelogs/unified/1215/\nhttp://www.whatever.com/docs/changelogs/anythingelse/anything/1215/\nhttp://www.blabblah/security/advisory/1047\nhttp://booger/security/advisory/1048/\nftp://msn.global.whatever/somethingelse/1245\nwhatever/it/doesnt/matter/could/be/anything/i/still/want/this/number/1049/';
var res = [];
var m;
while ((m = re.exec(str)) !== null) {
    res.push(m[1]); // group 1 holds the number
}
document.body.innerHTML = JSON.stringify(res, 0, 4);

Or, use an optional group (if you are replacing):

var re = /(\/changelogs\/.*)?\/(\d+)/gi;
var str = 'http://www.opera.com/docs/changelogs/unified/1215/\nhttp://www.whatever.com/docs/changelogs/anythingelse/anything/1215/\nhttp://www.blabblah/security/advisory/1047\nhttp://booger/security/advisory/1048/\nftp://msn.global.whatever/somethingelse/1245\nwhatever/it/doesnt/matter/could/be/anything/i/still/want/this/number/1049/';
var result = str.replace(re, function (m, g1, g2) {
    // Return the match unchanged when the optional "changelogs" group took part
    return g1 ? m : "NEW_VAL";
});
document.body.innerHTML = result;
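
On engines that do implement ES2018 lookbehind, the number alone can be matched without a capturing group, which is what the question asked for. The following is only a sketch under that assumption: the helper name extractId is hypothetical, and the trailing (?=\/?$) is an added assumption that the wanted number is the last path segment.

```javascript
// Assumes an engine with ES2018 lookbehind support; older engines throw a SyntaxError here.
// Lookbehind: the line must not contain "/changelogs/" (or end in "/changelogs"),
// and the digits must directly follow a slash.
var idPattern = /(?<=^(?!.*\/changelogs(?:\/|$)).*\/)\d+(?=\/?$)/;

// Hypothetical helper: returns the trailing number, or null when the URL is excluded.
function extractId(url) {
  var m = idPattern.exec(url);
  return m ? m[0] : null;
}

var wanted = extractId('http://booger/security/advisory/1048/');
var skipped = extractId('http://www.opera.com/docs/changelogs/unified/1215/');
```

Because the whole match is just the digits, m[0] can be used directly, much like bulletinMatch.Value in the .NET code.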

Something like the below should do it. If you are not only interested in opera, you can make this more general by replacing opera with .+. Additionally, you could match things like com and net by putting something like (com|net|org|gov) in place of com:

http:\/\/www\.opera\.com(?!.*changelogs)(\/[^\/]+)*\/(\d+)\/{0,1}

Here is your regex101 updated to reflect this.
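
The generalization described above could be sketched in JavaScript roughly as follows. Several things here are assumptions rather than part of the answer: the host is matched with [^\/]+ instead of .+ so that it cannot run past the first slash, the path group is made non-capturing so the number lands in group 1, and the names generalPattern and extractGeneralId as well as the example.org URL are hypothetical.

```javascript
// "opera" replaced by [^\/]+ and "com" by a small list of top-level domains.
var generalPattern = /http:\/\/www\.[^\/]+\.(?:com|net|org|gov)(?!.*changelogs)(?:\/[^\/]+)*\/(\d+)\/?/;

// Hypothetical helper: returns the last numeric path segment, or null if excluded.
function extractGeneralId(url) {
  var m = generalPattern.exec(url);
  return m ? m[1] : null;
}

var kept = extractGeneralId('http://www.example.org/security/advisory/1047/');
var dropped = extractGeneralId('http://www.whatever.com/docs/changelogs/anythingelse/anything/1215/');
```

Note that this variant still requires the http://www. prefix and a listed top-level domain, so URLs such as ftp://msn.global.whatever/somethingelse/1245 from the question would not match it.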

This pattern excludes lines with 'changelogs' in them and finds the last occurrence of a number enclosed by slashes.

(?:\/)(?!.*changelogs)(?:\/[^\/]+)*\/(\d+)\/{0,1}

Here is the updated regex101.
