
Regular expression failing

I am trying to parse URLs in a string of text. Currently my RegEx pattern looks like this:

(http(s)?://)?\S+\.(com|net|org|edu)\S*(?<!\W)

Sample text:

On that sample page (http://example.com/test/new.php), when you use the button, they are there, but when you use the inline, they are not.

Right now it keeps capturing the opening "(". I can't seem to get this right. Any tips? I am using .NET 4.0 and C# to parse this.

UPDATE: a sample text more reflective of the links it needs to capture

On that sample page (http://example.com/test/new.php), when you use the button, it redirects to sample.com/help instead of https://www.example.com or just example.com

Because you have a ? after your first group, (http(s)?://)?, the regex engine is free to try the expression without matching that group. The next part of the expression, \S+, is then free to match the parenthesis along with the rest of the URL.

Removing the ? should do the trick in this case, but doesn't solve the problem of making it optional. Let me know if that part actually needs to be optional and maybe give some additional sample data.
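To make the behaviour concrete, here is a minimal sketch in Python (used only as a stand-in for .NET; the constructs in this pattern should behave the same way on this input). The names ORIGINAL, SAMPLE and first are illustrative, not from the question.

```python
import re

# Original pattern from the question, with the optional scheme group.
ORIGINAL = re.compile(r"(http(s)?://)?\S+\.(com|net|org|edu)\S*(?<!\W)")

SAMPLE = ("On that sample page (http://example.com/test/new.php), when you use "
          "the button, they are there, but when you use the inline, they are not.")

first = ORIGINAL.search(SAMPLE)

print(first.group(0))          # whole match, which starts at the "("
print(first.group(1))          # None: the optional scheme group was skipped
print(SAMPLE[first.start()])   # the character where the match begins
```

The match starts one character before "http", so the scheme group never gets a chance to participate and \S+ consumes "(http://example" instead.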

If you add a \b (word boundary) anchor to the front of the regular expression, it works as expected:

\b(http(s)?://)?\S+\.(com|net|org|edu)\S*(?<!\W)
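A quick way to check the word-boundary version against the updated sample text is sketched below, again in Python as a stand-in for .NET. WITH_BOUNDARY, SAMPLE and urls are names made up for this example.

```python
import re

# The question's pattern with a leading \b, so a match can only start
# at a transition between a non-word and a word character.
WITH_BOUNDARY = re.compile(r"\b(http(s)?://)?\S+\.(com|net|org|edu)\S*(?<!\W)")

SAMPLE = ("On that sample page (http://example.com/test/new.php), when you use "
          "the button, it redirects to sample.com/help instead of "
          "https://www.example.com or just example.com")

# finditer + group(0) gives the full match text; findall would return
# tuples of the capture groups instead.
urls = [m.group(0) for m in WITH_BOUNDARY.finditer(SAMPLE)]

for url in urls:
    print(url)
```

This works because there is no word boundary between the space and the "(", so the engine cannot start a match on the parenthesis. It does assume every URL of interest begins with a word character.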

The problem is that the \S+ is matching more greedily than the (http(s)?://)?.

Your expression effectively becomes:

\S+\.(com|net|org|edu)\S*(?<!\W)

You can see this by removing the "?" from the http expression:

(http(s)?://)\S+\.(com|net|org|edu)\S*(?<!\W)
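The comparison can be sketched as follows (Python standing in for .NET; the helper matches and the variable names are hypothetical). It runs the three variants over the first sample text from the question.

```python
import re

SAMPLE = ("On that sample page (http://example.com/test/new.php), when you use "
          "the button, they are there, but when you use the inline, they are not.")


def matches(pattern):
    """Return the full text of every match of pattern in SAMPLE."""
    return [m.group(0) for m in re.finditer(pattern, SAMPLE)]


# Scheme group optional (the question's pattern).
optional_scheme = matches(r"(http(s)?://)?\S+\.(com|net|org|edu)\S*(?<!\W)")
# Scheme group dropped entirely.
no_scheme = matches(r"\S+\.(com|net|org|edu)\S*(?<!\W)")
# Scheme group made mandatory.
required_scheme = matches(r"(http(s)?://)\S+\.(com|net|org|edu)\S*(?<!\W)")

print(optional_scheme == no_scheme)  # same result with or without the group
print(required_scheme)
```

On this input the optional-scheme and no-scheme patterns return the same match, parenthesis included, while the mandatory-scheme pattern returns the clean URL.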

You also might want to read this for more thoughts on the problem's real difficulty.

https://mathiasbynens.be/demo/url-regex

Thanks to gymbrall for showing me why it's wrong, and to PaulF for pointing me to a Stack Overflow question with a partial answer. I was able to modify the regex from that question to fit my needs:

((http|ftp|https):\/\/)*([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?(?<!\W)

With the sample text:

On that sample page (http://example.com/test/new.php), when you use the button, it redirects to sample.com/help instead of https://www.example.com or just example.com

The regex will correctly match:

http://example.com/test/new.php
sample.com/help
https://www.example.com
example.com
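For anyone who wants to reproduce that result, here is a small sketch in Python (a stand-in for .NET, with the HTML entity in the character classes written as a plain &). URL_PATTERN, SAMPLE and found are illustrative names.

```python
import re

# Final pattern from this answer: optional scheme, dotted host name,
# optional path/query that must not end in trailing punctuation.
URL_PATTERN = re.compile(
    r"((http|ftp|https):\/\/)*"
    r"([\w\-_]+(?:(?:\.[\w\-_]+)+))"
    r"([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?"
    r"(?<!\W)"
)

SAMPLE = ("On that sample page (http://example.com/test/new.php), when you use "
          "the button, it redirects to sample.com/help instead of "
          "https://www.example.com or just example.com")

# group(0) is the whole URL; the numbered groups hold scheme, host and path.
found = [m.group(0) for m in URL_PATTERN.finditer(SAMPLE)]

for url in found:
    print(url)
```

Note that the host part accepts any dotted token rather than a fixed list of top-level domains, so text such as a bare file name with an extension would also match.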

I'm not 100% sure why that isn't working, but this one should get the job done for you.

(http://?|https://?)\S+\.(com|net|org|edu)\S*(?<!\W)

Give it a shot here: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
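If no online tester is at hand, the pattern can also be checked locally. The sketch below uses Python as a stand-in for .NET and runs it over the updated sample text from the question; SCHEME_REQUIRED, SAMPLE and hits are names invented for the example.

```python
import re

# Pattern from this answer: the http:// or https:// prefix is required.
SCHEME_REQUIRED = re.compile(
    r"(http://?|https://?)\S+\.(com|net|org|edu)\S*(?<!\W)"
)

SAMPLE = ("On that sample page (http://example.com/test/new.php), when you use "
          "the button, it redirects to sample.com/help instead of "
          "https://www.example.com or just example.com")

hits = [m.group(0) for m in SCHEME_REQUIRED.finditer(SAMPLE)]

for hit in hits:
    print(hit)
```

Since the prefix is mandatory, this finds the two URLs that carry a scheme and leaves out sample.com/help and the bare example.com.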

