Regex check if given string is relative URL
First, I have read this question about how to check if a string is an absolute or relative URL. My problem is that I need a regex to check whether a given string is a relative URL, i.e. a regex that checks that a string does not start with any protocol or with a double slash (//).
Actually, I am doing web scraping with Beautiful Soup and I want to retrieve all relative links. Beautiful Soup uses this syntax:
soup.findAll(href=re.compile(REGEX_TO_MATCH_RELATIVE_URL))
So, that's why I need this.
Test cases are:
about.html
tutorial1/
tutorial1/2.html
/
/experts/
../
../experts/
../../../
./
./about.html
Thank you so much.
Since you find it helpful, I am posting my suggestion.
The regular expression can be:
^(?!www\.|(?:http|ftp)s?://|[A-Za-z]:\\|//).*
Note that it becomes less and less readable as you add exclusions or more alternatives. So, perhaps, use VERBOSE mode (declared with re.X):
import re

p = re.compile(r"""
    ^                      # At the start of the string, ...
    (?!                    # check that the next characters are not...
        www\.              # URLs starting with www.
      |
        (?:http|ftp)s?://  # URLs starting with http, https, ftp, ftps
      |
        [A-Za-z]:\\        # Local full paths starting with a drive letter, colon and backslash
      |
        //                 # Protocol-relative or UNC-style locations starting with //
    )                      # End of look-ahead check
    .*                     # Match up to the end of the string
    """, re.X)
print(p.search("./about.html"))          # => There is a match
print(p.search("//dub-server1/mynode"))  # => No match
See IDEONE demo
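As a quick sanity check, here is a minimal sketch that runs the one-line form of the pattern over the test cases from the question. The absolute inputs (the example.com URLs and the C:\ path) and the helper name is_relative are made up for illustration and are not part of the original post.

```python
import re

# One-line form of the pattern from this answer.
RELATIVE_URL = re.compile(r"^(?!www\.|(?:http|ftp)s?://|[A-Za-z]:\\|//).*")

# Test cases from the question (all expected to count as relative).
relative = ["about.html", "tutorial1/", "tutorial1/2.html", "/", "/experts/",
            "../", "../experts/", "../../../", "./", "./about.html"]

# Hypothetical absolute inputs, added here for illustration only.
absolute = ["http://example.com/", "https://example.com/a.html",
            "ftp://example.com/file", "//example.com/lib.js",
            "www.example.com", "C:\\temp\\a.html"]


def is_relative(url):
    """Return True when the negative look-ahead does not reject the string."""
    return RELATIVE_URL.match(url) is not None


for url in relative + absolute:
    print(f"{url!r}: {is_relative(url)}")
```

Keep in mind that the look-ahead only lists http, https, ftp and ftps, so a string with another scheme (mailto:, for instance) would still pass as relative unless you add it to the alternatives.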
The other regexes, from Washington Guedes's answer:
^([a-z0-9]*:|.{0})\/\/.*$ - matches:
  ^ - beginning of the string
  ([a-z0-9]*:|.{0}) - 2 alternatives:
    [a-z0-9]*: - 0 or more letters or digits followed with :
    .{0} - an empty string
  \/\/.* - // and 0 or more characters other than a newline (note you do not need to escape / in Python)
  $ - end of string
So, you can rewrite it as ^(?:[a-z0-9]*:)?//.*$. The i flag should be used with this regex.
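To illustrate the rewrite, here is a small sketch that compares the original pattern with the shortened one and shows why the i flag matters. The sample strings and the helper name matches are assumptions chosen for this example only.

```python
import re

ORIGINAL = r"^([a-z0-9]*:|.{0})\/\/.*$"
REWRITTEN = r"^(?:[a-z0-9]*:)?//.*$"

# Hypothetical inputs: three absolute-looking strings, three relative ones.
samples = ["http://example.com/", "HTTPS://EXAMPLE.COM/", "//cdn.example.com/x.js",
           "about.html", "/experts/", "../"]


def matches(pattern, text, flags=0):
    """Return True if the pattern is found in the text with the given flags."""
    return re.search(pattern, text, flags) is not None


for s in samples:
    # Both patterns are run case-insensitively, as the answer recommends.
    print(s, matches(ORIGINAL, s, re.I), matches(REWRITTEN, s, re.I))

# Without re.I the character class [a-z0-9] rejects an upper-case scheme.
print(matches(REWRITTEN, "HTTPS://EXAMPLE.COM/"))
```

On these samples the two patterns agree, and the upper-case scheme is only recognized when re.I is passed.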
^[^\/]+\/[^\/].*$|^\/[^\/].*$ - is not optimal and has 2 alternatives:
Alternative 1:
  ^ - start of string
  [^\/]+ - 1 or more characters other than /
  \/ - literal /
  [^\/].*$ - a character other than / followed by any 0 or more characters other than a newline
Alternative 2:
  ^ - start of string
  \/ - literal /
  [^\/].*$ - a character other than / followed by any 0 or more characters other than a newline, up to the end of string
It is clear that the whole regex can be shortened to ^[^/]*/[^/].*$. The i option can safely be removed from the regex flags, since the pattern contains no letters.
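The shortening can be checked against the test cases from the question with a sketch like the one below. The variable names are mine; the patterns and inputs come from the thread.

```python
import re

ORIGINAL = re.compile(r"^[^\/]+\/[^\/].*$|^\/[^\/].*$")
SHORTENED = re.compile(r"^[^/]*/[^/].*$")

# Test cases from the question.
test_cases = ["about.html", "tutorial1/", "tutorial1/2.html", "/", "/experts/",
              "../", "../experts/", "../../../", "./", "./about.html"]

for url in test_cases:
    a = ORIGINAL.search(url) is not None
    b = SHORTENED.search(url) is not None
    print(f"{url!r}: original={a} shortened={b}")

# Inputs that the shortened pattern accepts.
matched = [u for u in test_cases if SHORTENED.search(u)]
print(matched)
```

Both patterns agree on every input here. Note, though, that this pattern requires a slash followed by a non-slash character, so inputs such as about.html, /, ./ and tutorial1/ are not matched by either version.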
To match absolutes:
/^([a-z0-9]*:|.{0})\/\/.*$/gmi
And to match relatives:
/^[^\/]+\/[^\/].*$|^\/[^\/].*$/gmi
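These two are written as JavaScript-style literals. For use from Python, a rough translation might look like this sketch: the m and i flags become re.M and re.I, and the g flag corresponds to iterating over all matches with finditer. The sample text is an assumption made up for this example.

```python
import re

# /.../gmi translated to Python: m -> re.M, i -> re.I, g -> finditer().
ABSOLUTE = re.compile(r"^([a-z0-9]*:|.{0})\/\/.*$", re.M | re.I)
RELATIVE = re.compile(r"^[^\/]+\/[^\/].*$|^\/[^\/].*$", re.M | re.I)

# Hypothetical multi-line input, one URL per line.
text = "http://example.com/\n/experts/\n./about.html"

# group(0) is the whole match; findall() would return only the capture group
# of the first pattern, so finditer() is used instead.
absolutes = [m.group(0) for m in ABSOLUTE.finditer(text)]
relatives = [m.group(0) for m in RELATIVE.finditer(text)]

print(absolutes)
print(relatives)
```

One caveat when matching multi-line text: the class [^\/] also matches a newline, so a line without any slash can be absorbed into the match that starts on it and ends on the following line. Checking one URL at a time avoids that.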
I prefer this one, it captures more edge cases:
(?:url\(|<(?:link|script|img)[^>]+(?:src|href)\s*=\s*)(?!['"]?(?:data|http))['"]?([^'"\)\s>]+)
Source: https://www.regextester.com/94254
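Here is a minimal sketch of how this pattern behaves on a small HTML fragment. The fragment and its file names are invented for illustration; capture group 1 holds the URL.

```python
import re

# The pattern from this answer; group 1 captures the URL itself.
URL_IN_MARKUP = re.compile(
    r"""(?:url\(|<(?:link|script|img)[^>]+(?:src|href)\s*=\s*)(?!['"]?(?:data|http))['"]?([^'"\)\s>]+)"""
)

# Hypothetical markup, made up for this example.
html = '''<link rel="stylesheet" href="css/site.css">
<script src="https://cdn.example.com/lib.js"></script>
<img src="/images/logo.png">
<img src="data:image/png;base64,AAAA">
<div style="background: url(../img/bg.png)"></div>'''

# With a single capture group, findall() returns the captured URLs.
found = URL_IN_MARKUP.findall(html)
print(found)
```

On this fragment the http(s) and data: values are skipped and the three relative references are returned. As written, the pattern only looks at link, script and img tags and at url(...), so href values on a tags are not picked up.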