Python regex capture various url pattern groups
I have a dataset containing strings like this, and I want to remove all the URLs from it:
http://google.com having trouble finding regex https://google.com for this case http // google com / test some gibberish https // google. com / test / test1 great http.//google.org
Right now I am using this regex pattern to find all the URLs:
https?:?\s?\/\/\s?\S+
Ideally, it should capture all the URLs, which in this case are:
http://google.com
https://google.com
http // google com / test
https // google. com / test / test1
http.//google.org
but with the regex pattern I have, it captures only:
http://google.com
https://google.com
http // google
https // google
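For reference, here is a minimal sketch that runs the pattern above against the sample string with Python's re module. The variable names are only illustrative. It shows the two reasons for the shortfall: \S+ stops at the first whitespace after the slashes, and :? does not accept a dot, so http.//google.org is never matched.

```python
import re

# Sample string copied from the question.
text = (
    "http://google.com having trouble finding regex https://google.com "
    "for this case http // google com / test some gibberish "
    "https // google. com / test / test1 great http.//google.org"
)

# The pattern from the question, unchanged.
original_pattern = re.compile(r"https?:?\s?\/\/\s?\S+")

# findall returns every non-overlapping match as a string.
original_matches = original_pattern.findall(text)
for match in original_matches:
    print(repr(match))
```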
You may use
https?[:.]?\s?\/\/(?:\s*[^\/\s.]+)+(?:\s*\.\s*[^\/\s.]+)*(?:\s*\/\s*[^\/\s]+)*
See the regex demo.
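As a rough sketch of how this pattern could be applied in Python (the variable names are my own, and the whitespace cleanup at the end is an extra step that is not part of the regex itself), re.findall lists the URLs and re.sub removes them:

```python
import re

# Sample string copied from the question.
text = (
    "http://google.com having trouble finding regex https://google.com "
    "for this case http // google com / test some gibberish "
    "https // google. com / test / test1 great http.//google.org"
)

# The suggested pattern, unchanged.
url_pattern = re.compile(
    r"https?[:.]?\s?\/\/(?:\s*[^\/\s.]+)+(?:\s*\.\s*[^\/\s.]+)*(?:\s*\/\s*[^\/\s]+)*"
)

# List every URL-like span the pattern finds.
found = url_pattern.findall(text)
for url in found:
    print(repr(url))

# Remove the URLs, then collapse the leftover runs of whitespace.
cleaned = " ".join(url_pattern.sub("", text).split())
print(cleaned)
```

Because the domain part allows whitespace between name parts, a dotless host followed by ordinary words (for example http://localhost some text) would swallow those words too, so it is worth checking the matches against your own data before deleting anything.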
Details
https? - http or https
[:.]? - an optional : or .
\s? - an optional whitespace
\/\/ - a // char sequence
(?:\s*[^\/\s.]+)+ - (to match all domain name parts till the last . before TLD) 1 or more occurrences of
    \s* - 0 or more whitespaces
    [^\/\s.]+ - 1 or more chars other than /, . and whitespace
(?:\s*\.\s*[^\/\s.]+)* - 0 or more sequences of
    \s*\.\s* - a dot enclosed with 0+ whitespaces
    [^\/\s.]+ - 1 or more chars other than /, . and whitespace
(?:\s*\/\s*[^\/\s]+)* - 0 or more sequences of
    \s*\/\s* - a / enclosed with 0+ whitespaces
    [^\/\s]+ - 1 or more chars other than / and whitespace

The joy of Python: practically anything you can imagine already exists as a library.
See https://github.com/madisonmay/CommonRegex
This package provides many ready-made regex patterns (for URLs and many other formats), so you don't have to work out how to write them yourself.