Python regex to capture various URL pattern groups

Question

I have a dataset containing strings like this one, and I want to remove all the URLs from it:

http://google.com having trouble finding regex https://google.com for this case http // google com / test some gibberish https // google. com / test / test1 great http.//google.org

Currently, I am using this regex pattern to find all the URLs:

https?:?\s?\/\/\s?\S+

Ideally, it should capture all the URLs, which in this case are:

http://google.com
https://google.com
http // google com / test
https // google. com / test / test1
http.//google.org

But with the regex pattern I have, it captures only:

http://google.com
https://google.com
http // google
https // google

Link to Regex.

Answer 1

You may use

https?[:.]?\s?\/\/(?:\s*[^\/\s.]+)+(?:\s*\.\s*[^\/\s.]+)*(?:\s*\/\s*[^\/\s]+)*

See the regex demo.

Details

https? - http or https
[:.]? - an optional : or .
\s? - an optional whitespace
\/\/ - a // char sequence
(?:\s*[^\/\s.]+)+ - 1 or more occurrences of the following (this matches the host name parts before the first dot, which may be separated by whitespace, as in google com):
- \s* - 0 or more whitespaces
- [^\/\s.]+ - 1 or more chars other than /, . and whitespace
(?:\s*\.\s*[^\/\s.]+)* - 0 or more sequences of:
- \s*\.\s* - a dot enclosed with 0+ whitespaces
- [^\/\s.]+ - 1 or more chars other than /, . and whitespace
(?:\s*\/\s*[^\/\s]+)* - 0 or more sequences of:
- \s*\/\s* - a / enclosed with 0+ whitespaces
- [^\/\s]+ - 1 or more chars other than / and whitespace
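
As a rough sketch of how the pattern above could be applied in Python: the snippet below compiles it, lists the matches in the sample string from the question, and then strips them out. The names URL_PATTERN, text, urls and cleaned are illustrative only, and the whitespace clean-up after the removal is an added assumption, not part of the answer. The backslashes before / are dropped because Python does not need them.

```python
import re

# The pattern from the answer, written as a raw string.
# "/" needs no escaping in Python, so "\/" becomes "/".
URL_PATTERN = re.compile(
    r"https?[:.]?\s?//"          # scheme, optional ":" or ".", optional space, "//"
    r"(?:\s*[^/\s.]+)+"          # host name parts before the first dot
    r"(?:\s*\.\s*[^/\s.]+)*"     # dot-separated parts, spaces allowed around the dot
    r"(?:\s*/\s*[^/\s]+)*"       # path segments, spaces allowed around the slash
)

# Sample string taken from the question.
text = (
    "http://google.com having trouble finding regex https://google.com "
    "for this case http // google com / test some gibberish "
    "https // google. com / test / test1 great http.//google.org"
)

# All groups are non-capturing, so findall returns the full matches.
urls = URL_PATTERN.findall(text)
for url in urls:
    print(repr(url))

# Remove the URLs, then collapse the leftover runs of whitespace
# (the collapsing step is an assumption about the desired output).
cleaned = re.sub(r"\s{2,}", " ", URL_PATTERN.sub("", text)).strip()
print(cleaned)
```

One caveat with this pattern: because the first group allows whitespace between name parts, a broken URL with no dot or slash after the host (for example http://google having trouble) would also swallow the plain words that follow it, so it is worth checking the matches on real data before deleting them.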

Answer 2

The joy of Python: practically anything you can imagine already exists as a library.

See https://github.com/madisonmay/CommonRegex

This package provides many ready-made regex patterns (for URLs and many other formats), so you don't have to work them out yourself.

Answer 1: score 1, accepted, 2020-06-11 13:14:33

Answer 2: score -1, 2020-06-11 04:35:04
