
Python regex to capture various URL pattern groups

I have a dataset containing strings like the one below, and I want to remove all the URLs from it:

http://google.com having trouble finding regex https://google.com for this case http // google com / test some gibberish https // google. com / test / test1 great http.//google.org

Currently, I am using this regex pattern to find all the URLs:

https?:?\s?\/\/\s?\S+

Ideally, it should capture all the URLs, which in this case are:

  • http://google.com

  • https://google.com

  • http // google com / test

  • https // google. com / test / test1

  • http.//google.org

But with the regex pattern I have, it captures only:

  • http://google.com

  • https://google.com

  • http // google

  • https // google

Link to Regex.
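The behaviour described in the question can be reproduced with a short Python sketch. It assumes the pattern is used with the standard `re` module; the variable names are illustrative only. Two things stand out when it runs: `\S+` consumes a single whitespace-free run, so spaced-out URLs are cut short, and `:?` does not accept a `.`, so `http.//google.org` is missed entirely.

```python
import re

# Sample string from the question.
text = (
    "http://google.com having trouble finding regex https://google.com "
    "for this case http // google com / test some gibberish "
    "https // google. com / test / test1 great http.//google.org"
)

# The pattern from the question, unchanged.
original = re.compile(r"https?:?\s?\/\/\s?\S+")

matches = original.findall(text)
for m in matches:
    # Each match ends at the first whitespace after the host starts.
    print(repr(m))
```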

You may use

https?[:.]?\s?\/\/(?:\s*[^\/\s.]+)+(?:\s*\.\s*[^\/\s.]+)*(?:\s*\/\s*[^\/\s]+)*

See the regex demo.

Details

  • https? - http or https
  • [:.]? - an optional : or .
  • \s? - an optional whitespace
  • \/\/ - a // char sequence
  • (?:\s*[^\/\s.]+)+ - 1 or more occurrences (to match all domain name parts up to the last . before the TLD) of:
    • \s* - 0 or more whitespaces
    • [^\/\s.]+ - 1 or more chars other than /, . and whitespace
  • (?:\s*\.\s*[^\/\s.]+)* - 0 or more sequences of:
    • \s*\.\s* - a dot enclosed with 0+ whitespaces
    • [^\/\s.]+ - 1 or more chars other than /, . and whitespace
  • (?:\s*\/\s*[^\/\s]+)* - 0 or more sequences of:
    • \s*\/\s* - a / enclosed with 0+ whitespaces
    • [^\/\s]+ - 1 or more chars other than / and whitespace
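As a minimal sketch of how this pattern might be applied in Python, the snippet below compiles it with the standard `re` module and uses it both to list and to remove the URLs in the sample string from the question. The helper names `find_urls` and `strip_urls` are hypothetical, not part of the answer. Note that because the first group allows whitespace between name parts, the pattern can also swallow ordinary words that directly follow a URL with no dot or slash in its host, so it is worth checking against real data.

```python
import re

# Sample string from the question.
TEXT = (
    "http://google.com having trouble finding regex https://google.com "
    "for this case http // google com / test some gibberish "
    "https // google. com / test / test1 great http.//google.org"
)

# The pattern from the answer, split across lines for readability.
URL_RE = re.compile(
    r"https?[:.]?\s?\/\/"        # scheme, optional ':' or '.', optional space, '//'
    r"(?:\s*[^\/\s.]+)+"         # domain name parts, possibly space-separated
    r"(?:\s*\.\s*[^\/\s.]+)*"    # '.part' sequences, spaces allowed around the dot
    r"(?:\s*\/\s*[^\/\s]+)*"     # '/segment' sequences, spaces allowed around '/'
)


def find_urls(text):
    """Return every substring the pattern treats as a URL (hypothetical helper)."""
    return URL_RE.findall(text)


def strip_urls(text):
    """Remove matched URLs and collapse leftover whitespace (hypothetical helper)."""
    return " ".join(URL_RE.sub(" ", text).split())


for url in find_urls(TEXT):
    print(repr(url))
print(strip_urls(TEXT))
```

On the sample string this finds the five URLs listed in the question and leaves only the surrounding words.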

The joy of Python: practically anything you can imagine already exists as a library.

See https://github.com/madisonmay/CommonRegex

This package provides many ready-made regex patterns (for URLs and many other formats), so you don't have to rediscover how to write them.
