Python regex to remove URLs and domain names in a string
I am looking for a regex that removes every URL or domain name from a string, so that:
string='this is my content domain.com more content http://domain2.org/content and more content domain.net/page'
becomes
'this is my content more content and more content'
Removing the most common top-level domains is enough for me, so I tried
string = re.sub(r'\w+(.net|.com|.org|.info|.edu|.gov|.uk|.de|.ca|.jp|.fr|.au|.us|.ru|.ch|.it|.nel|.se|.no|.es|.mil)\s?','',string)
But this removes too much, not just the URLs. What is the correct syntax?
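You should escape all of those dots, or better, move the dot outside the group and escape it once. An unescaped dot matches any character, which is why the original pattern removes ordinary words too. You can also match from the first non-whitespace character up to the next whitespace, like this: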
re.sub(r'[\S]+\.(net|com|org|info|edu|gov|uk|de|ca|jp|fr|au|us|ru|ch|it|nel|se|no|es|mil)[\S]*\s?','',string)
With this, the following:
'this is my content domain.com more content http://domain2.org/content and more content domain.net/page thingynet stuffocom'
becomes:
'this is my content more content and more content thingynet stuffocom'
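To see why escaping the dot matters, here is a minimal sketch that runs both patterns side by side. The TLD list is shortened to three entries and the sample string is made up for illustration; neither comes from the question.

```python
import re

# Unescaped dot, as in the question: "." matches ANY character,
# so "ynet" in "thingynet" and "ocom" in "stuffocom" both qualify.
unescaped = re.compile(r"\w+(.net|.com|.org)\s?")

# Escaped dot, as in the answer: only a literal "." before the TLD matches.
escaped = re.compile(r"\S+\.(net|com|org)\S*\s?")

# Made-up sample containing one real domain and two ordinary words.
sample = "visit domain.com or thingynet stuffocom"

result_unescaped = unescaped.sub("", sample)
result_escaped = escaped.sub("", sample)

print(repr(result_unescaped))  # the ordinary words are removed as well
print(repr(result_escaped))    # only domain.com is removed
```

The unescaped version strips the two ordinary words along with the domain, while the escaped version leaves them alone.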
Here is an alternative solution:
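import re

# Read the whole input file; the with-block closes it automatically
with open('test.txt', 'r') as f:
    content = f.read()

# Match any whitespace-delimited token that contains .com, .org or .net
pattern = r"[^\s]*\.(com|org|net)\S*"
result = re.sub(pattern, '', content)
print(result)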
Input:
this is my content domain.com more content http://domain2.org/content and more content domain.net/page' and https://www.foo.com/page.php
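Output (shown with runs of spaces collapsed; the pattern leaves the whitespace around each removed token in place):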
this is my content more content and more content and
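Two loose ends in the patterns above are the leftover whitespace and tokens such as file.comment, which contain ".com" without being a domain. The following is only a sketch of one way to handle both, not part of either answer: the helper name strip_urls and the shortened TLDS tuple are assumptions made for illustration.

```python
import re

# Assumed, shortened TLD list; extend it to whatever you need.
TLDS = ("com", "org", "net")

# A whitespace-delimited token containing ".<tld>", where the TLD must end
# at a word boundary so that "file.comment" is not treated as a domain.
URL_RE = re.compile(r"\S*\.(?:" + "|".join(TLDS) + r")\b\S*")


def strip_urls(text):
    """Remove URL-like tokens, then collapse the whitespace left behind."""
    without_urls = URL_RE.sub("", text)
    # split() with no arguments drops empty pieces, so joining with a
    # single space normalizes double spaces and trims both ends.
    return " ".join(without_urls.split())


if __name__ == "__main__":
    sample = ("this is my content domain.com more content "
              "http://domain2.org/content and more content domain.net/page")
    print(strip_urls(sample))
```

Collapsing whitespace in a separate step keeps the regex simple, at the cost of also normalizing any spacing that was in the original text on purpose.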