
Python regex to remove URLs and domain names from a string

I'm looking for a regex to remove every URL or domain name from a string, so that:

string='this is my content domain.com more content http://domain2.org/content and more content domain.net/page'

becomes

'this is my content more content and more content'

Removing the most common TLDs is enough for me, so I tried

string = re.sub(r'\w+(.net|.com|.org|.info|.edu|.gov|.uk|.de|.ca|.jp|.fr|.au|.us|.ru|.ch|.it|.nel|.se|.no|.es|.mil)\s?','',string)

but this removes too much, not only URLs. What would be the correct syntax?

You should escape all those dots or, better yet, move the dot outside the group and escape it once. An unescaped dot matches any character, which is why your pattern removes too much. You could also match from the first non-whitespace character through the last one, so the whole URL goes, like this:

re.sub(r'[\S]+\.(net|com|org|info|edu|gov|uk|de|ca|jp|fr|au|us|ru|ch|it|nel|se|no|es|mil)[\S]*\s?','',string)

With this pattern, the following:
'this is my content domain.com more content http://domain2.org/content and more content domain.net/page thingynet stuffocom'
becomes:

'this is my content more content and more content thingynet stuffocom'
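To make the difference concrete, here is a small sketch comparing an unescaped pattern with an escaped one. The TLD list is shortened to three entries for readability, and the names `unescaped`, `escaped`, and `sample` are only illustrative:

```python
import re

text = ("this is my content domain.com more content "
        "http://domain2.org/content and more content domain.net/page")

# Unescaped: "." matches any character, so ".net" also matches "rnet".
unescaped = re.compile(r'\w+(.net|.com|.org)\s?')

# Escaped once, outside the group: only a literal dot before the TLD matches.
escaped = re.compile(r'\S+\.(net|com|org)\S*\s?')

sample = "my internet works"

print(repr(unescaped.sub('', sample)))  # the ordinary word "internet" is removed
print(repr(escaped.sub('', sample)))    # nothing is removed, there is no literal dot
print(repr(escaped.sub('', text)))      # URLs and domains are removed
```

Note that the escaped pattern still leaves a trailing space when the last token of the string is removed, so a final `.strip()` may be wanted.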

This is an alternative solution:

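import re

# Read the whole file; the with-block closes it automatically.
with open('test.txt', 'r') as f:
    content = f.read()

# Match any run of non-whitespace characters containing .com, .org or .net.
pattern = r"\S*\.(com|org|net)\S*"
result = re.sub(pattern, '', content)
print(result)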

Input:

this is my content domain.com more content http://domain2.org/content and more content domain.net/page' and https://www.foo.com/page.php 

Output:

this is my content  more content  and more content  and
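The output above keeps a double space wherever a URL was removed. If that matters, one possible follow-up is to collapse whitespace after the substitution. This is only a sketch: the helper name `strip_urls` and the constant `URL_PATTERN` are made up for illustration, and the TLD list is the same short one used in this answer.

```python
import re

# Same short TLD list as above; extend it as needed.
URL_PATTERN = re.compile(r"\S*\.(?:com|org|net)\S*")


def strip_urls(text):
    """Remove URL-like tokens, then collapse the leftover whitespace runs."""
    without_urls = URL_PATTERN.sub("", text)
    # split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so joining with one space normalizes it.
    return " ".join(without_urls.split())


cleaned = strip_urls(
    "this is my content domain.com more content http://domain2.org/content "
    "and more content domain.net/page and https://www.foo.com/page.php"
)
print(cleaned)
```

Be aware that this also flattens newlines and tabs into single spaces, which may not be what you want for multi-line input.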
