Python regex to remove URLs and domain names from a string

Question

I'm looking for a regex to remove every URL or domain name from a string, so that:

string='this is my content domain.com more content http://domain2.org/content and more content domain.net/page'

becomes

'this is my content more content and more content'

Removing the most common TLDs is enough for me, so I tried

string = re.sub(r'\w+(.net|.com|.org|.info|.edu|.gov|.uk|.de|.ca|.jp|.fr|.au|.us|.ru|.ch|.it|.nel|.se|.no|.es|.mil)\s?','',string)

but this removes too much, not only URLs. What would be the correct syntax?

Answer 1

You should escape all those dots, or better yet, move the dot outside the group and escape it once. An unescaped dot matches any character, which is why your pattern removes too much. You could also match from the first non-space character through to the next whitespace, like this:

re.sub(r'[\S]+\.(net|com|org|info|edu|gov|uk|de|ca|jp|fr|au|us|ru|ch|it|nel|se|no|es|mil)[\S]*\s?','',string)

With that pattern, the following:

'this is my content domain.com more content http://domain2.org/content and more content domain.net/page thingynet stuffocom'

becomes:

'this is my content more content and more content thingynet stuffocom'
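
To make the effect of escaping the dot concrete, here is a minimal sketch comparing the two forms on a short string. The shortened TLD list, the sample text, and the variable names are illustrative assumptions, not part of the original question.

```python
import re

# Hypothetical sample text, chosen so an ordinary word contains "com"
text = 'welcome to domain.com today'

# Unescaped dot: "." matches ANY character, so "we" + "l" + "com"
# matches inside the word "welcome"
unescaped = r'\w+(.com|.org|.net)\s?'

# Escaped dot outside the group: requires a literal "." before the TLD
escaped = r'\S+\.(com|org|net)\S*\s?'

too_much = re.sub(unescaped, '', text)
just_domains = re.sub(escaped, '', text)

print(repr(too_much))      # 'e to today'
print(repr(just_domains))  # 'welcome to today'
```

The escaped version only touches tokens that actually contain a dot followed by one of the listed TLDs, so words such as thingynet or stuffocom are left alone.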

Answer 2

This is an alternative solution:

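import re

# Read the text to clean; the with-block closes the file automatically
with open('test.txt', 'r') as f:
    content = f.read()

# Match any run of non-whitespace that contains .com, .org, or .net
pattern = r"[^\s]*\.(com|org|net)\S*"
result = re.sub(pattern, '', content)
print(result)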

Input:

this is my content domain.com more content http://domain2.org/content and more content domain.net/page' and https://www.foo.com/page.php

Output:

this is my content  more content  and more content  and
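
Because this pattern does not consume the whitespace around each match, the output keeps double spaces and a trailing space. If that matters, one possible follow-up is to normalize the whitespace afterwards. The sketch below assumes collapsing all whitespace runs to a single space is acceptable; the helper name strip_domains is hypothetical.

```python
import re

# Same pattern as above, written with a non-capturing group
PATTERN = re.compile(r"\S*\.(?:com|org|net)\S*")

def strip_domains(text):
    """Remove URL/domain tokens, then collapse leftover whitespace.

    Hypothetical helper: str.split() with no arguments splits on any
    whitespace run, so joining with ' ' leaves single spaces only.
    """
    without_domains = PATTERN.sub('', text)
    return ' '.join(without_domains.split())

sample = ('this is my content domain.com more content '
          'http://domain2.org/content and more content domain.net/page '
          'and https://www.foo.com/page.php')
cleaned = strip_domains(sample)
print(cleaned)  # this is my content more content and more content and
```

Note that this also flattens newlines and tabs, so it is only suitable when the original line structure does not need to be preserved.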
