
Remove all forms of URLs from a given string in Python

I am new to Python and was wondering whether there is a better way to match all forms of URLs that might appear in a given string. Googling turns up many solutions that extract domains, replace URLs with links, and so on, but none that removes them from a string. I have included an example below for reference. Thanks!

str = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))', '', thestring)

print '==' + URLless_string + '=='

Error Log:

C:\Python27>python test.py
  File "test.py", line 7
SyntaxError: Non-ASCII character '\xab' in file test.py on line 7, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

There's an error in your code (in fact, two):

1. You should put a backslash before the penultimate single quote to escape it:

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', thestring)

2. You shouldn't use str as a variable name: it is not a reserved keyword, but it is the name of the built-in string type, and assigning to it shadows that built-in. Name the variable thestring (which is what your re.sub call already refers to) or anything else.
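To illustrate the second point, here is a small sketch of what goes wrong when a variable is called str. The function name shadow_demo is made up for this illustration; the keyword module is part of the standard library.

```python
import keyword

# `str` is an ordinary name bound to the built-in string type, not a keyword.
print(keyword.iskeyword('str'))   # False
print(keyword.iskeyword('for'))   # True


def shadow_demo():
    """Hypothetical helper: rebind `str` locally and then try to call it."""
    str = 'some text'             # legal, but hides the built-in inside this function
    try:
        return str(42)            # `str` is now a string object, which is not callable
    except TypeError as exc:
        return 'TypeError: ' + exc.args[0]


print(shadow_demo())
```

The assignment itself never fails, which is why the problem tends to show up later, at the first place the code needs the real str.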

For example, with both fixes applied:

import re

thestring = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', thestring)

print URLless_string

With the result:

this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, and and and etc.
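If you want this as a reusable helper, a minimal sketch could look like the following. It assumes Python 3 (print as a function, UTF-8 source by default); the names URL_PATTERN, remove_urls and sample are chosen here for illustration. Deleting a URL leaves the spaces on both sides of it behind, so the helper also collapses runs of whitespace, which is an extra step beyond the code above.

```python
import re

# Pattern copied from the corrected code above (single quote escaped with a backslash).
URL_PATTERN = re.compile(
    r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))'
)


def remove_urls(text):
    """Delete URL-like substrings, then collapse the whitespace left behind."""
    without_urls = URL_PATTERN.sub('', text)
    return re.sub(r'\s+', ' ', without_urls).strip()


sample = 'for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'
print(remove_urls(sample))  # for eg, and and and etc.
```

Compiling the pattern once with re.compile is optional; it mainly keeps the long regex in one named place.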

Include an encoding line at the top of your source file (the regex string contains non-ASCII symbols such as »), e.g.:

# -*- coding: utf-8 -*-
import re
...
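To see which characters are involved, here is a short sketch, assuming Python 3, where iterating a string yields characters and source files default to UTF-8 (so the coding line is harmless but optional there). TRAILING_CLASS and non_ascii are names made up for this illustration.

```python
# -*- coding: utf-8 -*-

# Character class copied from the end of the regex in the question.
TRAILING_CLASS = r'''[^\s`!()\[\]{};:'".,<>?«»“”‘’]'''

# Keep only the characters outside the ASCII range.
non_ascii = [ch for ch in TRAILING_CLASS if ord(ch) > 127]
print(non_ascii)                 # ['«', '»', '“', '”', '‘', '’']

# The first one is U+00AB, the '\xab' named in the error log.
print(hex(ord(non_ascii[0])))    # 0xab
```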

Also surround your regex string with triple single (or double) quotes, ''' or """, instead of single quotes, because the string itself already contains quote characters (' and ").

r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
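A minimal sketch of the triple-quoted form in use, assuming Python 3. The names URL_RE, text and cleaned, and the sample sentence, are made up for illustration. The sentence shows that the double quotes and the comma around the URL are left in place, because the last character class in the regex refuses to end a match on them.

```python
import re

# Triple-quoted raw string: both ' and " appear inside without any escaping.
URL_RE = re.compile(
    r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
)

text = 'Read "http://example.com/page", then go.'
cleaned = URL_RE.sub('', text)
print(cleaned)  # Read "", then go.
```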
