
Extracting URL links using regular expressions (re) - string matching - Python

I've been trying to extract URLs from a text file using the re module: any link that starts with http://, https://, or www.

The file contains plain text as well as HTML source code. The HTML part is easy because I can extract the links using BeautifulSoup, but the plain text seems to be more challenging. I found the pattern below online, and it seems to be the best implementation of URL extraction I could find. However, it fails around certain tags: it can't handle them and includes them in the URL. Any help is appreciated, because I'm not familiar with string matching at all.

Here is the pattern I'm using:

sp1=re.findall("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", str(STRING))
sp2=re.findall('www.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', str(STRING))

Examples of what it returns:

http://www.website.com/science/</span></a><o:p></o:p></span></div><div
www.website.com/library/</span></a></span></i><span
http://awebsite.com/Groups</a><div>
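
One likely cause of these results is the character class `[$-_@.&+]` in the pattern above. Inside a class, `$-_` is read as a range from `$` (0x24) to `_` (0x5F), and that range happens to include `<`, `>`, `/` and `:`, so tags are consumed as part of the URL. The following is a minimal sketch that reproduces the third example; the `sample` string and variable names are made up for illustration.

```python
import re

# Characters covered by the range `$-_` (ASCII 0x24 through 0x5F).
range_chars = {chr(c) for c in range(ord("$"), ord("_") + 1)}
print("<" in range_chars, ">" in range_chars)  # both are inside the range

# The http pattern from the question, unchanged apart from the raw-string prefix.
original = (
    r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]"
    r"|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
)

# Hypothetical input: a URL immediately followed by markup, then plain text.
sample = "http://awebsite.com/Groups</a><div> more text"
matches = re.findall(original, sample)
print(matches)  # the match only stops at the space, so the tags are included
```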
re.findall(r'https?://[^\s<>"]+|www\.[^\s<>"]+', str(STRING))

The [^\s<>"]+ part matches any character that is not whitespace, a double quote, or an angle bracket, which avoids capturing the surrounding markup in strings like:

<a href="http://www.example.com/stuff">
http://www.example.com/stuff</br>
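
As a rough illustration, the pattern can be wrapped in a small helper and run against text that mixes prose and markup. The names `URL_PATTERN` and `extract_urls` and the sample string are assumptions made for this sketch, not part of the original answer.

```python
import re

# Same pattern as above, compiled once so it can be reused.
URL_PATTERN = re.compile(r'https?://[^\s<>"]+|www\.[^\s<>"]+')


def extract_urls(text):
    """Return URL-like substrings in the order they appear in text."""
    return URL_PATTERN.findall(text)


# Hypothetical input modelled on the examples in the question.
sample = (
    'see http://www.website.com/science/</span></a><o:p></o:p> and '
    'www.website.com/library/</span> plus '
    '<a href="https://awebsite.com/Groups">x</a>'
)

urls = extract_urls(sample)
print(urls)
```

Each match now stops at the first `<`, `"` or whitespace character. Note that this pattern still keeps trailing punctuation, so a URL at the end of a sentence would retain its final period; stripping such characters afterwards may be needed depending on the input.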


 