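My problem is that I want to match URLs in HTML code that look like this: href='example.com' (or the same thing with " as the quote character), but I only want to extract the actual URL. I tried matching the whole attribute and then using some array magic to keep just the URL, but because regex matching is greedy, whenever there is more than one valid match I also get an extra match that starts at the ' of one URL and ends at the ' of another. What regex would fit my needs?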
I would recommend NOT using regex to parse HTML. Your life will be much easier if you use something like BeautifulSoup!
It's as easy as this:
from bs4 import BeautifulSoup  # BeautifulSoup 4 (package name: beautifulsoup4)
HTML = """<a href="https://firstwebsite.com">firstone</a><a href="https://secondwebsite.com">Ihaveurls</a>"""
s = BeautifulSoup(HTML, "html.parser")
# href=True keeps only the <a> tags that actually carry an href attribute
for a in s.find_all('a', href=True):
    print("My URL: ", a['href'])
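If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's html.parser module instead of BeautifulSoup. This is only a minimal sketch: HrefCollector and extract_hrefs are hypothetical names made up for illustration, and the sample markup is invented.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # removed the surrounding ' or " from each value.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return every href found on an <a> tag, in document order."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Invented sample that mixes both quote styles.
sample = """<a href='example.com'>one</a><a href="https://secondwebsite.com">two</a>"""
print(extract_hrefs(sample))  # ['example.com', 'https://secondwebsite.com']
```

Because the parser handles the quoting itself, it makes no difference whether the attribute was written with ' or ".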
If you want to solve it with a regular expression instead of another Python library, here is a solution.
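import re
html = '<a href="https://www.abcde.com"></a>'
# [^"]* and [^']* stop at the first closing quote, so a match cannot run on into the next link
pattern = r'href="([^"]*)"|href=\'([^\']*)\''
multiple_match_links = re.findall(pattern, html)
if len(multiple_match_links) == 0:
    print("No Link Found")
else:
    # each match is a (double_quoted, single_quoted) tuple; at most one side is non-empty
    for double_quoted, single_quoted in multiple_match_links:
        print(double_quoted or single_quoted)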
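The greedy behaviour described in the question is easy to reproduce, and a lazy quantifier with a backreference is one way around it. The sketch below is an illustration, not a complete HTML-aware solution: the sample markup, HREF_RE and find_hrefs are all made up for this example, and it will still be fooled by things like commented-out markup.

```python
import re

# Invented sample with three links and both quote styles.
html = (
    """<a href='example.com'>one</a> """
    """<a href="https://secondwebsite.com">two</a> """
    """<a href='third.org'>three</a>"""
)

# Greedy: .* runs to the LAST ' in the string, so one match swallows
# everything between the first opening quote and the final closing quote.
greedy = re.findall(r"href='(.*)'", html)
print(len(greedy))  # 1

# Lazy: (["']) captures whichever quote opened the value, .*? expands as
# little as possible, and \1 requires the same quote character to close it.
HREF_RE = re.compile(r"""href\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)


def find_hrefs(text):
    """Return the value of every quoted href attribute in text."""
    return [m.group(2) for m in HREF_RE.finditer(text)]


print(find_hrefs(html))  # ['example.com', 'https://secondwebsite.com', 'third.org']
```

Using a single pattern with a backreference avoids the two-alternative tuple that has to be filtered afterwards, at the cost of a slightly denser regex.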