Using Regex in Python to match the HTML greater-than character

Question

I am trying to use re.compile to match a value on a web page.

My web page contains the following HTML:

<div id="paginate">
&nbsp;<strong>1</strong>
&nbsp;<a href="http://www.link2.com/">2</a>
&nbsp;<a href="http://www.link3.com/">3</a>
&nbsp;<a href="http://www.link2.com">&gt;</a>
&nbsp;&nbsp;<a href="http://www.link20.com/">Last &rsaquo;</a>
</div>

My regular expression is as follows:

re.compile('<a href="(.+?)">&gt;</a>').findall(html)

This returns:

['http://www.link2.com/">2</a>
&nbsp;<a href="http://www.link3.com">3</a>
&nbsp;<a href="http://www.link2.com/']

I only want the href of the link whose label is the greater-than sign.

Any ideas?

Thanks in advance.

Answer 1

Just use re.findall():

>>> re.findall('<a href="(.+?)">&gt;</a>', html)
['http://www.link4.com']
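
A note on why the question's pattern can return too much: `.+?` is lazy, but a regex match always begins at the leftmost position where it can succeed. If the markup arrives on a single line (an assumption here, since `.` does not cross newlines by default), the match starts at the first `<a href="` and keeps growing until `">&gt;</a>` finally appears. A minimal sketch of one way around it is to restrict the group to `[^"]+` so it cannot run past the closing quote of the href. The `html` string below is the question's markup joined onto one line, and the names `loose` and `strict` are only illustrative:

```python
import re

# Markup from the question, joined onto one line (an assumption about
# how the real page was delivered).
html = (
    '<div id="paginate">'
    '&nbsp;<strong>1</strong>'
    '&nbsp;<a href="http://www.link2.com/">2</a>'
    '&nbsp;<a href="http://www.link3.com/">3</a>'
    '&nbsp;<a href="http://www.link2.com">&gt;</a>'
    '&nbsp;&nbsp;<a href="http://www.link20.com/">Last &rsaquo;</a>'
    '</div>'
)

# Original pattern: the match starts at the first '<a href="' and the lazy
# group has to keep growing until '">&gt;</a>' shows up.
loose = re.findall(r'<a href="(.+?)">&gt;</a>', html)

# Negated character class: the group can never cross a double quote,
# so it stays inside a single href value.
strict = re.findall(r'<a href="([^"]+)">&gt;</a>', html)

print(loose)
print(strict)
```

With this input, `strict` holds only `http://www.link2.com`, while `loose` holds one long string that begins at the first anchor.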

Note that you should really use an HTML parser rather than regex to parse HTML. I recommend BeautifulSoup:

>>> from bs4 import BeautifulSoup as BS
>>> soup = BS(html, 'html.parser')
>>> print(soup.find('a', string='>'))
<a href="http://www.link4.com">&gt;</a>
>>> print(soup.find('a', string='>')['href'])
http://www.link4.com
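
If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's `html.parser`. This is a stand-in for the BeautifulSoup call above, not an equivalent of it. The class name `LinkByLabel` and its attributes are hypothetical, and the `html` string is the markup from the question:

```python
from html.parser import HTMLParser


class LinkByLabel(HTMLParser):
    """Collect the href of every <a> whose text equals a given label."""

    def __init__(self, label):
        # convert_charrefs=True means "&gt;" reaches handle_data as ">".
        super().__init__(convert_charrefs=True)
        self.label = label
        self.hrefs = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip() == self.label:
                self.hrefs.append(self._href)
            self._href = None


# Markup from the question.
html = """<div id="paginate">
&nbsp;<strong>1</strong>
&nbsp;<a href="http://www.link2.com/">2</a>
&nbsp;<a href="http://www.link3.com/">3</a>
&nbsp;<a href="http://www.link2.com">&gt;</a>
&nbsp;&nbsp;<a href="http://www.link20.com/">Last &rsaquo;</a>
</div>"""

parser = LinkByLabel(">")
parser.feed(html)
parser.close()  # flush any buffered text
print(parser.hrefs)
```

Comparing the decoded text rather than the raw `&gt;` keeps the check independent of how the page happens to escape the character.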

Answer 1
2 2013-10-15 09:40:19
