Extracting li elements with BeautifulSoup based on a string they contain

Question

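I have been trying to use BeautifulSoup to retrieve every <li> element that contains the word Ottawa, in any capitalization. The problem is that Ottawa never sits inside a tag of its own, such as a <p>. So I only want to print the li elements that contain Ottawa.

The HTML looks like this: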

<html>
<body>
<blockquote>
<ul><li><a href="http://link.com"><b>name</b></a>
(National: Ottawa, ON)
<blockquote> some description </blockquote></li>
<li><a href="http://link2.com"><b>name</b></a>
(National: Vancouver, BC)
<blockquote> some description </blockquote></li>
<li><a href="http://link3.com"><b>name</b></a>
(Local: Ottawa, ON)
<blockquote> some description </blockquote></li>
</ul>
</blockquote>
</body>
</html>

My code is as follows:

from bs4 import BeautifulSoup
import re
import urllib2,sys

url = "http://www.charityvillage.ca/cv/nonpr/nonpr1.html"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)

re1='.*?'
re2='(Ottawa)'
ottawa = soup.findAll(text=re.compile(re1+re2,re.IGNORECASE|re.DOTALL))
search = soup.findAll('li')

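The code above finds Ottawa correctly, and the second call does find li elements, but it gives me every single one on the page.

I understand that the two calls are currently not connected. Combining them as search = soup.findAll('li', text=re.compile(re1+re2,re.IGNORECASE|re.DOTALL)) just results in [].

My end goal is basically to get every <li> element that contains Ottawa and to have the whole <li> element available, with its name, description, link, and so on.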

Answer 1

Use the text attribute to filter the results of findAll:

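# elem.text is already a text string; wrapping it in str() can raise
# UnicodeEncodeError on Python 2 when the page contains non-ASCII characters.
elems = [elem for elem in soup.findAll('li') if 'Ottawa' in elem.text]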
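To make the filter above concrete, here is a minimal sketch that runs it against the sample HTML from the question instead of the live page. It assumes Python 3 and bs4 with the built-in "html.parser" backend; the variable names (HTML, strings, matches, hrefs) are only illustrative. It also hints at why attaching the text filter to the 'li' lookup came back empty: as far as I know, that filter is compared against the tag's .string, which is None as soon as a tag has more than one child.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """<html>
<body>
<blockquote>
<ul><li><a href="http://link.com"><b>name</b></a>
(National: Ottawa, ON)
<blockquote> some description </blockquote></li>
<li><a href="http://link2.com"><b>name</b></a>
(National: Vancouver, BC)
<blockquote> some description </blockquote></li>
<li><a href="http://link3.com"><b>name</b></a>
(Local: Ottawa, ON)
<blockquote> some description </blockquote></li>
</ul>
</blockquote>
</body>
</html>"""

soup = BeautifulSoup(HTML, "html.parser")

# Each <li> has several children (link, bare text, blockquote),
# so its .string is None.
strings = [li.string for li in soup.find_all("li")]

# get_text() joins all nested text, so a substring test can see "Ottawa".
# Lower-casing both sides makes the test case-insensitive.
matches = [li for li in soup.find_all("li")
           if "ottawa" in li.get_text().lower()]

# Each match is the whole <li> tag, so its children stay reachable.
hrefs = [li.a["href"] for li in matches]
print(hrefs)  # ['http://link.com', 'http://link3.com']
```

Because every entry in matches is the full <li> tag, the link, name, and description can all be read from it afterwards.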

Answer 2

from bs4 import BeautifulSoup
import re
import urllib2,sys

url = "http://www.charityvillage.ca/cv/nonpr/nonpr1.html"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)

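# Match the bare text node that carries the location, e.g. "(National: Ottawa, ON)".
for item in soup.find_all(text=re.compile(r'\(.+: Ottawa', re.IGNORECASE)):
    # The link is the nearest preceding sibling tag with an href attribute
    # (has_attr replaces the deprecated Tag.has_key).
    link = item.find_previous_sibling(lambda tag: tag.has_attr('href'))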
    if link is None:
        continue
    print(u'{} [{}]: {}'.format(link.text,
                               item.strip(),
                               link['href']).encode('utf8'))
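A variation on the same idea, sketched under the assumption of Python 3 and a bs4 version that accepts the string= keyword: start from the matching text node, walk up to the enclosing <li> with find_parent, and collect the fields the question asks for. The two-item HTML sample and the records list with its dictionary keys are hypothetical, chosen only for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical two-item sample in the same shape as the question's markup.
HTML = """<ul><li><a href="http://link.com"><b>name</b></a>
(National: Ottawa, ON)
<blockquote> some description </blockquote></li>
<li><a href="http://link2.com"><b>name</b></a>
(National: Vancouver, BC)
<blockquote> some description </blockquote></li>
</ul>"""

soup = BeautifulSoup(HTML, "html.parser")

records = []
# Find text nodes such as "(National: Ottawa, ON)".
for text in soup.find_all(string=re.compile(r"\(.+: Ottawa", re.IGNORECASE)):
    li = text.find_parent("li")  # the whole list item around the match
    if li is None or li.a is None:
        continue
    description = li.find("blockquote")
    records.append({
        "name": li.a.get_text(strip=True),
        "location": text.strip(),
        "link": li.a.get("href"),
        "description": description.get_text(strip=True) if description else "",
    })

print(records)
```

Collecting plain dictionaries keeps the output step separate from the parsing step, so the same records could be printed, written to CSV, or filtered further.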

Answer 1: score 3, posted 2012-05-03 20:25:55

Answer 2: score 2, accepted, posted 2012-05-04 12:15:41
