How to extract link text, the link, the text after the link, and the text after a br with Python
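I have parsed the following string with BeautifulSoup to extract data from it, but I am unable to extract some of the data. I have tried different approaches. So far I have managed to get the text inside each <a> tag, the link itself, and the text that follows each link.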
<html>
<body>
<p align="left">
<font face="Arial, Helvetica, sans-serif" size="2">
<b>
<font size="4">
GOVERNOR:
</font>
</b>
<br/>
</font>
<font face="Arial, Helvetica, sans-serif" size="2">
<a href="http://governor.alabama.gov/">
<strong>
Robert
Bentley (R)*
</strong>
</a>
- Ex-Morgan County Commissioner & State Correctional Officer
<strong>
<br/>
<a href="http://www.facebook.com/stacy.george.3139">
Stacy George
(R)
</a>
- Ex-Morgan County Commissioner & State Correctional Officer
<br/>
Bob Starkey (R) - Retired Businessman, '10 State Rep. Candidate & '12 Scottsboro Mayor Candidate
<br/>
<a href="http://www.bassforbama.com/">
Kevin Bass (D)
</a>
- Businessman & Ex-Pro Baseball Player
<br/>
<a href="http://www.parkergriffithforcongress.com/">
Parker Griffith
(D)
</a>
- Ex-Congressman, Ex-State Sen., Physician & Ex-Republican
</strong>
</font>
</p>
</body>
</html>
This is my implementation with BeautifulSoup:
from bs4 import BeautifulSoup

# Above_String holds the HTML shown above
soup = BeautifulSoup(Above_String, "html.parser")
"""for br in soup.find_all("br"):
    print(br)
    #print(br.nextSibling.content)
"""
for link in soup.find_all("a"):
    if link.string is None:
        # The link text is wrapped in a nested <strong> tag
        print(link.strong.string, link.get("href"), link.next_sibling)
    else:
        print(link.string, link.get("href"), link.next_sibling, link.next_sibling)
The code above prints the following:
> Robert
Bentley (R)*
http://governor.alabama.gov/
> Stacy George
(R)
http://www.facebook.com/stacy.george.3139
- Ex-Morgan County Commissioner & State Correctional Officer
> Kevin Bass (D)
http://www.bassforbama.com/
- Businessman & Ex-Pro Baseball Player
> Parker Griffith
(D)
http://www.parkergriffithforcongress.com/
- Ex-Congressman, Ex-State Sen., Physician & Ex-Republican
The third item is missing:
Bob Starkey (R) - Retired Businessman, '10 State Rep. Candidate & '12 Scottsboro Mayor Candidate
How can I solve this with BeautifulSoup? I tried using find_all("br"), but that does not work because the br tags give me NoneType.
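Grab all the text nodes that follow each link:
from itertools import takewhile
from bs4 import NavigableString

# True until the next <a> or <strong>; the None default covers nodes without a tag name
not_link = lambda t: getattr(t, 'name', None) not in ('a', 'strong')

for link in soup.find_all("a"):
    print('Link contents:')
    text = link.text.strip()
    # Walk the siblings after the link and stop at the next link or <strong>
    for sibling in takewhile(not_link, link.next_siblings):
        if isinstance(sibling, NavigableString):
            text += str(sibling).strip()
        else:
            text += sibling.text.strip()
    print(text)
This prints: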
Link contents:
Robert
Bentley (R)*- Ex-Morgan County Commissioner & State Correctional Officer
Link contents:
Stacy George
(R)- Ex-Morgan County Commissioner & State Correctional OfficerBob Starkey (R) - Retired Businessman, '10 State Rep. Candidate & '12 Scottsboro Mayor Candidate
Link contents:
Kevin Bass (D)- Businessman & Ex-Pro Baseball Player
Link contents:
Parker Griffith
(D)- Ex-Congressman, Ex-State Sen., Physician & Ex-Republican
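Note that in the output above the unlinked Bob Starkey line is folded into the Stacy George entry, because the loop only stops at the next <a> or <strong>. If one record per candidate is wanted, one possible approach is to treat every <br> as a separator instead. The following is only a sketch under some assumptions, not part of the original answer: split_entries is a hypothetical helper name, the markup is a trimmed copy of the sample from the question, and it assumes Python 3 with the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Trimmed copy of the markup from the question (assumption: same structure).
HTML = """
<p><font size="2"><b><font size="4">GOVERNOR:</font></b><br/></font>
<font size="2">
<a href="http://governor.alabama.gov/"><strong>Robert
Bentley (R)*</strong></a>
- Ex-Morgan County Commissioner &amp; State Correctional Officer
<strong>
<br/>
<a href="http://www.facebook.com/stacy.george.3139">Stacy George
(R)</a>
- Ex-Morgan County Commissioner &amp; State Correctional Officer
<br/>
Bob Starkey (R) - Retired Businessman, '10 State Rep. Candidate &amp; '12 Scottsboro Mayor Candidate
<br/>
<a href="http://www.bassforbama.com/">Kevin Bass (D)</a>
- Businessman &amp; Ex-Pro Baseball Player
</strong>
</font></p>
"""


def split_entries(container):
    """Hypothetical helper: one (text, href) pair per <br>-separated run."""
    entries = []
    words, href = [], None

    def flush():
        # Close the current run, skipping runs that hold only whitespace.
        nonlocal words, href
        if words:
            entries.append((" ".join(words), href))
        words, href = [], None

    # .descendants yields tags and text nodes in document order,
    # so nesting inside <strong> or <a> does not matter here.
    for node in container.descendants:
        if isinstance(node, Tag):
            if node.name == "br":
                flush()
            elif node.name == "a":
                href = node.get("href")
        elif isinstance(node, NavigableString):
            # Collapse newlines and repeated spaces into single words.
            words.extend(node.split())
    flush()
    return entries


soup = BeautifulSoup(HTML, "html.parser")
# The <font> that holds the first link also holds every candidate line.
container = soup.find("a").find_parent("font")
entries = split_entries(container)
for text, href in entries:
    print(text, "|", href)
```

Entries without a link, such as the Bob Starkey line, come back with None in place of the URL, so the caller can decide how to handle them.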