Get contents by class names using Beautiful Soup
Using the Beautiful Soup module, how can I get the data of the div tag whose class name is feeditemcontent cxfeeditemcontent? Is it:
soup.class['feeditemcontent cxfeeditemcontent']
or:
soup.find_all('class')
Here is the HTML source:
<div class="feeditemcontent cxfeeditemcontent">
    <div class="feeditembodyandfooter">
        <div class="feeditembody">
            <span>The actual data is some where here</span>
        </div>
    </div>
</div>
Here is the Python code:
from BeautifulSoup import BeautifulSoup
html_doc = open('home.jsp.html', 'r')
soup = BeautifulSoup(html_doc)
class="feeditemcontent cxfeeditemcontent"
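Beautiful Soup 4 treats the value of the "class" attribute as a list rather than a string, which means jadkik94's solution can be simplified:
from bs4 import BeautifulSoup
def match_class(target):
    # target is a list of class names that must all be present on the tag
    def do_match(tag):
        classes = tag.get('class', [])
        return all(c in classes for c in target)
    return do_match
soup = BeautifulSoup(html, 'html.parser')  # html is the markup string to parse
print(soup.find_all(match_class(["feeditemcontent", "cxfeeditemcontent"])))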
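As a possible alternative to a custom matcher, a CSS selector can express "a div carrying both classes". The following is only a sketch: it assumes Beautiful Soup 4 with its select() support and the built-in "html.parser", and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

html = """<div class="feeditemcontent cxfeeditemcontent">
  <div class="feeditembodyandfooter">
    <div class="feeditembody">
      <span>The actual data is some where here</span>
    </div>
  </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# In bs4, "class" is a multi-valued attribute, so it is parsed into a list.
outer = soup.find("div")
class_list = outer.get("class")

# CSS selector: a div that has both classes, in any order.
matches = soup.select("div.feeditemcontent.cxfeeditemcontent")
span_text = matches[0].find("span").get_text(strip=True)

print(class_list)
print(span_text)
```

The selector form does not depend on the order in which the classes appear in the attribute, which is the same guarantee match_class gives.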
Try this. Maybe it is overkill for such a simple thing, but it works:
def match_class(target):
    target = target.split()
    def do_match(tag):
        # Beautiful Soup 3 stores attrs as a list of (name, value) pairs
        try:
            classes = dict(tag.attrs)["class"]
        except KeyError:
            classes = ""
        classes = classes.split()
        return all(c in classes for c in target)
    return do_match
html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>"""
# Beautiful Soup 3 import (Python 2 only)
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)
matches = soup.findAll(match_class("feeditemcontent cxfeeditemcontent"))
for m in matches:
    print m
    print "-"*10
matches = soup.findAll(match_class("feeditembody"))
for m in matches:
    print m
    print "-"*10
soup.findAll("div", class_="feeditemcontent cxfeeditemcontent")
So, if I want to get all div tags with the class header (<div class="header">) from stackoverflow.com, an example with BeautifulSoup would be:
from bs4 import BeautifulSoup as bs
import requests
url = "http://stackoverflow.com/"
html = requests.get(url).text
soup = bs(html)
tags = soup.findAll("div", class_="header")
This is already covered in the bs4 documentation.
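One caveat about class_ that is easy to miss: a single class name matches any tag that carries that class, while a string containing several class names is compared against the exact attribute value, so the order matters. A small sketch of that behaviour, using a made-up HTML snippet and assuming bs4 with "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = (
    '<div class="header wide">A</div>'
    '<div class="wide header">B</div>'
    '<div class="footer">C</div>'
)
soup = BeautifulSoup(html, "html.parser")

# A single class name matches every div that has "header" among its classes.
single = [d.get_text() for d in soup.find_all("div", class_="header")]

# A multi-class string only matches the exact attribute value "header wide".
exact = [d.get_text() for d in soup.find_all("div", class_="header wide")]

print(single)
print(exact)
```

If the order of the classes in the page is not guaranteed, a function matcher or a CSS selector is probably the safer choice.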
from BeautifulSoup import BeautifulSoup
f = open('a.htm')
soup = BeautifulSoup(f)
# renamed from "list" to avoid shadowing the built-in
divs = soup.findAll('div', attrs={'id':'abc def'})
print divs
soup.find("div", {"class" : "feeditemcontent cxfeeditemcontent"})
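Check this bug report: https://bugs.launchpad.net/beautifulsoup/+bug/410304
As you can see, Beautiful Soup can't really understand class="a b" as two classes a and b.
However, as the first comment there suggests, a simple regular expression is enough. In your case:
import re
soup = BeautifulSoup(html_doc)
for x in soup.findAll("div", {"class": re.compile(r"\bfeeditemcontent\b")}):
    print "result: ", x
Note: this has been fixed in a recent beta. I haven't gone through the docs of the recent versions; maybe you could do that. Or, if you want to make it work with the older version, you can use the above.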
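For reference, the same regular expression can also be tried against Beautiful Soup 4, where it is applied to the individual class values. This is a sketch with a hypothetical two-div snippet, assuming bs4 and "html.parser"; it is not taken from the bug report.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: only the first div has the class "feeditemcontent".
html = (
    '<div class="feeditemcontent cxfeeditemcontent"><span>data</span></div>'
    '<div class="cxfeeditemcontent">other</div>'
)
soup = BeautifulSoup(html, "html.parser")

# \b keeps the pattern from matching inside "cxfeeditemcontent".
pattern = re.compile(r"\bfeeditemcontent\b")
found = soup.find_all("div", class_=pattern)

for tag in found:
    print("result:", tag.get_text())
```

Note that \b treats a hyphen as a word boundary, so a class such as feeditemcontent-x would also match; anchoring with ^ and $ is one way to tighten the pattern.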