Getting content by class name with Beautiful Soup

Question

Using the Beautiful Soup module, how can I get the data of the div tag whose class name is feeditemcontent cxfeeditemcontent? Is it:

soup.class['feeditemcontent cxfeeditemcontent']

or:

soup.find_all('class')

This is the HTML source:

<div class="feeditemcontent cxfeeditemcontent">
    <div class="feeditembodyandfooter">
         <div class="feeditembody">
         <span>The actual data is some where here</span>
         </div>
     </div>
 </div>

This is the Python code:

 from BeautifulSoup import BeautifulSoup
 html_doc = open('home.jsp.html', 'r')

 soup = BeautifulSoup(html_doc)
 class="feeditemcontent cxfeeditemcontent"

Answer 1

Beautiful Soup 4 treats the value of the "class" attribute as a list rather than a string, which means jadkik94's solution can be simplified:

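from bs4 import BeautifulSoup

def match_class(target):
    # Build a filter that accepts tags carrying every class in target.
    def do_match(tag):
        classes = tag.get('class', [])
        return all(c in classes for c in target)
    return do_match

# html is assumed to hold the HTML source from the question.
soup = BeautifulSoup(html, "html.parser")
print(soup.find_all(match_class(["feeditemcontent", "cxfeeditemcontent"])))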
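
As a quick check of the list behaviour described in this answer, here is a minimal Python 3 sketch. It assumes the bs4 package is installed and uses the standard library's html.parser backend; the variable names html, outer, classes and text are illustrative only.

```python
from bs4 import BeautifulSoup

html = """<div class="feeditemcontent cxfeeditemcontent">
    <div class="feeditembodyandfooter">
        <div class="feeditembody">
            <span>The actual data is some where here</span>
        </div>
    </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Searching for a single class token matches any tag that carries it.
outer = soup.find("div", class_="feeditemcontent")

# In bs4, a multi-valued attribute such as class is returned as a list.
classes = outer["class"]

# get_text() gathers the text of all descendants; strip=True trims whitespace.
text = outer.get_text(strip=True)

print(classes)
print(text)
```

Because class is a list here, membership tests such as "feeditemcontent" in outer["class"] work without splitting the attribute by hand.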

Answer 2

Try this. It may be overkill for something this simple, but it works:

def match_class(target):
    target = target.split()
    def do_match(tag):
        try:
            classes = dict(tag.attrs)["class"]
        except KeyError:
            classes = ""
        classes = classes.split()
        return all(c in classes for c in target)
    return do_match

html = """<div class="feeditemcontent cxfeeditemcontent">
<div class="feeditembodyandfooter">
<div class="feeditembody">
<span>The actual data is some where here</span>
</div>
</div>
</div>"""

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)

matches = soup.findAll(match_class("feeditemcontent cxfeeditemcontent"))
for m in matches:
    print m
    print "-"*10

matches = soup.findAll(match_class("feeditembody"))
for m in matches:
    print m
    print "-"*10

Answer 3

soup.findAll("div", class_="feeditemcontent cxfeeditemcontent")

So, if I wanted to get all the div tags with the class header (<div class="header">) from stackoverflow.com, an example with BeautifulSoup would be:

from bs4 import BeautifulSoup as bs
import requests 

url = "http://stackoverflow.com/"
html = requests.get(url).text
soup = bs(html)

tags = soup.findAll("div", class_="header")

This is already covered in the bs4 documentation.
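
The bs4 documentation also notes that a multi-class string passed to class_ is compared against the exact attribute value, so the order of the class names matters. The sketch below assumes bs4 is installed and uses a made-up one-line HTML snippet; it contrasts that behaviour with a CSS selector passed to select(), which matches both classes regardless of order.

```python
from bs4 import BeautifulSoup

html = '<div class="feeditemcontent cxfeeditemcontent"><span>data</span></div>'
soup = BeautifulSoup(html, "html.parser")

# Same order as in the markup: compared against the full attribute string.
exact = soup.find_all("div", class_="feeditemcontent cxfeeditemcontent")

# Same classes in a different order: the string comparison no longer matches.
swapped = soup.find_all("div", class_="cxfeeditemcontent feeditemcontent")

# A CSS selector requires both classes but ignores their order.
by_css = soup.select("div.cxfeeditemcontent.feeditemcontent")

print(len(exact), len(swapped), len(by_css))
```

If the order of class names in the page is not guaranteed, the selector form is probably the safer of the two.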

Answer 4

from BeautifulSoup import BeautifulSoup 
f = open('a.htm')
soup = BeautifulSoup(f) 
list = soup.findAll('div', attrs={'id':'abc def'})
print list

Answer 5

soup.find("div", {"class" : "feeditemcontent cxfeeditemcontent"})

Answer 6

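See this bug report: https://bugs.launchpad.net/beautifulsoup/+bug/410304

As you can see, Beautiful Soup does not really understand class="a b" as two classes, a and b.

However, as the first comment there points out, a simple regular expression is enough. In your case: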

import re

soup = BeautifulSoup(html_doc)
for x in soup.findAll("div",{"class":re.compile(r"\bfeeditemcontent\b")}):
    print "result: ",x

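Note: this has been fixed in a recent beta release. I have not read through the documentation for the recent versions, so you may want to check it. If you need this to work with an older version, you can use the approach above.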
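
One caveat with the word-boundary pattern, illustrated here with only the standard re module on plain class strings: \b also matches next to a hyphen, so a hypothetical class such as feeditemcontent-old would be picked up too. A whitespace-delimited pattern is one possible stricter alternative. The names loose and strict and the sample class strings are assumptions made for illustration, not part of the original answer.

```python
import re

# Pattern from the answer: word boundaries on both sides.
loose = re.compile(r"\bfeeditemcontent\b")

# Stricter sketch: the name must be delimited by whitespace or string ends.
strict = re.compile(r"(?:^|\s)feeditemcontent(?:\s|$)")

samples = [
    "feeditemcontent cxfeeditemcontent",  # class value from the question
    "cxfeeditemcontent",                  # different class sharing a suffix
    "feeditemcontent-old",                # hypothetical hyphenated class
]

for value in samples:
    print(value, bool(loose.search(value)), bool(strict.search(value)))
```

Both patterns accept the first sample and reject the second; only the stricter one rejects the hyphenated name.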

6 answers

Answer 1: 22, 2012-07-05 14:22:08
Answer 2: 10, accepted, 2012-07-04 15:16:49
Answer 3: 6, 2014-07-24 05:29:55
Answer 4: 4
Answer 5: 3, 2012-07-04 14:55:52
Answer 6: 0, 2012-07-04 14:56:05
