
How to get objects from div with BeautifulSoup in Python?

I'm not very familiar with BeautifulSoup. I have HTML like this (it's only part of the page):

<div class="central-featured-lang lang1" lang="en">
<a class="link-box" href="//en.wikibooks.org/">
<strong>English</strong><br>
<em>Open-content textbooks</em><br>
<small>51 000+ pages</small></a>
</div>

The output should look like this (and likewise for the other languages):

English: 51 000+ pages.

I tried something like:

for item in soup.find_all('div'):
    print(item.get('class'))

But this does not work. Can you help me, or at least point me toward a solution?

item.get() returns attribute values, not the text contained in an element.

You can get the text directly contained in an element with the Element.string attribute, or all contained text (recursively) with the Element.get_text() method.
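To make the difference concrete, here is a small sketch that applies the three accessors to the markup from the question. It assumes the bs4 package is installed and uses the standard-library 'html.parser' backend; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html = '''\
<div class="central-featured-lang lang1" lang="en">
<a class="link-box" href="//en.wikibooks.org/">
<strong>English</strong><br>
<em>Open-content textbooks</em><br>
<small>51 000+ pages</small></a>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div')

# .get() reads attributes; class is multi-valued, so it comes back as a list
class_value = div.get('class')
lang_value = div.get('lang')

# .string works when an element has exactly one string child
language = div.strong.string

# .string is None when an element has several children, as this div does
div_string = div.string

# .get_text() collects all nested text; strip=True drops whitespace-only pieces
all_text = div.get_text(' ', strip=True)

print(class_value, lang_value, language, div_string, all_text)
```

This is why printing item.get('class') in the question shows class names rather than the language or page count.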

Here, I'd search for div elements with a lang attribute, then use the contained elements to find strings:

for item in soup.find_all('div', lang=True):
    if not (item.strong and item.small):
        continue
    language = item.strong.string
    pages = item.small.string
    print('{}: {}'.format(language, pages))

Demo:

>>> from bs4 import BeautifulSoup
>>> sample = '''\
... <div class="central-featured-lang lang1" lang="en">
... <a class="link-box" href="//en.wikibooks.org/">
... <strong>English</strong><br>
... <em>Open-content textbooks</em><br>
... <small>51 000+ pages</small></a>
... </div>
... '''
>>> soup = BeautifulSoup(sample, 'html.parser')
>>> for item in soup.find_all('div', lang=True):
...     if not (item.strong and item.small):
...         continue
...     language = item.strong.string
...     pages = item.small.string
...     print('{}: {}'.format(language, pages))
... 
English: 51 000+ pages
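If you want the results as a list rather than printed output, the same loop can be wrapped in a function. This is only a sketch: summarize_languages is a hypothetical helper name, and the second and third div blocks in the sample are made-up placeholders (not from the real page) added to show which elements get skipped.

```python
from bs4 import BeautifulSoup

def summarize_languages(html):
    """Return 'Language: page count' strings for each div that has a lang attribute."""
    soup = BeautifulSoup(html, 'html.parser')
    lines = []
    for item in soup.find_all('div', lang=True):
        strong = item.find('strong')
        small = item.find('small')
        # Skip blocks that lack either piece of information
        if strong is None or small is None:
            continue
        lines.append('{}: {}'.format(strong.get_text(strip=True),
                                     small.get_text(strip=True)))
    return lines

sample = '''\
<div class="central-featured-lang lang1" lang="en">
<a class="link-box" href="//en.wikibooks.org/">
<strong>English</strong><br>
<em>Open-content textbooks</em><br>
<small>51 000+ pages</small></a>
</div>
<div lang="xx"><em>placeholder block without strong or small</em></div>
<div class="no-lang"><strong>ignored</strong><small>ignored</small></div>
'''

result = summarize_languages(sample)
print(result)
```

The placeholder div with a lang attribute but no strong or small element is skipped by the guard, and the div without a lang attribute is never matched by find_all in the first place.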
