BeautifulSoup Scraping Span Class HTML

I am trying to scrape the text that follows <span class="..."> tags. The HTML on the pages I am scraping looks like this:

    <span class="catnum">Disc Number</span>
    "1"
    <br>
    <span class="catnum">Track Number</span>
    "1"
    <br>
    <span class="catnum">Duration</span>
    "5:28"
    <br>

What I need are the values that come after each </span> tag. I should also mention that this is part of a larger script that scrapes 1200 sites, so it has to run in a loop, and the values in the quotation marks change from page to page.

I tried this code as a test on one page:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup (open("Smith.html"), "html.parser")

    for tag in soup.findAll('span'):
        if tag.has_key('class'):
            if tag['class'] == 'catnum':
                print tag.string

I know that will print ALL the 'span class' tags and not just the three I want, but I thought I would still test it to see if it worked and I got this error:

    /Library/Python/2.7/site-packages/bs4/element.py:1527: UserWarning: has_key is deprecated. Use has_attr("class") instead.
      key))

As the message says, use tag.has_attr("class") in place of the deprecated tag.has_key("class") method.
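
A minimal sketch of the corrected loop, assuming the markup from the question (the html string and the labels list are illustrative names, not part of the original code). Besides the has_attr change, note that Beautiful Soup 4 returns the class attribute as a list such as ['catnum'], so the comparison tag['class'] == 'catnum' in the question never matches; a membership test is used here instead.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (illustrative only).
html = '''
<span class="catnum">Disc Number</span>
"1"
<br>
<span class="catnum">Track Number</span>
"1"
<br>
<span class="catnum">Duration</span>
"5:28"
<br>
'''

soup = BeautifulSoup(html, "html.parser")

labels = []
for tag in soup.find_all("span"):
    # has_attr replaces the deprecated has_key.
    # tag["class"] is a list, so test membership instead of
    # comparing it with the string "catnum".
    if tag.has_attr("class") and "catnum" in tag["class"]:
        labels.append(tag.string.strip())

print(labels)
```

This only collects the span labels, as the original test loop did; the values after each span still have to be read separately.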

Hope it helps.

Simone

You can constrain the search by attribute with {'class': 'catnum'} and by the span's text with text=re.compile('Disc Number'), then use .next_sibling to get the text that follows the span:

    from bs4 import BeautifulSoup
    import re

    s = '''
        <span class = "catnum"> Disc Number </span>
        "1"
        <br/>
        <span class = "catnum"> Track Number </span>
        "1"
        <br/>
        <span class = "catnum"> Duration </span>
        "5:28"
        <br/>'''

    soup = BeautifulSoup(s, 'html.parser')
    # Find the catnum span whose text contains "Disc Number".
    span = soup.find('span', {'class': 'catnum'}, text=re.compile(r'Disc Number'))
    # next_sibling is the text node after </span>, including its surrounding whitespace.
    print(span.next_sibling)
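
Since the question needs all three values on each of many pages, the same next_sibling idea can be wrapped in a loop. The following is a sketch under assumptions: extract_catnum_fields and the sample string are hypothetical names, and the quote stripping assumes the quotation marks may or may not be literally present in the page source.

```python
from bs4 import BeautifulSoup, NavigableString


def extract_catnum_fields(html):
    """Map each catnum label to the text that follows its span."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for span in soup.find_all("span", class_="catnum"):
        label = span.get_text(strip=True)
        value = span.next_sibling
        # next_sibling may be None or another tag; keep only text nodes.
        if isinstance(value, NavigableString):
            # Drop surrounding whitespace, then any literal quote characters.
            fields[label] = value.strip().strip('"')
    return fields


# Sample markup based on the question (illustrative only).
sample = '''
<span class="catnum">Disc Number</span>
"1"
<br>
<span class="catnum">Track Number</span>
"1"
<br>
<span class="catnum">Duration</span>
"5:28"
<br>
'''

print(extract_catnum_fields(sample))
```

To process many pages, the function could be called once per downloaded document, for example inside a loop over saved HTML files.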
