How can I get information from an <a href> tag within <div> tags with BeautifulSoup and Python?

Hi all. I have a quick question about BeautifulSoup with Python. I have several blocks of HTML that look like this (the only differences are the links and product names), and I'm trying to get the link from the "href" attribute.

<div id="productListing1" xmlns:dew="urn:Microsoft.Search.Response.Document">
<span id="rank" style="display:none;">94.36</span>
<div class="productPhoto">
    <img src="/assets/images/ocpimages/87684/00131cl.gif" height="82" width="82" />
</div>
<div class="productName">
    <a class="on" href="/Products/ProductInfoDisplay.aspx?SiteId=1&amp;Product=8768400131">CAPRI SUN - JUICE DRINK - COOLERS VARIETY PACK 6 OZ</a>
</div>
<div class="size">40 CT</div>

I currently have this Python code:

productLinks = soup.findAll('a', attrs={'class' : 'on'})
for link in productLinks:
    print(link['href'])

This works (for every link on the page I get something like /Products/ProductInfoDisplay.aspx?SiteId=1&amp;Product=8768400131 ); however, I've been trying to figure out whether there's a way to get the link in the "href" attribute without searching explicitly for 'class="on"'. I guess my first question should be whether this is the best way to find this information at all: class="on" seems too generic and likely to break in the future, although my CSS and HTML skills aren't that good. I've tried numerous combinations of the find, findAll, findAllNext, etc. methods, but I can't quite make it work. This is mostly what I had (I rearranged and changed it numerous times):

productLinks = soup.find('div', attrs={'class' : 'productName'}).find('a', href=True)

If this isn't a good way to do this, how can I get to the <a> tag from the <div class="productName"> tag? Let me know if you need more information.

Thank you.

Well, once you have the <div> element, you can get the <a> subelement by calling find():

productDivs = soup.findAll('div', attrs={'class' : 'productName'})
for div in productDivs:
    print(div.find('a')['href'])

However, since the <a> is a direct child of the <div>, you can also reach it through the div's a attribute, which returns the first <a> tag inside the div:

productDivs = soup.findAll('div', attrs={'class' : 'productName'})
for div in productDivs:
    print(div.a['href'])
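
To make the two approaches above easier to try out, here is a minimal self-contained sketch. It assumes BeautifulSoup 4 (the bs4 package) with the built-in "html.parser"; the product_hrefs helper and the second, link-less listing are made up for illustration and are not part of the question's page. It also guards against a div that has no <a>, since div.a is None in that case and indexing it would raise a TypeError.

```python
from bs4 import BeautifulSoup

# Markup modelled on the question's HTML. The second listing is a made-up
# case with no link, added only to show the None handling.
HTML = """
<div id="productListing1">
  <div class="productName">
    <a class="on" href="/Products/ProductInfoDisplay.aspx?SiteId=1&amp;Product=8768400131">CAPRI SUN - JUICE DRINK - COOLERS VARIETY PACK 6 OZ</a>
  </div>
  <div class="size">40 CT</div>
</div>
<div id="productListing2">
  <div class="productName">No link here</div>
</div>
"""


def product_hrefs(html):
    """Return the href of the first <a> inside each div.productName."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    for div in soup.find_all("div", class_="productName"):
        link = div.a  # shorthand for div.find("a"); None if there is no <a>
        if link is not None and link.has_attr("href"):
            hrefs.append(link["href"])
    return hrefs


hrefs = product_hrefs(HTML)
print(hrefs)
```

Note that with this parser the &amp; in the attribute comes back decoded as a plain &, so the value is ready to use as a URL path.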

Now, if you want to put all the <a> elements in a list, the code in your question will not work, because find() returns only the first element that matches its criteria. Instead, get the list of divs and pull the subelement out of each one, for example with a list comprehension:

productLinks = [div.a for div in
        soup.findAll('div', attrs={'class' : 'productName'})]
for link in productLinks:
    print(link['href'])
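
As an alternative to the list comprehension, a CSS selector can express "an <a> with an href directly inside a div of class productName" in one call. This is only a sketch and assumes BeautifulSoup 4, whose select() method accepts CSS selectors; the second product and the /help link are invented sample data, there to show that unrelated links are skipped.

```python
from bs4 import BeautifulSoup

# Sample markup: the first product comes from the question, the second
# product and the /help link are hypothetical.
HTML = """
<a href="/help">Help</a>
<div class="productName">
  <a class="on" href="/Products/ProductInfoDisplay.aspx?SiteId=1&amp;Product=8768400131">CAPRI SUN - JUICE DRINK - COOLERS VARIETY PACK 6 OZ</a>
</div>
<div class="productName">
  <a class="on" href="/Products/ProductInfoDisplay.aspx?SiteId=1&amp;Product=0000000000">EXAMPLE PRODUCT</a>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# "div.productName > a[href]" matches <a> tags that carry an href attribute
# and are direct children of a <div class="productName">.
product_links = soup.select("div.productName > a[href]")
hrefs = [a["href"] for a in product_links]

for href in hrefs:
    print(href)
```

This does not depend on class="on" at all, which was the original concern, but it does depend on the productName class staying in the markup.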

Here is a solution using BeautifulSoup 4:

for data in soup.find_all('div', class_='productName'):
    for a in data.find_all('a'):
        print(a.get('href'))  # the link
        print(a.text)  # the text inside the link

If you only need the first match, you can skip the for loops by indexing into the results:

data = soup.find_all('div', class_='productName')
a_class = data[0].find_all('a')
url_ = a_class[0].get('href')
print(url_)
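
One caveat with indexing: data[0] raises an IndexError when the page has no matching div, and the same goes for the inner list. A hedged sketch of a safer variant follows, assuming BeautifulSoup 4; first_product_link is a hypothetical helper name, and it uses find(), which returns None instead of raising when nothing matches.

```python
from bs4 import BeautifulSoup


def first_product_link(html):
    """Return (href, text) for the first product link, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")

    # find() returns the first match or None, so each step can be checked.
    div = soup.find("div", class_="productName")
    if div is None:
        return None

    # href=True restricts the match to <a> tags that have an href attribute.
    a = div.find("a", href=True)
    if a is None:
        return None

    return a["href"], a.get_text(strip=True)


sample = '<div class="productName"><a class="on" href="/Products/x">Some product</a></div>'
result = first_product_link(sample)
missing = first_product_link("<div class='size'>40 CT</div>")
print(result)
print(missing)
```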
