
Scraping from tags without a class using BeautifulSoup

I want to scrape the link from the href attribute of the anchor tag below, along with its text, "Horizon Zero Dawn".

The anchor tag does not have a class of its own, and there are many other anchor tags throughout the source code.

What can I do with BeautifulSoup to scrape the data I need?

<div class="prodName">
 <a href="/product.php?sku=123;name=Horizon Zero Dawn">Horizon Zero Dawn</a></div>

It doesn't matter that the anchor tag doesn't have a class of its own. By finding the parent div, and then finding an anchor inside it that has an href attribute and the expected text, we can extract the two values required:

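from bs4 import BeautifulSoup

page = '<div class="prodName"><a href="/product.php?sku=123;name=Horizon Zero Dawn">Horizon Zero Dawn</a></div>'

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page, 'html.parser')

# Locate the parent div by its class, then the anchor inside it
div = soup.find('div', {'class': 'prodName'})
# href=True matches any anchor that has an href; string= matches its text (named text= before bs4 4.4.0)
a = div.find('a', {'href': True}, string='Horizon Zero Dawn')

print(a['href'])
print(a.get_text())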

This prints:

/product.php?sku=123;name=Horizon Zero Dawn
Horizon Zero Dawn
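
As an alternative sketch, the same lookup can be written with a CSS selector, which expresses "an anchor with an href directly inside div.prodName" in one step. The helper name first_product_link is made up for illustration, and the selector assumes the anchor is a direct child of the div, as in the snippet from the question:

```python
from bs4 import BeautifulSoup

page = '<div class="prodName"><a href="/product.php?sku=123;name=Horizon Zero Dawn">Horizon Zero Dawn</a></div>'


def first_product_link(html):
    """Return (href, text) for the first anchor inside div.prodName, or None.

    Hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, 'html.parser')
    # "div.prodName > a[href]": an <a> that has an href and is a direct child of the div
    link = soup.select_one('div.prodName > a[href]')
    if link is None:
        return None
    return link['href'], link.get_text(strip=True)


result = first_product_link(page)
if result is not None:
    href, title = result
    print(href)
    print(title)
```

Returning None when nothing matches avoids the AttributeError that chained find() calls raise if the div is missing from the page.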

EDIT:

Updating after comments. If you have multiple div elements in the page, you need to loop over them and find all the a elements that exist within each, like so:

import requests
from bs4 import BeautifulSoup

url = 'https://in.webuy.com/product.php?scid=1'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Visit every product-name div, then every anchor inside it
for div in soup.find_all('div', {'class': 'prodName'}):
    for link in div.find_all('a'):
        # .get() returns None instead of raising if the anchor has no href
        print(link.get('href'))
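
Since the question asks for both the link and the title, the loop above could be extended roughly as follows. This is only a sketch: extract_products is a hypothetical helper, the sample HTML (including "Another Game" and its sku) is invented so the example runs without a network request, and the hrefs are assumed to be site-relative, which is why they are resolved with urljoin:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_products(html, base_url):
    """Collect (title, absolute_url) pairs for anchors inside div.prodName.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, 'html.parser')
    products = []
    for div in soup.find_all('div', class_='prodName'):
        # href=True skips anchors that have no href attribute
        for link in div.find_all('a', href=True):
            title = link.get_text(strip=True)
            products.append((title, urljoin(base_url, link['href'])))
    return products


# Invented sample markup; the first anchor is outside any prodName div and is ignored
sample = (
    '<a href="/about.php">About</a>'
    '<div class="prodName"><a href="/product.php?sku=123">Horizon Zero Dawn</a></div>'
    '<div class="prodName"><a href="/product.php?sku=456">Another Game</a></div>'
)

for title, url in extract_products(sample, 'https://in.webuy.com/'):
    print(title, url)
```

With a live page, response.text from the requests call would be passed in place of sample.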

