
Python requests and beautifulsoup4, collecting only the “href” links

from bs4 import BeautifulSoup
import requests

url = "https://www.brightscope.com/ratings"
headers = {'User-Agent':'Mozilla/5.0'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, "html.parser")

data = soup.find_all('li',{"class":"more-data"})+soup.findAll('li', {"class":"more-data topten"})
for item in data:
   print(item('a'))

I would like to print only the hrefs, but I cannot figure out how. I've looked at different videos and still can't get it. What am I doing wrong? I know the above code prints the whole "a" tags, but I need just the href values.

You need to use dictionary-like access to the element's attributes. Inside your loop:

[a['href'] for a in item('a')]
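
To see the attribute access in isolation, here is a minimal sketch that runs against a small inline HTML snippet rather than the live page. The markup and the paths in it are made up for illustration and are not the site's actual HTML. It also uses a.get('href'), which returns None instead of raising KeyError when an anchor has no href attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption for this sketch).
html = """
<ul>
  <li class="more-data"><a href="/ratings/a">A</a></li>
  <li class="more-data topten"><a href="/ratings/b">B</a><a name="no-href">C</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

hrefs = []
# class_="more-data" matches any li whose class list contains "more-data".
for item in soup.find_all("li", class_="more-data"):
    # item("a") is shorthand for item.find_all("a").
    for a in item("a"):
        # a["href"] would raise KeyError for the anchor without an href;
        # a.get("href") returns None in that case.
        href = a.get("href")
        if href is not None:
            hrefs.append(href)

print(hrefs)
```

If every anchor is known to carry an href, the shorter a['href'] form is enough.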

As a side note, you can simplify the way you locate the li elements. Instead of:

data = soup.find_all('li',{"class":"more-data"})+soup.findAll('li', {"class":"more-data topten"})
for item in data:
   print(item('a'))

You can do:

links = soup.select("li.more-data a")
for a in links:
    print(a["href"])

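Here, li.more-data a is a CSS selector that matches every a element inside an li element whose class list includes more-data. That also covers the li elements with class "more-data topten", so a single selector replaces both find_all calls.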
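
If the hrefs turn out to be relative, the selector approach can be combined with urljoin to get absolute URLs. This is a rough sketch under that assumption: the inline markup and its paths are hypothetical, and the a[href] attribute selector is added so that anchors without an href are skipped.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.brightscope.com/ratings"

# Hypothetical markup standing in for the real page (an assumption for this sketch).
html = """
<ul>
  <li class="more-data"><a href="/ratings/a">A</a></li>
  <li class="more-data topten"><a href="/ratings/b">B</a><a name="no-href">C</a></li>
  <li class="other"><a href="/ignored">D</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# "a[href]" keeps only anchors that actually have an href attribute.
links = [urljoin(base_url, a["href"]) for a in soup.select("li.more-data a[href]")]

for link in links:
    print(link)
```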
