Python requests and beautifulsoup4, collecting only the “href” links

Question

from bs4 import BeautifulSoup
import requests

url = "https://www.brightscope.com/ratings"
headers = {'User-Agent':'Mozilla/5.0'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, "html.parser")

data = soup.find_all('li',{"class":"more-data"})+soup.findAll('li', {"class":"more-data topten"})
for item in data:
   print(item('a'))

I would like to print only the hrefs, but I cannot figure out how. I have looked at several videos without success. What am I doing wrong? I know the code above prints the whole "a" tags, but I need just the hrefs.

Answer 1

Use dictionary-like access to the element's attributes:

[a['href'] for a in item('a')]
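
As a minimal sketch of that attribute access, the snippet below runs on a small inline HTML fragment rather than the live page. The markup and the paths in it are made up for illustration. It uses a.get("href") instead of a["href"], since a["href"] raises KeyError for an anchor that has no href attribute, while get returns None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = """
<ul>
  <li class="more-data"><a href="/ratings/a">A</a><a name="anchor-only">no link</a></li>
  <li class="more-data topten"><a href="/ratings/b">B</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

hrefs = []
for item in soup.find_all("li", class_="more-data"):
    for a in item("a"):        # item("a") is shorthand for item.find_all("a")
        href = a.get("href")   # None when the tag has no href attribute
        if href is not None:
            hrefs.append(href)

print(hrefs)
```

Anchors without an href are skipped here; whether that is the right behaviour depends on what the real page contains.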

As a side note, you can simplify the way you locate the li elements. Instead of:

data = soup.find_all('li',{"class":"more-data"})+soup.findAll('li', {"class":"more-data topten"})
for item in data:
   print(item('a'))

You can do:

links = soup.select("li.more-data a")
for a in links:
    print(a["href"])

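Here, li.more-data a is a CSS selector that matches every a element inside an li element with the more-data class. Because it tests for the single class more-data, it also matches li elements whose class attribute is "more-data topten", so one query covers both cases.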
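
The selector approach can be sketched on an inline fragment as follows. The markup and paths are hypothetical, and two details go beyond the answer: the a[href] part of the selector skips anchors that have no href, and urljoin from the standard library turns relative links into absolute ones, assuming the page URL is the right base.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.brightscope.com/ratings"

# Hypothetical markup standing in for the real page.
html = """
<ul>
  <li class="more-data"><a href="/401k/acme">Acme</a></li>
  <li class="more-data topten"><a href="https://example.com/x">X</a></li>
  <li class="more-data"><a name="no-href">skipped</a></li>
  <li class="other"><a href="/ignored">not matched</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# a[href] restricts the match to anchors that actually carry an href.
links = [urljoin(base_url, a["href"]) for a in soup.select("li.more-data a[href]")]

for link in links:
    print(link)
```

Absolute hrefs pass through urljoin unchanged, so mixing relative and absolute links in the same page is not a problem.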

solution1
2 ACCEPTED 2016-12-19 04:47:14
