How to use BeautifulSoup to scrape links in an HTML page

I need to download a few links from an HTML page, but not all of them, only the ones in a certain section of the page. For example, on http://www.nytimes.com/roomfordebate/2014/09/24/protecting-student-privacy-in-online-learning , I need the links in the Debaters section. I plan to use BeautifulSoup, and I looked at the HTML of one of the links:

<a href="/roomfordebate/2014/09/24/protecting-student-privacy-in-online-learning/student-data-collection-is-out-of-control" class="bl-bigger">Data Collection Is Out of Control</a>

Here's my code:

r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
link_set = set()
for link in soup.find_all("a", class = "bl-bigger"):
    href = link.get('href')
    if href == None:
        continue
    elif '/roomfordebate/' in href:
        link_set.add(href)    
for link in link_set:
    print link 

This code is supposed to give me all the links with the bl-bigger class, but it actually returns nothing. Could anyone figure out what's wrong with my code or how to make it work? Thanks

`class` is a reserved word in Python, so you can't pass it as a keyword argument. Pass the attributes as a dict instead:

soup.find("tagName", { "class" : "cssClass" })

or use the .select method, which executes CSS selectors:

soup.select('a.bl-bigger')

Examples are in the docs; just search for the string '.select'. Also, instead of writing the entire script up front, you'll get to working code faster by experimenting in an IPython interactive shell.
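To make the difference concrete, here is a small sketch of both forms run against a made-up HTML fragment (the fragment and its hrefs are illustrative, not taken from the NYT page):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment for illustration only
html = '''
<ul>
  <li><a href="/roomfordebate/a" class="bl-bigger">First</a></li>
  <li><a href="/other/b" class="bl-bigger">Second</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# Dict form: attributes passed as a dict, avoiding the reserved word `class`
links_dict = soup.find_all('a', {'class': 'bl-bigger'})

# CSS selector form: same result via a selector string
links_css = soup.select('a.bl-bigger')

print([a['href'] for a in links_dict])
print([a['href'] for a in links_css])
```

Both calls return the same two anchors; newer BeautifulSoup versions also accept the `class_` keyword (with a trailing underscore) as a third equivalent spelling.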

I don't see the bl-bigger class at all when I view the source in Chrome. Maybe that's why your code is not working?

Let's start by looking at the source. The whole Debaters section is inside a div with class nytint-discussion-content, so using BeautifulSoup, let's get that whole div first.

debaters_div = soup.find('div', class_="nytint-discussion-content")

Again going by the source, all the links sit inside list items (li tags), and each of those li tags has the class nytint-bylines-1. So all you have to do is find all those li tags and then find the anchor tag within each of them.

list_items = debaters_div.find_all("li", class_="nytint-bylines-1")
list_items[0].find('a')
# <a href="/roomfordebate/2014/09/24/protecting-student-privacy-in-online-learning/student-data-collection-is-out-of-control">Data Collection Is Out of Control</a>

So, your whole code can be:

import requests
from bs4 import BeautifulSoup

link_set = set()
response = requests.get(url)
html_data = response.text
soup = BeautifulSoup(html_data, 'html.parser')
debaters_div = soup.find('div', class_="nytint-discussion-content")
list_items = debaters_div.find_all("li", class_="nytint-bylines-1")

for each_item in list_items:
    html_link = each_item.find('a').get('href')
    if html_link.startswith('/roomfordebate'):
        link_set.add(html_link)

Now link_set will contain all the links you want. For the page linked in the question, it fetches 5 links.

PS: link_set contains only relative URIs, not full URLs, so I would prepend http://www.nytimes.com before adding the links to link_set. Just change the last line to:

link_set.add('http://www.nytimes.com' + html_link)
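Instead of plain string concatenation, the standard library's urljoin handles the edge cases (trailing slashes on the base, already-absolute hrefs) for you. A minimal sketch, using one of the relative paths from the question:

```python
from urllib.parse import urljoin  # urlparse.urljoin in Python 2

base = 'http://www.nytimes.com'
href = '/roomfordebate/2014/09/24/protecting-student-privacy-in-online-learning/student-data-collection-is-out-of-control'

# Resolves the relative href against the base URL
absolute = urljoin(base, href)
print(absolute)
```

With urljoin, an href that is already absolute (e.g. an external link) is returned unchanged rather than being mangled by concatenation.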
