
How to call a specific anchor tag and pass it back to the url in a Python webscraper?

I'm working on a problem for an online class, where I'm supposed to use BeautifulSoup to build a simple webscraper.

Here is my progress so far:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

count = 4
position = 3

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'

html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('a', None)
for tag in tags:
    print(tag.get('href', None))

My question is this: how do I extract a particular anchor tag from the list in tags? Also, how can I make the for loop iterate only four times?

assignment details:

Update:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

position = 3
count = 4

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')

for i in range(count):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    print(tags[position])

So I can grab the tag at a given position this way, but I need the loop to actually follow that link on each pass. As it is now, the url never changes inside the loop, so my program just prints the same link (the one at position 3) four times.

Got it!

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

position = 17  # index of the link to follow on each page
count = 7      # how many times to follow it

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')

for i in range(count):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Follow the link at the given position on the current page.
    url = soup('a')[position].get('href', None)
    print(url)

As you already know, tags = soup('a') produces quite a long list of links.

You haven't said how you want to search for one of the links. I'll assume that you're selecting by name. Then here's how to search for Montgomery.

>>> soup.find_all(string='Montgomery')
['Montgomery']

Once you've got that, you can get the link ('a') element that contains 'Montgomery' this way:

>>> soup.find_all(string='Montgomery')[0].findParent()
<a href="http://py4e-data.dr-chuck.net/known_by_Montgomery.html">Montgomery</a>

Then you can get the href attribute of that link element, which is the actual URL for Montgomery:

>>> soup.find_all(string='Montgomery')[0].findParent().attrs['href']
'http://py4e-data.dr-chuck.net/known_by_Montgomery.html'
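
Putting those pieces together, here's a minimal sketch of a helper that fetches a page and returns the URL behind a given name; the helper name link_for_name is made up for illustration, and the SSL-context workaround from the question is omitted for brevity:

import urllib.request
from bs4 import BeautifulSoup

def link_for_name(url, name):
    # Fetch and parse the page.
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Find the text node matching the name, then climb to its enclosing <a> element.
    text = soup.find(string=name)
    if text is None:
        return None
    return text.find_parent('a').get('href')

Calling something like link_for_name('http://py4e-data.dr-chuck.net/known_by_Fikret.html', 'Montgomery') should give back the Montgomery URL shown above.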

One way of going through a loop at most four times:

count = 0
for tag in tags:
    print(tag.get('href', None))  # or whatever you need to do with the tag
    count += 1
    if count >= 4:
        break
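
If you only need the first four tags, a shorter equivalent (just a sketch of the same idea) is to slice the list before looping, since tags behaves like an ordinary Python list:

for tag in tags[:4]:
    print(tag.get('href', None))  # at most the first four links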
