
How to call a specific anchor tag and pass it back to the url in a Python webscraper?

I'm working on a problem for an online class, where I'm supposed to use BeautifulSoup to build a simple webscraper.

Here is my progress so far:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

count = 4
position = 3

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'

html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")
tags = soup('a', None)
for tag in tags:
    print(tag.get('href', None))

My question is this: how do I extract a particular anchor tag from the list in tags? Also, how can I make the for loop iterate only four times?

assignment details:

Update:

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

position = 3
count = 4

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')

for i in range(count):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    tags = soup('a')
    print(tags[position])

So I can grab the tag at a given position this way, but I need the loop to actually follow that link on each pass. As it is now, the url never changes inside the loop, so my program just prints the same link (the one at position 3) four times.

Got it!

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

position = 17  # index of the link to follow on each page
count = 7      # how many times to follow it

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter - ')

for i in range(count):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Follow the link at the given position on the current page.
    url = soup('a')[position].get('href', None)
    print(url)

As you already know, tags = soup('a') produces quite a long list of links.

You haven't said how you want to search for one of the links. I'll assume that you're selecting by name. Then here's how to search for Montgomery.

>>> soup.find_all(string='Montgomery')
['Montgomery']

Once you've got that, you can get the link ('a') element that contains 'Montgomery' this way:

>>> soup.find_all(string='Montgomery')[0].findParent()
<a href="http://py4e-data.dr-chuck.net/known_by_Montgomery.html">Montgomery</a>

Then you can get the href attribute of that link element, which is the actual URL for Montgomery:

>>> soup.find_all(string='Montgomery')[0].findParent().attrs['href']
'http://py4e-data.dr-chuck.net/known_by_Montgomery.html'
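
Putting those pieces together, here's a minimal sketch of a helper that fetches a page and returns the URL behind a given name; the helper name link_for_name is made up for illustration, and the SSL-context workaround from the question is omitted for brevity:

import urllib.request
from bs4 import BeautifulSoup

def link_for_name(url, name):
    # Fetch and parse the page.
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    # Find the text node matching the name, then climb to its enclosing <a> element.
    text = soup.find(string=name)
    if text is None:
        return None
    return text.find_parent('a').get('href')

Calling something like link_for_name('http://py4e-data.dr-chuck.net/known_by_Fikret.html', 'Montgomery') should give back the Montgomery URL shown above.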

One way of going through a loop at most four times:

count = 0
for tag in tags:
    print(tag.get('href', None))  # or whatever you need to do with the tag
    count += 1
    if count >= 4:
        break
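
If you only need the first four tags, a shorter equivalent (just a sketch of the same idea) is to slice the list before looping, since tags behaves like an ordinary Python list:

for tag in tags[:4]:
    print(tag.get('href', None))  # at most the first four links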
