I am working on an assignment where I need to parse this using BeautifulSoup: http://python-data.dr-chuck.net/known_by_Fikret.html
Basically, I need to print the initial URL, find the URL at position 3, open it, find the link at position 3 on that page, and so on. This needs to take place four times in total.
This is the code I have so far:
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
#url = input('Enter - ')
url = "http://py4e-data.dr-chuck.net/known_by_Fikret.html"
timesToRepeat = '4'
positionInput = '3'
#timesToRepeat = input('Repeat how many times?: ')
#positionInput = input('Enter Position: ')
try:
    timesToRepeat = int(timesToRepeat)
    positionInput = int(positionInput)
except:
    print("please add an number")
    quit()
# Retrieve all of the anchor tags
totalCount = 0
currentRepetitionCount = 0
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
#Leave this all alone ^^^^
print("Retrieving: ",url)
for i in range(timesToRepeat):
    html = urllib.request.urlopen(url, context=ctx).read()
    for tag in tags:
        currentRepetitionCount += 1
        if not totalCount >= timesToRepeat:
            if currentRepetitionCount == positionInput:
                #print("current",currentRepetitionCount)
                #print("total",totalCount)
                #print("Retrieving: ",url)
                currentRepetitionCount = 0
                totalCount +=1
                url = tag.get('href', None)
                print("Retrieving: ",url)
I'm getting this:
Retrieving: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Anona.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Zoe.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Carmyle.html
But what I SHOULD be getting is this:
Retrieving: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Mhairade.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Butchi.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Anayah.html
It seems like the link isn't changing: the code just finds the 3rd position on the initial page each time, and I cannot for the life of me fix it.
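Try to simplify your code so that it focuses on the main issue. For example, it does not need the additional check for if not totalCount >= timesToRepeat.
The underlying problem is that tags is built once, from the first page, before the loop starts, so every iteration picks its link from that same page. The current page has to be requested and parsed again inside the loop.
Be aware that I request the first URL inside the loop as well, to avoid repetition, and therefore added a +1 to timesToRepeat.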
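from bs4 import BeautifulSoup
import requests

url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
timesToRepeat = 4
positionInput = 3

# +1 so that the first URL is also printed and fetched inside the loop
for i in range(timesToRepeat+1):
    print(f'Retrieving: {url}')
    # Request and parse the current page on every iteration
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # positionInput is 1-based, list indices are 0-based
    tag = soup.select('a')[positionInput-1]
    url = tag.get('href')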
Retrieving: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Mhairade.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Butchi.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Anayah.html
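To try the "parse again on every hop" logic without network access, here is a minimal sketch. It is not the code from the answer: it swaps BeautifulSoup for the standard library's html.parser.HTMLParser so that it runs without third-party packages, and the names AnchorCollector, follow_links, fetch and the PAGES dictionary are hypothetical, made up for illustration. The point it shows is that the list of links is rebuilt from the current page on each iteration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def follow_links(start_url, position, repeats, fetch):
    """Return the list of URLs visited, starting with start_url.

    fetch is any callable that maps a URL to its HTML text
    (a hypothetical hook so the logic can be tested offline).
    """
    visited = [start_url]
    url = start_url
    for _ in range(repeats):
        parser = AnchorCollector()        # fresh parse of the current page
        parser.feed(fetch(url))
        url = parser.hrefs[position - 1]  # position is 1-based
        visited.append(url)
    return visited


# Tiny in-memory "site" (made-up pages) so the sketch runs offline.
PAGES = {
    "a.html": '<a href="x.html">x</a><a href="b.html">b</a>',
    "b.html": '<a href="y.html">y</a><a href="c.html">c</a>',
    "c.html": '<a href="z.html">z</a><a href="a.html">a</a>',
}

print(follow_links("a.html", position=2, repeats=3, fetch=PAGES.get))
# ['a.html', 'b.html', 'c.html', 'a.html']
```

Against the real pages, fetch could be something like lambda u: urllib.request.urlopen(u).read().decode(), and with BeautifulSoup the AnchorCollector step would be replaced by soup('a') on the freshly downloaded page.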