How to iterate over links and access the one at a specific position?
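I'm working on an assignment where I need to use BeautifulSoup to parse this page: http://python-data.dr-chuck.net/known_by_Fikret.html
Basically, I need to print the initial URL, find the URL at position 3, visit that URL and find the link at position 3 on that page, and so on; this has to happen four times in total.
This is my code so far: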
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
#url = input('Enter - ')
url = "http://py4e-data.dr-chuck.net/known_by_Fikret.html"
timesToRepeat = '4'
positionInput = '3'
#timesToRepeat = input('Repeat how many times?: ')
#positionInput = input('Enter Position: ')
try:
    timesToRepeat = int(timesToRepeat)
    positionInput = int(positionInput)
except:
    print("please add an number")
    quit()
# Retrieve all of the anchor tags
totalCount = 0
currentRepetitionCount = 0
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
#Leave this all alone ^^^^
print("Retrieving: ",url)
for i in range(timesToRepeat):
    html = urllib.request.urlopen(url, context=ctx).read()
    for tag in tags:
        currentRepetitionCount += 1
        if not totalCount >= timesToRepeat:
            if currentRepetitionCount == positionInput:
                #print("current",currentRepetitionCount)
                #print("total",totalCount)
                #print("Retrieving: ",url)
                currentRepetitionCount = 0
                totalCount +=1
                url = tag.get('href', None)
                print("Retrieving: ",url)
This is what I get:
Retrieving: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Anona.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Zoe.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Carmyle.html
But this is what I should get:
Retrieving: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Mhairade.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Butchi.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Anayah.html
It seems the link never changes and the third position is found on the initial page every time. I can't for the life of me fix it.
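Try to simplify your code and keep the focus on the main issue. The underlying problem is that tags is built once, before the loop, so every pass keeps counting through the anchors of the first page instead of the page that was just retrieved. Once each page is fetched and parsed inside the loop, a check such as if not totalCount >= timesToRepeat: is no longer needed.
Note that I added +1 to timesToRepeat so that the first URL is requested inside the loop as well, which avoids duplicating the request code before the loop.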
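from bs4 import BeautifulSoup
import requests

url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
timesToRepeat = 4
positionInput = 3

# One extra pass so the starting URL is printed along with the four followed links
for i in range(timesToRepeat + 1):
    print(f'Retrieving: {url}')
    # Fetch and parse the current page on every pass
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # Positions are 1-based, list indexes are 0-based
    tag = soup.select('a')[positionInput - 1]
    url = tag.get('href')

Output: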
Retrieving: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Montgomery.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Mhairade.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Butchi.html
Retrieving: http://py4e-data.dr-chuck.net/known_by_Anayah.html
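
If you would rather stay with urllib, as in the course material, the same idea can be wrapped in a small function. This is only a sketch: the names follow_links, bs4_links, get_links and fake_site are made up for illustration, and the fake pages are invented data, not the real site. The link lookup is passed in as a parameter so the hop logic can be tried offline.

```python
import urllib.request


def bs4_links(url):
    """Fetch url and return the href of every anchor tag, in page order."""
    # Imported here so the hop logic below can be tried without bs4 installed
    from bs4 import BeautifulSoup
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    return [tag.get('href') for tag in soup.find_all('a')]


def follow_links(url, position, repeats, get_links=bs4_links):
    """Return the start URL followed by each URL reached by taking
    the link at the 1-based `position`, `repeats` times in a row."""
    visited = [url]
    for _ in range(repeats):
        # Key point: read the links of the page we are on *now*
        links = get_links(url)
        if not 1 <= position <= len(links):
            raise IndexError(f'{url} has no link at position {position}')
        url = links[position - 1]
        visited.append(url)
    return visited


# Offline demonstration with invented pages (hypothetical data)
fake_site = {
    'A': ['A1', 'A2', 'B', 'A4'],
    'B': ['B1', 'B2', 'C'],
    'C': ['C1', 'C2', 'D'],
}
print(follow_links('A', 3, 3, get_links=fake_site.__getitem__))
# ['A', 'B', 'C', 'D']
```

For the assignment itself, a call along the lines of follow_links(url, 3, 4) with the default get_links should visit the same pages as the loop above, assuming the site is reachable and every page has at least three links.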