Loop through URL using Python
I've looked at several questions, but none of the answers seem to fit. I'm building a web scraper tool as a personal project. I already have a loop that fetches the Vuelta 2022 rider data, but I need to iterate over the URLs for every stage. For some reason, the URL loop only keeps the last number in the range. My gut says it's the formatting, so I've tried playing with it, but no luck.
import requests
from bs4 import BeautifulSoup
import pandas as pd

for j in range(1, 10):
    url = f"https://www.lavuelta.es/en/rankings/stage-{j}"
    page = requests.get(url)
    urlt = page.content
    soup = BeautifulSoup(urlt)
    rider_rank_list = []
    for i in range(1, 11):
        # create list of riders
        results = soup.select_one(f"body > main > div > section.ranking.classements > div > div > div.js-tabs-wrapper.js-tabs-bigwrapper > div > div > div > div > div.js-spinner-wrapper > div > div.sticky-scroll > table > tbody > tr:nth-child({i}) > td.runner.is-sticky > a ")
        # create rider rank list
        rrank = soup.select_one(f"body > main > div > section.ranking.classements > div > div > div.js-tabs-wrapper.js-tabs-bigwrapper > div > div > div > div > div.js-spinner-wrapper > div > div.sticky-scroll > table > tbody > tr:nth-child({i}) > td:nth-child(1)")
        # create stage name
        stage = str.replace(str.title(url.rsplit('/', 1)[-1]), '-', ' ')
        rider_rank_list.append((str(stage), str.strip(results.text), str.strip(rrank.text)))
print(rider_rank_list)
df = pd.DataFrame(rider_rank_list, columns=['stage', 'rider', 'rank'], index=None)
print(df)
df.to_csv('data.csv', index=False)
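The symptom can be reproduced without any scraping. A minimal sketch (hypothetical data) of the same pattern shows that re-creating the list inside the outer loop discards every earlier iteration, which is why only the last stage survives:

```python
rows = []
for j in range(1, 4):
    rows = []                  # re-created on every pass: previous stages are lost
    rows.append(f"stage-{j}")
print(rows)                    # → ['stage-3'] — only the last iteration remains
```

Moving `rows = []` above the loop keeps all iterations.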
All the data is in one table, so there is no next-page option. You can simply use a pandas DataFrame, since these are static tables.
import pandas as pd
url = "https://www.lavuelta.es/en/rankings/stage#"
df = pd.read_html(url)[0].to_csv('data.csv', index=False)
#print(df)
Output:
Rank Rider Rider No. ... Gap B P
0 1 REMCO EVENEPOEL 134 ... - B : 10'' -
1 2 ENRIC MAS 124 ... + 00h 00' 02'' B : 6'' -
2 3 ROBERT GESINK 4 ... + 00h 00' 02'' B : 4'' -
3 4 JAI HINDLEY 44 ... + 00h 00' 13'' - -
4 5 THYMEN ARENSMAN 151 ... + 00h 00' 13'' - -
.. ... ... ... ... ... ... ..
129 130 KENNY ELISSONDE 163 ... + 00h 33' 24'' - -
130 131 DAVIDE CIMOLAI 53 ... + 00h 33' 31'' - -
131 132 ALEX KIRSCH 165 ... + 00h 35' 17'' - -
132 133 MADS PEDERSEN 167 ... + 00h 35' 17'' - -
133 134 IVO MANUEL ALVES OLIVEIRA 173 ... + 00h 35' 48'' - -
[134 rows x 8 columns]
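One subtlety in the snippet above (and likely why `print(df)` is commented out): when `to_csv` is given a file path it writes the file and returns `None`, so chaining it onto `read_html(...)[0]` loses the DataFrame. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"Rank": [1, 2], "Rider": ["A", "B"]})

result = df.to_csv("data.csv", index=False)  # writes the file...
print(result)                                # ...and returns None, so `df = ...to_csv(...)` is None

csv_text = df.to_csv(index=False)            # with no path, to_csv returns the CSV as a string
print(csv_text.splitlines()[0])              # → Rank,Rider
```

Keeping the DataFrame in its own variable before calling `to_csv` avoids the surprise.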
Fixed the indentation; the changes are minimal.
import requests
from bs4 import BeautifulSoup
import pandas as pd

rider_rank_list = []
for j in range(1, 10):
    url = f"https://www.lavuelta.es/en/rankings/stage-{j}"
    page = requests.get(url)
    urlt = page.content
    soup = BeautifulSoup(urlt)
    for i in range(1, 11):
        # create list of riders
        results = soup.select_one(f"body > main > div > section.ranking.classements > div > div > div.js-tabs-wrapper.js-tabs-bigwrapper > div > div > div > div > div.js-spinner-wrapper > div > div.sticky-scroll > table > tbody > tr:nth-child({i}) > td.runner.is-sticky > a ")
        if results is not None:
            # create rider rank list
            rrank = soup.select_one(f"body > main > div > section.ranking.classements > div > div > div.js-tabs-wrapper.js-tabs-bigwrapper > div > div > div > div > div.js-spinner-wrapper > div > div.sticky-scroll > table > tbody > tr:nth-child({i}) > td:nth-child(1)")
            # create stage name
            stage = str.replace(str.title(url.rsplit('/', 1)[-1]), '-', ' ')
            rider_rank_list.append((str(stage), str.strip(results.text), str.strip(rrank.text)))
print(rider_rank_list)
df = pd.DataFrame(rider_rank_list, columns=['stage', 'rider', 'rank'], index=None)
print(df)
df.to_csv('data.csv', index=False)
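The `None` guard matters because `select_one` returns `None` when a selector matches nothing (e.g. a stage page with fewer rows than expected), and reading `.text` on `None` raises an `AttributeError`. A minimal sketch with an inline HTML snippet (hypothetical markup, not the actual lavuelta.es page):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<table><tr><td>REMCO EVENEPOEL</td></tr></table>", "html.parser")

hit = soup.select_one("tr:nth-child(1) > td")
miss = soup.select_one("tr:nth-child(2) > td")  # there is no second row

print(hit.text)  # → REMCO EVENEPOEL
print(miss)      # → None, so guard before touching .text
```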
Inspired by the other answers, here is a simpler and arguably more readable tabular solution to your question:
import pandas as pd

al = pd.DataFrame()
for i in range(2, 19):  # stages only run from 2 to 18
    url = f"https://www.lavuelta.es/en/rankings/stage-{i}"
    df = pd.read_html(url)[0]
    # keep only the top 10 ranked riders
    df = df[["Rider"]][:10]
    # rename the "Rider" column with the stage number
    df.columns = [f"Stage - {i} - Rider"]
    # append each stage's Rider column horizontally
    al = pd.concat([al, df], axis=1)
al.to_csv('data.csv', index=False)
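The horizontal stacking relies on `pd.concat` with `axis=1`, which adds each stage's column side by side rather than appending rows. A self-contained sketch of the same pattern with made-up rider names:

```python
import pandas as pd

al = pd.DataFrame()
for i in range(1, 3):
    # hypothetical per-stage top riders standing in for pd.read_html(url)[0]
    df = pd.DataFrame({f"Stage - {i} - Rider": [f"rider-{i}a", f"rider-{i}b"]})
    al = pd.concat([al, df], axis=1)  # axis=1 stacks columns, not rows

print(al.columns.tolist())  # → ['Stage - 1 - Rider', 'Stage - 2 - Rider']
print(al.shape)             # → (2, 2): 2 riders x 2 stage columns
```

Note that `axis=1` aligns on the row index, so stages with different numbers of rows would produce `NaN` padding.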