
Loop through URL using Python

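I've looked at a few questions, but none of the answers seem to fit. I'm building a web scraper as a personal project. I have the loop that gets rider data for the Vuelta 2022 working, but I need to loop through the URLs for every stage. For some reason, the URL loop only uses the last number in the range. My hunch is that it's a formatting problem, so I've tried playing with that, but no luck.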

import requests
from bs4 import BeautifulSoup
import pandas as pd

for j in range (1,10):
    url = (f"https://www.lavuelta.es/en/rankings/stage-{j}")
    page = requests.get(url)
    urlt = page.content
    soup = BeautifulSoup(urlt)
    rider_rank_list = []
for i in range (1,11):
#create list of riders
    results = soup.select_one(f"body > main > div > section.ranking.classements > div > div > div.js-tabs-wrapper.js-tabs-bigwrapper > div > div > div > div > div.js-spinner-wrapper > div > div.sticky-scroll > table > tbody > tr:nth-child({i}) > td.runner.is-sticky > a ")

        
#create rider rank list
    rrank = soup.select_one(f"body > main > div > section.ranking.classements > div > div > div.js-tabs-wrapper.js-tabs-bigwrapper > div > div > div > div > div.js-spinner-wrapper > div > div.sticky-scroll > table > tbody > tr:nth-child({i}) > td:nth-child(1)")


#create stage name
    stage = str.replace(str.title(url.rsplit('/', 1)[-1]),'-',' ')

    rider_rank_list.append((str(stage),str.strip(results.text), str.strip(rrank.text)))


    
print(rider_rank_list)
df = pd.DataFrame(rider_rank_list, columns=['stage','rider','rank'], index=None)
print(df)

df.to_csv('data.csv', index=False)


All the data is in a single table, so there is no next-page option. You can use pandas alone to read it into a DataFrame, since the page is static HTML.

import pandas as pd
url = "https://www.lavuelta.es/en/rankings/stage#"
# to_csv() returns None when given a path, so keep the DataFrame in its own variable
df = pd.read_html(url)[0]
df.to_csv('data.csv', index=False)
print(df)

Output:

    Rank                      Rider  Rider No.  ...             Gap         B  P
0       1            REMCO EVENEPOEL        134  ...               -  B : 10''  -
1       2                  ENRIC MAS        124  ...  + 00h 00' 02''   B : 6''  -
2       3              ROBERT GESINK          4  ...  + 00h 00' 02''   B : 4''  -
3       4                JAI HINDLEY         44  ...  + 00h 00' 13''         -  -
4       5            THYMEN ARENSMAN        151  ...  + 00h 00' 13''         -  -
..    ...                        ...        ...  ...             ...       ... ..
129   130            KENNY ELISSONDE        163  ...  + 00h 33' 24''         -  -
130   131             DAVIDE CIMOLAI         53  ...  + 00h 33' 31''         -  -
131   132                ALEX KIRSCH        165  ...  + 00h 35' 17''         -  -
132   133              MADS PEDERSEN        167  ...  + 00h 35' 17''         -  -
133   134  IVO MANUEL ALVES OLIVEIRA        173  ...  + 00h 35' 48''         -  -

[134 rows x 8 columns]

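Fixed the indentation, with minor changes. In the original, rider_rank_list was reset on every pass of the URL loop and the rider loop sat outside it, so only the last page fetched was ever parsed. Here the list is created once and the rider loop is nested inside the URL loop.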

import requests
from bs4 import BeautifulSoup
import pandas as pd

rider_rank_list = []

for j in range (1,10):
    url = (f"https://www.lavuelta.es/en/rankings/stage-{j}")
    page = requests.get(url)
    urlt = page.content
    soup = BeautifulSoup(urlt)
    
    for i in range (1,11):
        #create list of riders
        results = soup.select_one(f"body > main > div > section.ranking.classements > div > div > div.js-tabs-wrapper.js-tabs-bigwrapper > div > div > div > div > div.js-spinner-wrapper > div > div.sticky-scroll > table > tbody > tr:nth-child({i}) > td.runner.is-sticky > a ")

        if results is not None:
        
            #create rider rank list
            rrank = soup.select_one(f"body > main > div > section.ranking.classements > div > div > div.js-tabs-wrapper.js-tabs-bigwrapper > div > div > div > div > div.js-spinner-wrapper > div > div.sticky-scroll > table > tbody > tr:nth-child({i}) > td:nth-child(1)")

            #create stage name
            stage = str.replace(str.title(url.rsplit('/', 1)[-1]),'-',' ')
        
            rider_rank_list.append((str(stage),str.strip(results.text), str.strip(rrank.text)))


    
print(rider_rank_list)
df = pd.DataFrame(rider_rank_list, columns=['stage','rider','rank'], index=None)
print(df)

df.to_csv('data.csv', index=False)
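
To see why the indentation matters, here is a stripped-down sketch with no network access. `fake_fetch` is a hypothetical stand-in for the requests + BeautifulSoup step, and the rider names are made up for illustration; only the loop structure mirrors the code above.

```python
# Hypothetical stand-in for requests.get + BeautifulSoup: returns three
# made-up rider names for a stage without touching the network.
def fake_fetch(stage):
    return [f"rider-{stage}-{pos}" for pos in range(1, 4)]


def buggy(stages):
    """Mirrors the question: the list is reset on every pass, and the
    rider loop runs once, after the URL loop has already finished."""
    for stage in stages:
        riders = fake_fetch(stage)
        rows = []  # reset on every iteration
    for pos, rider in enumerate(riders, start=1):  # only sees the last stage
        rows.append((f"Stage {stage}", rider, pos))
    return rows


def fixed(stages):
    """Mirrors the fix: the list is created once, and the rider loop is nested."""
    rows = []
    for stage in stages:
        riders = fake_fetch(stage)
        for pos, rider in enumerate(riders, start=1):
            rows.append((f"Stage {stage}", rider, pos))
    return rows


buggy_rows = buggy(range(1, 10))
fixed_rows = fixed(range(1, 10))
print(len(buggy_rows), len(fixed_rows))  # prints: 3 27
```

The buggy version returns three rows, all labelled "Stage 9", which matches the symptom in the question: only the last number in the range shows up.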

Inspired by the other answers, here is a simpler and possibly more readable solution to your problem, with the output in tabular format:

import pandas as pd

al=pd.DataFrame()

for i in range(2, 19):  # stage pages run from 2 to 18
    url = f"https://www.lavuelta.es/en/rankings/stage-{i}"
    df = pd.read_html(url)[0]

    # Keep only the "Rider" column for the top 10 ranked riders
    df = df[["Rider"]][:10]

    # Rename the column so it carries the stage number
    df.columns = [f"Stage - {i} - Rider"]

    # Append this stage's column to the right of the previous ones
    al = pd.concat([al, df], axis=1)


al.to_csv('data.csv', index=False)
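
If one row per rider per stage is preferred over one column per stage, a long-format variant might look like the sketch below. It assumes each stage table has "Rank" and "Rider" columns, as in the output shown earlier. The function names `fetch_from_site` and `collect_stage_rankings` and the `fetch_table` parameter are my own, introduced so the combining step can be tried without network access.

```python
import pandas as pd


def fetch_from_site(stage):
    # Same pd.read_html call as above; needs network access and an HTML
    # parser that pandas can use (for example lxml).
    url = f"https://www.lavuelta.es/en/rankings/stage-{stage}"
    return pd.read_html(url)[0]


def collect_stage_rankings(stages, fetch_table=fetch_from_site, top_n=10):
    """Stack the top_n rows of every stage into one long DataFrame.

    fetch_table is any callable that takes a stage number and returns a
    DataFrame with "Rank" and "Rider" columns (an assumption based on the
    output shown earlier).
    """
    frames = []
    for stage in stages:
        top = fetch_table(stage)[["Rank", "Rider"]].head(top_n).copy()
        top.insert(0, "Stage", f"Stage {stage}")  # label every row with its stage
        frames.append(top)
    # Concatenate once at the end instead of growing a DataFrame in the loop
    return pd.concat(frames, ignore_index=True)
```

Under those assumptions, `collect_stage_rankings(range(2, 19)).to_csv('data.csv', index=False)` would write a three-column file with stage, rank and rider, which is close to the layout the question was building.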

