
Activate button to get to next page while scraping (Python, BeautifulSoup)

I am trying to build a dataset of FIFA 2020 players. I have only just gotten into web scraping with Python and BeautifulSoup. I want to scrape this site: https://sofifa.com/?r=200061&set=true&showCol%5B%5D=ae&showCol%5B%5D=oa&showCol%5B%5D=pt&showCol%5B%5D=vl&showCol%5B%5D=hi&showCol%5B%5D=wi&showCol%5B%5D=pf&showCol%5B%5D=bo&showCol%5B%5D=pi. So far I am able to get the content I want. The problem is that the site only shows the first 60 players, followed by a "Next" button, and I don't know how to activate it to keep scraping on the next page. I would like to get the data for all players.

Here is what I have so far:

import requests
from bs4 import BeautifulSoup
import pandas as pd
 
# create dataframe to store data
column_names = ["Name", "Age", "Overall Rating", "Potential", "Team", "Contract expiry", "Height", "Weight", "Strong foot", "Value"] 
df = pd.DataFrame(columns = column_names)


# identify as a regular desktop browser so the request is not rejected
headers = {'User-Agent':
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://sofifa.com/?r=200054&set=true&showCol%5B%5D=ae&showCol%5B%5D=oa&showCol%5B%5D=pt&showCol%5B%5D=vl&showCol%5B%5D=hi&showCol%5B%5D=wi&showCol%5B%5D=pf&showCol%5B%5D=bo"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

Players = pageSoup.find_all("a", {"class": "tooltip"})
Age = pageSoup.find_all("td", {"class": "col col-ae"})
OR = pageSoup.find_all("td", {"class": "col col-oa col-sort"})
PR = pageSoup.find_all("td", {"class": "col col-pt"})
Team = pageSoup.find_all("div", {"class": "bp3-text-overflow-ellipsis"})
contract = pageSoup.find_all("div", {"class": "sub"})
height = pageSoup.find_all("td", {"class": "col col-hi"})
weight = pageSoup.find_all("td", {"class": "col col-wi"})
PF = pageSoup.find_all("td", {"class": "col col-pf"})
Value = pageSoup.find_all("td", {"class": "col col-vl"})


Players_List = []
Age_List = []
OR_List = []
PR_List = []
Team_List = []
contract_List = []
height_List = []
weight_List = []
PF_List = []
Value_List = []

# only every second "bp3-text-overflow-ellipsis" div holds a team name,
# so the Team index is offset by j (i + j == 2*i + 1)
j = 1

# the page lists 60 players
for i in range(0, 60):
    Players_List.append(Players[i].text)
    Age_List.append(Age[i].text)
    OR_List.append(OR[i].text)
    PR_List.append(PR[i].text)
    Team_List.append(Team[i+j].text)
    contract_List.append(contract[i].text)
    height_List.append(height[i].text)
    weight_List.append(weight[i].text)
    PF_List.append(PF[i].text)
    Value_List.append(Value[i].text)
    j = j + 1

df = pd.DataFrame({"Name": Players_List, "Age": Age_List, "Overall Rating": OR_List, "Potential": PR_List, "Team": Team_List, "Contract expiry": contract_List, "Height": height_List, "Weight": weight_List, "Strong foot": PF_List, "Value": Value_List})

I hope someone can help me with this.

I noticed that there is an offset parameter at the end of the link, so you can edit your code like this without having to use Selenium:

number_of_pages = 10
page = "https://sofifa.com/?r=200061&set=true&showCol[]=ae&showCol[]=oa&showCol[]=pt&showCol[]=vl&showCol[]=hi&showCol[]=wi&showCol[]=pf&showCol[]=bo&showCol[]=pi&offset="
for num_page in range(0, number_of_pages):
    # each page shows 60 players, so the offset grows in steps of 60
    pageTree = requests.get(page + str(num_page * 60), headers=headers)
    """
        Rest of the code
    """

