
Activate button to get to next page while scraping (Python, BeautifulSoup)

I am trying to build a dataset of FIFA 2020 players. I'm just getting into web scraping with Python and BeautifulSoup, and I wanted to scrape this website: https://sofifa.com/?r=200061&set=true&showCol%5B%5D=ae&showCol%5B%5D=oa&showCol%5B%5D=pt&showCol%5B%5D=vl&showCol%5B%5D=hi&showCol%5B%5D=wi&showCol%5B%5D=pf&showCol%5B%5D=bo&showCol%5B%5D=pi So far, I'm able to get the content I want. But the website only shows the first 60 players, followed by a "next" button, and I don't know how to activate it to continue scraping on the next page. I want to get the data of all the players.

This is what I have so far:

import requests
from bs4 import BeautifulSoup
import pandas as pd
 
# create dataframe to store data
column_names = ["Name", "Age", "Overall Rating", "Potential", "Team", "Contract expiry", "Height", "Weight", "Strong foot", "Value"] 
df = pd.DataFrame(columns = column_names)


headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

page = "https://sofifa.com/?r=200054&set=true&showCol%5B%5D=ae&showCol%5B%5D=oa&showCol%5B%5D=pt&showCol%5B%5D=vl&showCol%5B%5D=hi&showCol%5B%5D=wi&showCol%5B%5D=pf&showCol%5B%5D=bo"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')

Players = pageSoup.find_all("a", {"class": "tooltip"})
Age = pageSoup.find_all("td", {"class": "col col-ae"})
OR = pageSoup.find_all("td", {"class": "col col-oa col-sort"})
PR = pageSoup.find_all("td", {"class": "col col-pt"})
Team = pageSoup.find_all("div", {"class": "bp3-text-overflow-ellipsis"})
contract = pageSoup.find_all("div", {"class": "sub"})
height = pageSoup.find_all("td", {"class": "col col-hi"})
weight = pageSoup.find_all("td", {"class": "col col-wi"})
PF = pageSoup.find_all("td", {"class": "col col-pf"})
Value = pageSoup.find_all("td", {"class": "col col-vl"})


Players_List = []
Age_List = []
OR_List = []
PR_List = []
Team_List = []
contract_List = []
height_List = []
weight_List = []
PF_List = []
Value_List = []

# The Team divs are interleaved with other elements, so their index
# advances by two per row (i + j, with j incremented each iteration).
j = 1

for i in range(0, 60):
    Players_List.append(Players[i].text)
    Age_List.append(Age[i].text)
    OR_List.append(OR[i].text)
    PR_List.append(PR[i].text)
    Team_List.append(Team[i + j].text)
    contract_List.append(contract[i].text)
    height_List.append(height[i].text)
    weight_List.append(weight[i].text)
    PF_List.append(PF[i].text)
    Value_List.append(Value[i].text)
    j = j + 1
df = pd.DataFrame({"Name":Players_List, "Age": Age_List, "Overall Rating":OR_List, "Potential":PR_List, "Team":Team_List, "Contract expiry":contract_List, "Height":height_List,"Weight":weight_List, "Strong foot":PF_List, "Value":Value_List})

Hope someone can help me here.

I noticed there's an `offset` parameter at the end of the link, so you can edit your code like this without needing Selenium:

number_of_pages = 10
page = "https://sofifa.com/?r=200061&set=true&showCol[]=ae&showCol[]=oa&showCol[]=pt&showCol[]=vl&showCol[]=hi&showCol[]=wi&showCol[]=pf&showCol[]=bo&showCol[]=pi&offset="
for num_page in range(number_of_pages):
    # each page shows 60 players, so the offset steps by 60
    pageTree = requests.get(page + str(num_page * 60), headers=headers)
    """
        Rest of the code
    """

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.
