
Web-scraping a strange HTML setup with Python, BeautifulSoup and urllib

The problem is not really extracting the data, but locating it. I am scraping football data. This site lays it out either in total (all years) or by year (season). However, the data contained in the HTML is the all-time data, not the data for the season you select, even though the site displays the season's statistics. Interestingly, when you load data for a season, the page first loads and briefly displays the all-time value of that variable. For example, within the "td" tags on line 983 of the HTML source for this site, it says 515 (Chelsea's all-time wins) when I'm viewing the page for Chelsea's wins that season, which should be 26. Can anyone explain this witchcraft and how to scrape data by season?

Looks like when you select a season, they pull from an API that returns the data in JSON format. This makes your job a lot easier because JSON is easier to parse than HTML.

You can see the requests and responses in Chrome web dev tools:

  • Press F12 when looking at the page in Chrome.
  • Go to the Network tab.
  • Click the Filter icon, then click XHR.


When you choose a season you should see an XHR request to footballapi.pulselive.com.

For example https://footballapi.pulselive.com/football/stats/ranked/teams/wins?page=0&pageSize=20&compSeasons=42&comps=1&altIds=true

Click on that URL in the dev tools and to the right, click the Preview tab to see the response formatted nicely.
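The example URL above can be rebuilt from its query parameters, which makes it easier to change the season programmatically. This is only a sketch: the helper name `build_stats_url` is made up for illustration, and the reading of `compSeasons` as a season ID and `page`/`pageSize` as pagination is an assumption based on the parameter names, not on any documentation.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL seen in the Network tab.
BASE_URL = "https://footballapi.pulselive.com/football/stats/ranked/teams/wins"


def build_stats_url(comp_season, page=0, page_size=20):
    """Build the stats URL for one season (hypothetical helper).

    Assumption: compSeasons selects the season; the other values are
    copied unchanged from the example request.
    """
    params = {
        "page": page,
        "pageSize": page_size,
        "compSeasons": comp_season,
        "comps": 1,
        "altIds": "true",
    }
    return BASE_URL + "?" + urlencode(params)


url = build_stats_url(42)
print(url)
```

To find the ID for another season, select that season on the page and read the `compSeasons` value from the new XHR request.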

I think you'll be able to mimic these requests in your program. You may need to send some of the same request headers the browser sends, because the API appears to block the request if you open the URL directly in the browser.
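A minimal sketch of mimicking the request with urllib might look like this. The header values here are placeholders, not the real ones: which headers the API actually checks is an assumption, so copy the values from the Request Headers section of the XHR entry in the dev tools. The function names are also made up for illustration.

```python
import json
from urllib.request import Request, urlopen

URL = (
    "https://footballapi.pulselive.com/football/stats/ranked/teams/wins"
    "?page=0&pageSize=20&compSeasons=42&comps=1&altIds=true"
)

# Placeholder values (assumption): replace with the headers your
# browser sends, as shown in the Network tab.
HEADERS = {
    "Origin": "https://www.example.com",
    "User-Agent": "Mozilla/5.0",
}


def build_request(url, headers):
    """Create a GET request carrying the copied browser headers."""
    return Request(url, headers=headers)


def fetch_json(url, headers, timeout=10):
    """Send the request and decode the JSON body into Python objects."""
    with urlopen(build_request(url, headers), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (performs a network call, so it is left commented out):
# data = fetch_json(URL, HEADERS)
# print(data.keys())
```

Since the structure of the response is not shown here, inspect it in the Preview tab first and then pick out the fields you need from the decoded dictionary.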
