
web-scraping a strange html setup with Python-BeautifulSoup & urllib

Original 2017-05-20 00:33:26 4 1 javascript/ html5/ python-3.x/ web-scraping/ beautifulsoup

The problem is not really extracting the data, but locating it. I am scraping football data. The site presents each statistic either in total (all years) or per year (season). However, the data contained in the HTML is the all-time data, not the data for the season you select, even though the site displays the season statistics. Interestingly, when you load data for a season, the page first loads and briefly displays the all-time value of that variable. For example, the "td" tag on line 983 of the site's HTML source says 515 (Chelsea's all-time wins) when I'm viewing the page for Chelsea's wins that season, which should be 26. Can anyone explain this witchcraft and how to scrape data by season?

1 answer

Looks like when you select a season, they pull from an API that returns the data in JSON format. This makes your job a lot easier because JSON is easier to parse than HTML.
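To illustrate why JSON is easier to work with, here is a minimal sketch that parses a made-up payload with the standard `json` module. The field names (`entries`, `team`, `value`) and the second team are hypothetical; the real structure has to be read from the actual response. Only the value 26 for Chelsea comes from the question.

```python
import json

# Hypothetical payload: the real field names must be checked against the
# actual API response. Only Chelsea's 26 wins is taken from the question.
sample_response = """
{
  "entries": [
    {"team": "Chelsea", "value": 26},
    {"team": "Example FC", "value": 10}
  ]
}
"""

# json.loads turns the text into plain dicts and lists, so no tag
# navigation or CSS selectors are needed.
data = json.loads(sample_response)
wins_by_team = {entry["team"]: entry["value"] for entry in data["entries"]}

print(wins_by_team["Chelsea"])
```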

You can see the requests and responses in Chrome's developer tools:

1. Press F12 when looking at the page in Chrome.
2. Go to the Network tab.
3. Click the Filter icon, then click XHR.

When you choose a season you should see an XHR request to footballapi.pulselive.com.

For example https://footballapi.pulselive.com/football/stats/ranked/teams/wins?page=0&pageSize=20&compSeasons=42&comps=1&altIds=true
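The query string of that URL can be rebuilt in Python, which makes it easy to request a different season. This is a sketch under assumptions: the function name `build_stats_url` is made up for illustration, and the guess that `compSeasons` identifies the season comes only from the parameter name, so confirm it by comparing the requests that appear when you switch seasons.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL above.
BASE_URL = "https://footballapi.pulselive.com/football/stats/ranked/teams/wins"


def build_stats_url(comp_season, page=0, page_size=20):
    """Rebuild the example URL for a given compSeasons value."""
    params = {
        "page": page,
        "pageSize": page_size,
        "compSeasons": comp_season,  # assumed to be the season identifier
        "comps": 1,
        "altIds": "true",
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_stats_url(42))
```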

Click on that URL in the dev tools and to the right, click the Preview tab to see the response formatted nicely.

I think you'll be able to mimic these requests in your program. You may need to send some of the same request headers, because the API appears to block requests made by opening the URL directly in the browser.
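A rough sketch of mimicking the request with `urllib` might look like the following. The header names and values here are placeholders, not confirmed requirements of this API: copy the real ones from the Request Headers section of the XHR entry in the dev tools and drop them one at a time to find out which are needed. The helper names `build_request` and `fetch_json` are made up for illustration.

```python
import json
from urllib.request import Request, urlopen

# Example URL from the XHR request described above.
API_URL = (
    "https://footballapi.pulselive.com/football/stats/ranked/teams/wins"
    "?page=0&pageSize=20&compSeasons=42&comps=1&altIds=true"
)

# Placeholder values (assumptions): replace them with the headers the
# browser actually sends, as shown in the dev tools.
ASSUMED_HEADERS = {
    "Origin": "https://www.example.com",
    "User-Agent": "Mozilla/5.0",
}


def build_request(url=API_URL, headers=ASSUMED_HEADERS):
    """Create a GET request carrying browser-like headers."""
    return Request(url, headers=dict(headers))


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body into Python objects."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_json(build_request())` performs the network request. If it fails with an HTTP error, the placeholder headers most likely need to be replaced with the ones seen in the browser.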



solution1 0 2017-05-20 01:19:46