Web scraping a strange HTML setup with Python BeautifulSoup and urllib

The problem is not really extracting the data but locating it. I am scraping football data. This site lays the data out either in total (all years) or by year (season). However, the data contained in the HTML is the all-time data, not the data for the season you select, even though the site displays that season's statistics. Interestingly, when you load data for a season, the page first loads and briefly displays the all-time value of that variable. For example, the "td" tag on line 983 of the HTML source for this site says 515 (Chelsea's all-time wins) when I am viewing the page for Chelsea's wins that season, which should be 26. Can anyone explain this witchcraft, and how to scrape the data by season?

It looks like when you select a season, the page pulls from an API that returns the data in JSON format. This makes your job a lot easier, because JSON is easier to parse than HTML.

You can see the requests and responses in Chrome's web developer tools:

  • Press F12 while viewing the page in Chrome.
  • Go to the Network tab.
  • Click the Filter icon, then click XHR.

When you choose a season, you should see an XHR request to footballapi.pulselive.com.

For example: https://footballapi.pulselive.com/football/stats/ranked/teams/wins?page=0&pageSize=20&compSeasons=42&comps=1&altIds=true
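If you want to request more than one season, it may help to build that URL from its query parameters rather than hard-coding the string. The sketch below only reproduces the example URL above; the helper name `build_stats_url` is hypothetical, and the reading of `compSeasons` as a season ID is an assumption drawn from the parameter name, not from any API documentation. Which ID maps to which season is something you would have to read off the XHR requests in the dev tools.

```python
from urllib.parse import urlencode

# Endpoint taken from the XHR request shown above.
BASE_URL = "https://footballapi.pulselive.com/football/stats/ranked/teams/wins"


def build_stats_url(comp_season, page=0, page_size=20, comps=1):
    """Build the stats URL for one season (hypothetical helper).

    ASSUMPTION: `compSeasons` selects the season and `comps` the
    competition; this is inferred from the parameter names only.
    """
    params = {
        "page": page,
        "pageSize": page_size,
        "compSeasons": comp_season,
        "comps": comps,
        "altIds": "true",
    }
    # urlencode keeps the dict's insertion order and stringifies the values.
    return f"{BASE_URL}?{urlencode(params)}"


url = build_stats_url(42)
print(url)
```

With `comp_season=42` this yields exactly the example URL above, so changing that one argument should be enough to target another season, provided the assumption about `compSeasons` holds.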

Click on that URL in the dev tools and, in the pane to the right, click the Preview tab to see the response nicely formatted.

I think you'll be able to mimic these requests in your program. You may need to send some of the same request headers, because the API appears to block the request if you try to open the URL directly in the browser.
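A minimal sketch of mimicking the request with `urllib`, under some assumptions: I don't know which headers the API actually checks, so the `Origin`, `Referer`, and `User-Agent` values below are placeholders to be replaced with the values Chrome shows under Request Headers for the XHR request. The function names `build_request` and `fetch_json` are hypothetical, and nothing here assumes a particular shape for the JSON that comes back.

```python
import json
from urllib.request import Request, urlopen

# Example URL from the XHR request shown in the dev tools.
API_URL = (
    "https://footballapi.pulselive.com/football/stats/ranked/teams/wins"
    "?page=0&pageSize=20&compSeasons=42&comps=1&altIds=true"
)

# ASSUMPTION: placeholder values. Copy the real ones from the
# "Request Headers" section of the XHR request in Chrome's Network tab.
BROWSER_HEADERS = {
    "Origin": "https://www.example.com",
    "Referer": "https://www.example.com/",
    "User-Agent": "Mozilla/5.0",
}


def build_request(url, headers):
    """Create a GET request that carries the browser-like headers."""
    return Request(url, headers=headers)


def fetch_json(url, headers, timeout=10):
    """Send the request and decode the JSON body into Python objects."""
    with urlopen(build_request(url, headers), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_json(API_URL, BROWSER_HEADERS)` should return the same data that the Preview tab shows, as plain Python dicts and lists. If the server still rejects the request, compare your headers with the browser's one at a time to find the one it requires.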

