
How to use Python (preferably pandas) to scrape data from a JavaScript table?

I am using pandas to grab some ice hockey stats from a web page, as shown below:

import pandas as pd

url_goal = 'http://www.quanthockey.com/nhl/records/nhl-players-all-time-goals-per-game-leaders.html'
df_goal = pd.read_html(url_goal, index_col=0, header=0)[0]

This works great, but switching to the second page of the stats table on the site does not change the URL, so I cannot use the same approach to grab more than the top 50 players. There is a JavaScript address for the table that does change as the page number switches. I read a little about selenium and beautifulsoup, but I don't have them installed, so I would prefer to do it without them if possible. So my question is two-fold:

  1. Is there any way to grab data from the different pages of this JavaScript table using only pandas and the standard Python/SciPy libraries (Anaconda, to be exact)?

  2. If not, how would you go about getting this data into a pandas data frame with the help of selenium or your package of choice?

Hint: Open the network analyzer in your browser and watch what happens when you navigate to different pages of the table; you'll notice a GET request to a URL like:

http://www.quanthockey.com/scripts/AjaxPaginate.php?cat=Records&pos=Players&SS=&af=0&nat=alltime&st=reg&sort=goals-per-game&page=3&league=NHL&lang=en&rnd=451318572

Notice the page part of the query string.

You can iterate over the range of page numbers the table has, changing the page parameter of the query string and increasing it by one for each request.
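As a rough sketch of that loop, using only pandas and the standard library: the query parameters below are copied from the request shown above, while the helper names build_page_url and scrape_pages, and the number of pages to fetch, are made up for illustration. It also assumes that the AJAX response contains an HTML table that pd.read_html can parse with the same options as the original call, and that the server accepts the rnd value unchanged; neither has been verified here.

```python
from urllib.parse import urlencode

BASE_URL = 'http://www.quanthockey.com/scripts/AjaxPaginate.php'

# Query parameters copied from the GET request seen in the network analyzer.
# 'page' is left out here because it is filled in for each request.
PARAMS = {
    'cat': 'Records',
    'pos': 'Players',
    'SS': '',
    'af': 0,
    'nat': 'alltime',
    'st': 'reg',
    'sort': 'goals-per-game',
    'league': 'NHL',
    'lang': 'en',
    'rnd': 451318572,  # assumption: reused as-is from the observed request
}


def build_page_url(page):
    """Return the AJAX URL for one page of the stats table (hypothetical helper)."""
    return BASE_URL + '?' + urlencode(dict(PARAMS, page=page))


def scrape_pages(n_pages):
    """Read pages 1..n_pages and stack them into one DataFrame (hypothetical helper)."""
    # Imported here so that build_page_url can be used without pandas installed.
    import pandas as pd

    frames = [
        pd.read_html(build_page_url(page), index_col=0, header=0)[0]
        for page in range(1, n_pages + 1)
    ]
    return pd.concat(frames)
```

Calling scrape_pages(3), for example, would request pages 1 to 3 and return them as a single data frame; the right page count has to be read off the site itself. If the response turns out not to be a plain HTML table, the read_html call would need adjusting.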
