
How to use Python (preferably pandas) to scrape data from a JavaScript table?

I am using pandas to grab some ice hockey stats from a web page as shown below:

import pandas as pd

url_goal = 'http://www.quanthockey.com/nhl/records/nhl-players-all-time-goals-per-game-leaders.html'
df_goal = pd.read_html(url_goal, index_col=0, header=0)[0]

This works great, but switching to the second page of the stats table on the page does not change the URL, so I cannot use the same approach to grab more than the top 50 players. There is a JavaScript address for the table that does change as the page number switches. I read a little about selenium and beautifulsoup, but I don't have them installed, so I would prefer to do it without them if possible. So my question is two-fold:

  1. Is there any way to grab data from the different pages in this javascript table using only pandas and standard Python/SciPy libraries (Anaconda to be exact)?

  2. If not, how would you go about getting this data into a pandas data frame with the help of selenium or your package of choice?

Hint: Open the network analyzer in your browser and watch what happens when you navigate to different pages; you'll notice a GET request to a page like

http://www.quanthockey.com/scripts/AjaxPaginate.php?cat=Records&pos=Players&SS=&af=0&nat=alltime&st=reg&sort=goals-per-game&page=3&league=NHL&lang=en&rnd=451318572

Notice the page part of the query string.

You can just iterate over the range of page numbers, however many pages there are, and increase the page parameter in the query string by one each time.
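As a rough sketch of that loop using only pandas and the standard library: the parameter values below are copied from the example URL above, while the helper names `build_page_url` and `scrape_pages` are made up for illustration. It assumes the AJAX response contains an ordinary HTML table that `pd.read_html` can parse, and that the `rnd` value (presumably a cache-buster) can be reused as-is; neither is confirmed here.

```python
from urllib.parse import urlencode

# Endpoint taken from the GET request seen in the browser's network analyzer.
BASE_URL = "http://www.quanthockey.com/scripts/AjaxPaginate.php"


def build_page_url(page):
    """Return the AJAX URL for one page of the stats table (hypothetical helper)."""
    params = {
        "cat": "Records",
        "pos": "Players",
        "SS": "",
        "af": 0,
        "nat": "alltime",
        "st": "reg",
        "sort": "goals-per-game",
        "page": page,  # the only value that changes between requests
        "league": "NHL",
        "lang": "en",
        "rnd": 451318572,  # copied from the example; assumed to be a cache-buster
    }
    return BASE_URL + "?" + urlencode(params)


def scrape_pages(pages):
    """Read the first table from each page and stack them (hypothetical helper)."""
    import pandas as pd  # imported here so the URL helper works without pandas

    # Assumption: each response holds one HTML table, like the original page.
    frames = [
        pd.read_html(build_page_url(page), index_col=0, header=0)[0]
        for page in pages
    ]
    return pd.concat(frames)
```

Calling `scrape_pages(range(1, 4))` would then request pages 1 through 3 and return them as one data frame; how many pages exist has to be checked on the site itself.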
