
Scraping Web data with Python

Sorry if this is not the place for this question, but I'm not sure where else to ask.

I'm trying to scrape data from rotogrinders.com and I'm running into some challenges.

In particular, I want to be able to scrape previous NHL game data using URLs of this format (obviously you can change the date for other days' data): https://rotogrinders.com/game-stats/nhl-skater?site=draftkings&date=11-22-2016

However, when I get to the page, I notice that the data is broken up into pages, and I'm unsure how to get my script to capture the data that's presented after clicking the "all" button at the bottom of the page.

Is there a way to do this in Python? Perhaps some library that will allow button clicks? Or is there some way to get the data without actually clicking the button, by being clever about the URL/request?

Actually, things are not that complicated in this case. When you click "All", no network requests are issued. All the data is already there, inside a script tag in the HTML; you just need to extract it.

Working code using requests (to download the page content), BeautifulSoup (to parse the HTML and locate the desired script element), re (to extract the desired "player" array from the script), and json (to load the array string into a Python list):

import json
import re

import requests
from bs4 import BeautifulSoup

url = "https://rotogrinders.com/game-stats/nhl-skater?site=draftkings&date=11-22-2016"
response = requests.get(url)

soup = BeautifulSoup(response.content, "html.parser")
pattern = re.compile(r"var data = (\[.*?\]);$", re.MULTILINE | re.DOTALL)

script = soup.find("script", text=pattern)

data = pattern.search(script.text).group(1)
data = json.loads(data)

# printing player names for demonstration purposes
for player in data:
    print(player["player"])

Prints:

Jeff Skinner
Jordan Staal
...
William Carrier
A.J. Greer
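
If you need more than one day's data, the same extraction can be wrapped in a small helper that varies the date query parameter, as the question notes. This is a minimal sketch, not part of the original answer: the fetch_players name is hypothetical, and the MM-DD-YYYY date format is assumed from the example URL.

import json
import re

import requests
from bs4 import BeautifulSoup

def fetch_players(date):
    # hypothetical helper: same extraction as above, parameterized by date (assumed MM-DD-YYYY)
    url = "https://rotogrinders.com/game-stats/nhl-skater"
    response = requests.get(url, params={"site": "draftkings", "date": date})

    soup = BeautifulSoup(response.content, "html.parser")
    pattern = re.compile(r"var data = (\[.*?\]);$", re.MULTILINE | re.DOTALL)

    script = soup.find("script", text=pattern)
    return json.loads(pattern.search(script.text).group(1))

# example usage: collect player lists for a couple of dates
for date in ["11-21-2016", "11-22-2016"]:
    players = fetch_players(date)
    print(date, len(players), "players")

Passing params to requests.get builds the same query string as the URL in the question, so nothing else about the extraction changes.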
