Python Beautiful Soup: scraping a page containing JavaScript

I am trying to scrape this page, but I am having trouble getting some of the data: http://www.scoresway.com/?sport=basketball&page=match&id=45926

The second table on the page contains the home team box score. The box score is split between 'basic' and 'advanced' stats. This code prints the 'basic' total stats for the home team:

from BeautifulSoup import BeautifulSoup
import requests

gameId = 45926
url = 'http://www.scoresway.com/?sport=basketball&page=match&id=' + str(gameId)
r = requests.get(url)
soup = BeautifulSoup(r.content)

for x in soup.findAll('table')[1].findAll('tr')[-1].findAll('td'):
    print ''.join(x.findAll(text=True))

To see the 'advanced' stats, you click the 'Advanced' link, and the page shows them without navigating away. I want to scrape that info as well but don't know how to get at it.

The Advanced tab is loaded by a separate request. Simulate that request and parse the response with BeautifulSoup.

For example, here's code that prints all of the players in the table:

import requests
from bs4 import BeautifulSoup


ADVANCED_URL = "http://www.scoresway.com/b/block.teama_people_match_stat.advanced?has_wrapper=true&match_id=45926&sport=basketball&localization_id=www"

response = requests.get(ADVANCED_URL)
soup = BeautifulSoup(response.text)
print [td.text.strip() for td in soup('td', class_='name')]

Prints:

[u'T. Chandler  *', 
 u'K. Durant  *', 
 u'L. James  *',
 u'R. Westbrook',
 ...
 u'C. Anthony']
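
Some of the names in that output carry a trailing `*`; what the marker means is not stated here. If only the bare name is wanted, a small helper along these lines could strip it. This is a minimal sketch: `clean_name` is a hypothetical helper, not part of the answer's code, and it assumes the marker always appears at the end of the cell text.

```python
import re


def clean_name(raw):
    """Drop a trailing '*' marker and collapse runs of whitespace.

    Hypothetical helper: assumes the marker, when present, is the
    last non-whitespace character of the scraped cell text.
    """
    without_marker = re.sub(r"\s*\*\s*$", "", raw)
    return " ".join(without_marker.split())
```

It can be applied to each entry of the scraped list, for example `[clean_name(n) for n in names]`.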

If you look at ADVANCED_URL, you'll see that the only "dynamic" parts of the URL's GET parameters are match_id and sport. To make the code reusable for other match pages on the site, you would need to fill in match_id and sport dynamically. Example implementation:

from bs4 import BeautifulSoup
import requests

BASE_URL = 'http://www.scoresway.com/?sport={sport}&page=match&id={match_id}'
ADVANCED_URL = "http://www.scoresway.com/b/block.teama_people_match_stat.advanced?has_wrapper=true&match_id={match_id}&sport={sport}&localization_id=www"


def get_match(sport, match_id):
    # basic: totals row (last row) of the second table on the match page
    r = requests.get(BASE_URL.format(sport=sport, match_id=match_id))
    soup = BeautifulSoup(r.content)

    for x in soup.findAll('table')[1].findAll('tr')[-1].findAll('td'):
        print ''.join(x.findAll(text=True))

    # advanced: player names from the separately requested 'Advanced' block
    response = requests.get(ADVANCED_URL.format(sport=sport, match_id=match_id))
    soup = BeautifulSoup(response.text)
    print [td.text.strip() for td in soup('td', class_='name')]


get_match('basketball', 45926)
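
For anyone on Python 3, the same idea might be sketched as below. This is an illustration rather than the answer's own code: the helper names (`basic_url`, `advanced_url`, `player_names`) are hypothetical, the query strings are built with `urllib.parse.urlencode` instead of string formatting, and the explicit `"html.parser"` argument and the 10-second timeout are choices made here. It also assumes the site still serves the same URLs and markup, which has not been verified.

```python
from urllib.parse import urlencode

SITE = "http://www.scoresway.com"
# Path of the block requested by the 'Advanced' tab, copied from ADVANCED_URL.
ADVANCED_PATH = "/b/block.teama_people_match_stat.advanced"


def basic_url(sport, match_id):
    """URL of the full match page, which carries the 'basic' box score."""
    query = urlencode({"sport": sport, "page": "match", "id": match_id})
    return f"{SITE}/?{query}"


def advanced_url(sport, match_id):
    """URL of the block that the 'Advanced' tab loads separately."""
    query = urlencode({
        "has_wrapper": "true",
        "match_id": match_id,
        "sport": sport,
        "localization_id": "www",
    })
    return f"{SITE}{ADVANCED_PATH}?{query}"


def player_names(sport, match_id):
    """Fetch the advanced block and return the text of each td.name cell."""
    # Imported here so the URL helpers work without third-party packages.
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(advanced_url(sport, match_id), timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return [td.text.strip() for td in soup.find_all("td", class_="name")]
```

Calling `player_names('basketball', 45926)` would make the same request as the answer's code and return the names as a list instead of printing them.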
