
Python Beautiful Soup: scrape page containing JavaScript

I am trying to scrape from this page: http://www.scoresway.com/?sport=basketball&page=match&id=45926

but having trouble getting some of the data.

The second table on the page contains the home team's box score, which is split between 'basic' and 'advanced' stats. This code prints the 'basic' total stats for the home team:

from bs4 import BeautifulSoup
import requests

gameId = 45926
url = 'http://www.scoresway.com/?sport=basketball&page=match&id=' + str(gameId)
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# The last row of the second table holds the home team's 'basic' totals.
for x in soup.find_all('table')[1].find_all('tr')[-1].find_all('td'):
    print(''.join(x.find_all(text=True)))

If you want to see the 'advanced' stats, you click the 'Advanced' link and the table updates without leaving the page. I want to scrape that info as well, but I don't know how to get at it.

There is a separate request issued for the 'advanced' tab. Simulate it and parse the response with BeautifulSoup.

For example, here's the code that prints all of the players in the table:

import requests
from bs4 import BeautifulSoup


ADVANCED_URL = "http://www.scoresway.com/b/block.teama_people_match_stat.advanced?has_wrapper=true&match_id=45926&sport=basketball&localization_id=www"

# Request the block that backs the 'Advanced' tab and parse it directly.
response = requests.get(ADVANCED_URL)
soup = BeautifulSoup(response.text, 'html.parser')
print([td.text.strip() for td in soup('td', class_='name')])

Prints:

['T. Chandler  *',
 'K. Durant  *',
 'L. James  *',
 'R. Westbrook',
 ...
 'C. Anthony']
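
As a side note, the same request can also be made by letting requests build the query string from a dict instead of hard-coding the full URL. This is only a minimal sketch of that alternative; the parameter names are just the ones visible in ADVANCED_URL above, nothing else is assumed:

import requests
from bs4 import BeautifulSoup

# Same endpoint as ADVANCED_URL, with the query string assembled by requests.
ADVANCED_ENDPOINT = "http://www.scoresway.com/b/block.teama_people_match_stat.advanced"

params = {
    "has_wrapper": "true",
    "match_id": 45926,
    "sport": "basketball",
    "localization_id": "www",
}
response = requests.get(ADVANCED_ENDPOINT, params=params)
soup = BeautifulSoup(response.text, 'html.parser')
print([td.text.strip() for td in soup('td', class_='name')])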

If you look at the ADVANCED_URL, you'll see that the only "dynamic" parts of the query string are the match_id and sport parameters. If you need the code to be reusable for other pages like this on the web-site, fill in match_id and sport dynamically. Example implementation:

from bs4 import BeautifulSoup
import requests

BASE_URL = 'http://www.scoresway.com/?sport={sport}&page=match&id={match_id}'
ADVANCED_URL = "http://www.scoresway.com/b/block.teama_people_match_stat.advanced?has_wrapper=true&match_id={match_id}&sport={sport}&localization_id=www"


def get_match(sport, match_id):
    # basic: the last row of the second table holds the team totals
    r = requests.get(BASE_URL.format(sport=sport, match_id=match_id))
    soup = BeautifulSoup(r.content, 'html.parser')

    for x in soup.find_all('table')[1].find_all('tr')[-1].find_all('td'):
        print(''.join(x.find_all(text=True)))

    # advanced: the block behind the 'Advanced' tab, requested separately
    response = requests.get(ADVANCED_URL.format(sport=sport, match_id=match_id))
    soup = BeautifulSoup(response.text, 'html.parser')
    print([td.text.strip() for td in soup('td', class_='name')])


get_match('basketball', 45926)
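
If you need more than the player names, the same response can be turned into structured rows. The sketch below is an assumption-heavy extension of the answer above: it presumes the advanced block contains a single table whose header cells are th elements and whose player rows have one td per header, which may not match the site's actual markup.

from bs4 import BeautifulSoup
import requests

ADVANCED_URL = "http://www.scoresway.com/b/block.teama_people_match_stat.advanced?has_wrapper=true&match_id={match_id}&sport={sport}&localization_id=www"


def get_advanced_rows(sport, match_id):
    """Return one dict per table row, mapping column header -> cell text."""
    response = requests.get(ADVANCED_URL.format(sport=sport, match_id=match_id))
    soup = BeautifulSoup(response.text, 'html.parser')

    table = soup.find('table')
    if table is None:
        return []

    # Assumption: header cells are <th> and each player row has one <td> per header.
    headers = [th.get_text(strip=True) for th in table.find_all('th')]
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells and len(cells) == len(headers):
            rows.append(dict(zip(headers, cells)))
    return rows


for row in get_advanced_rows('basketball', 45926):
    print(row)

Rows whose cell count does not match the header count (spacer or section rows, if any) are simply skipped, so the function degrades to an empty list rather than raising if the markup differs from the assumption.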
