Python Beautiful Soup: scrape a page containing JavaScript
I am trying to scrape this page: http://www.scoresway.com/?sport=basketball&page=match&id=45926 but am having trouble getting some of the data.
The second table on the page contains the home team box score, which is split between 'basic' and 'advanced' stats. This code prints the 'basic' total stats for the home team:
from BeautifulSoup import BeautifulSoup
import requests

gameId = 45926
url = 'http://www.scoresway.com/?sport=basketball&page=match&id=' + str(gameId)
r = requests.get(url)
soup = BeautifulSoup(r.content)

# Second table on the page, last row: the home team's 'basic' totals.
for x in soup.findAll('table')[1].findAll('tr')[-1].findAll('td'):
    print ''.join(x.findAll(text=True))
To see the 'advanced' stats, you click the 'Advanced' link, which displays them without leaving the page. I want to scrape that information as well, but don't know how to get at it.
The advanced tab is loaded by a separate request. Simulate that request and parse the response with BeautifulSoup.
For example, here's code that prints all of the players in the table:
import requests
from bs4 import BeautifulSoup

# URL of the block that the 'Advanced' tab requests.
ADVANCED_URL = "http://www.scoresway.com/b/block.teama_people_match_stat.advanced?has_wrapper=true&match_id=45926&sport=basketball&localization_id=www"

response = requests.get(ADVANCED_URL)
soup = BeautifulSoup(response.text)

# Each player name sits in a <td class="name"> cell.
print [td.text.strip() for td in soup('td', class_='name')]
Prints:
[u'T. Chandler *',
u'K. Durant *',
u'L. James *',
u'R. Westbrook',
...
u'C. Anthony']
If you look at ADVANCED_URL, you'll see that the only "dynamic" parts of the URL's GET parameters are match_id and sport. To make the code reusable for other pages like this on the site, you would need to fill in match_id and sport dynamically. Example implementation:
from bs4 import BeautifulSoup
import requests

BASE_URL = 'http://www.scoresway.com/?sport={sport}&page=match&id={match_id}'
ADVANCED_URL = "http://www.scoresway.com/b/block.teama_people_match_stat.advanced?has_wrapper=true&match_id={match_id}&sport={sport}&localization_id=www"


def get_match(sport, match_id):
    # basic
    r = requests.get(BASE_URL.format(sport=sport, match_id=match_id))
    soup = BeautifulSoup(r.content)
    for x in soup.findAll('table')[1].findAll('tr')[-1].findAll('td'):
        print ''.join(x.findAll(text=True))

    # advanced
    response = requests.get(ADVANCED_URL.format(sport=sport, match_id=match_id))
    soup = BeautifulSoup(response.text)
    print [td.text.strip() for td in soup('td', class_='name')]


get_match('basketball', 45926)
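As a possible variation on filling in match_id and sport, the two URLs can also be assembled from their query parameters with the standard library rather than with string templates. The following is only a sketch, written for Python 3 and using nothing beyond `urllib.parse`; the helper names `build_basic_url` and `build_advanced_url` and the constants `SITE` and `ADVANCED_PATH` are hypothetical, and the parameter names are simply copied from the URLs above. Whether the site still serves these endpoints is not verified here.

```python
from urllib.parse import urlencode

# Host and block path copied from the URLs quoted in the answer.
SITE = "http://www.scoresway.com/"
ADVANCED_PATH = "b/block.teama_people_match_stat.advanced"


def build_basic_url(sport, match_id):
    """Return the URL of the match page (the 'basic' box score)."""
    query = urlencode({"sport": sport, "page": "match", "id": match_id})
    return SITE + "?" + query


def build_advanced_url(sport, match_id):
    """Return the URL of the block requested by the 'Advanced' tab."""
    query = urlencode({
        "has_wrapper": "true",
        "match_id": match_id,
        "sport": sport,
        "localization_id": "www",
    })
    return SITE + ADVANCED_PATH + "?" + query


# Only the sport and the match id change between matches.
basic_url = build_basic_url("basketball", 45926)
advanced_url = build_advanced_url("basketball", 45926)
print(basic_url)
print(advanced_url)
```

Either string can then be passed to `requests.get` exactly as in the answer. Compared with `str.format`, `urlencode` also escapes characters that are not safe in a query string, which may matter if a sport name ever contains spaces or punctuation.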