I'm trying to retrieve data from the “Advanced Box Score Stats” tables on the following webpage: http://www.sports-reference.com/cbb/boxscores/2016-11-11-villanova.html
I tried using BeautifulSoup in a very broad way to retrieve all the tables:
import requests
from bs4 import BeautifulSoup
base_url = "http://www.sports-reference.com/cbb/boxscores/2016-11-11-villanova.html"
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
tables = soup.find_all("table")
for table in tables:
    print(table.get_text())
This retrieved only the “Basic Box Score Stats” tables, not the “Advanced Box Score Stats” I had hoped for.
Next, I tried getting more specific by using an lxml XPath query:
import requests
from lxml import html
page = requests.get('http://www.sports-reference.com/cbb/boxscores/2016-11-11-villanova.html')
tree = html.fromstring(page.content)
boxscore_Advanced = tree.xpath('//*[@id="box-score-advanced-lafayette"]/tbody/tr[1]/td[1]/text()')
print(boxscore_Advanced)
In doing so, it returned an empty list.
I've been struggling with this for a good amount of time and have tried to solve it with the help of several related posts.
Thank you in advance for any and all help!
There is no need to use Selenium and/or PhantomJS. The "Advanced Box Score Stats" tables are actually in the HTML; they are just inside HTML comments. Parse them:
import requests
from bs4 import BeautifulSoup, Comment
url = "http://www.sports-reference.com/cbb/boxscores/2016-11-11-villanova.html"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
# find the comments containing the desired tables
tables = soup.find_all(text=lambda text: text and isinstance(text, Comment) and 'Advanced Box Score Stats' in text)
# we have 2 tables - one for an opponent team
for table in tables:
    table_soup = BeautifulSoup(table, "html.parser")
    advanced_table = table_soup.select_one("table[id^=box-score-advanced]")
    for row in advanced_table("tr")[2:]:  # skip headers
        print(row.th.get_text())
    print("-------")
Prints the player names from the first columns of the advanced tables:
Nick Lindner
Monty Boykins
Matt Klinewski
Paulius Zalys
Auston Evans
Reserves
Myles Cherry
Kyle Stout
Eric Stafford
Lukas Jarrett
Hunter Janacek
Jimmy Panzini
School Totals
-------
Kris Jenkins
Phil Booth
Josh Hart
Jalen Brunson
Darryl Reynolds
Reserves
Donte DiVincenzo
Mikal Bridges
Eric Paschall
Tim Delaney
Dylan Painter
Denny Grace
Tom Leibig
Matt Kennedy
School Totals
-------
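To get whole rows rather than just the player names, the same comment-parsing idea can be extended. The following is only a minimal sketch: `SAMPLE_HTML` is a made-up, heavily trimmed stand-in for the real page (the player names and numbers are placeholders, not real data), and `commented_table_rows` is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the real page: the table sits inside an HTML comment.
SAMPLE_HTML = """
<div id="all_box-score-advanced-lafayette">
<!--
<table id="box-score-advanced-lafayette">
<thead>
<tr><th></th><th>Advanced Box Score Stats</th></tr>
<tr><th>Starters</th><th>MP</th><th>TS%</th></tr>
</thead>
<tbody>
<tr><th>Player A</th><td>35</td><td>.500</td></tr>
<tr><th>Player B</th><td>28</td><td>.612</td></tr>
</tbody>
</table>
-->
</div>
"""


def commented_table_rows(html, id_prefix="box-score-advanced"):
    """Return {table id: rows}, where each row is a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    # Comments are text nodes of type Comment; re-parse each one as HTML.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        inner = BeautifulSoup(comment, "html.parser")
        for table in inner.select('table[id^="%s"]' % id_prefix):
            rows = []
            for tr in table.find_all("tr"):
                cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
                rows.append(cells)
            result[table["id"]] = rows
    return result


rows_by_table = commented_table_rows(SAMPLE_HTML)
for table_id, rows in rows_by_table.items():
    print(table_id)
    for row in rows[2:]:  # skip the two header rows, as in the answer above
        print(row)
```

Against the real page you would pass `response.content` instead of `SAMPLE_HTML`; the number of header rows to skip is an assumption carried over from the answer above and may need adjusting if the site's markup changes.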
@snakecharmerb is on the right path: this table does not exist in the raw HTML and must be added by JavaScript at runtime.
Do this:
$ curl http://www.sports-reference.com/cbb/boxscores/2016-11-11-villanova.html | grep "box-score-advanced-lafayette"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 9891 0 9891 0 0 45371 0 --:--:-- --:--:-- --:--:-- 48965
<div id="all_box-score-advanced-lafayette" class="table_wrapper setup_commented commented">
<span class="section_anchor" id="box-score-advanced-lafayette_link" data-label="Advanced Box Score"></span>
<div class="overthrow table_container" id="div_box-score-advanced-lafayette">
<table class="sortable stats_table" id="box-score-advanced-lafayette" data-cols-to-freeze=1><caption> Table</caption>
100 141k 0 141k 0 0 349k 0 --:--:-- --:--:-- --:--:-- 363k
You see from the output that all that exists in the html is the container that the table gets built in.
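Keep in mind that grep matches text anywhere in the response, including inside HTML comments, so a hit alone does not say whether an element is live markup. One way to check is sketched below using only the standard library. `IdLocator`, `locate_id`, and the `SAMPLE` string are hypothetical names and data made up for illustration, not taken from the page.

```python
from html.parser import HTMLParser


class IdLocator(HTMLParser):
    """Record whether an id appears on a real tag, and collect comment bodies."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found_on_tag = False
        self.comments = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == self.target_id:
            self.found_on_tag = True

    def handle_comment(self, data):
        self.comments.append(data)


def locate_id(html, target_id):
    """Return 'live', 'commented', or 'absent' for the given element id."""
    outer = IdLocator(target_id)
    outer.feed(html)
    outer.close()
    if outer.found_on_tag:
        return "live"
    # Re-parse each comment body as HTML and look for the id there.
    for comment in outer.comments:
        inner = IdLocator(target_id)
        inner.feed(comment)
        inner.close()
        if inner.found_on_tag:
            return "commented"
    return "absent"


# Made-up sample: a live wrapper div around a commented-out table.
SAMPLE = '<div id="wrapper"><!-- <table id="stats"><tr><td>1</td></tr></table> --></div>'

print(locate_id(SAMPLE, "wrapper"))  # live
print(locate_id(SAMPLE, "stats"))    # commented
print(locate_id(SAMPLE, "missing"))  # absent
```

A result of "commented" suggests the markup can be parsed straight out of the comment; "absent" is the case where rendering the page with a browser engine is more likely to be needed.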
For scraping something like this, I recommend a headless-browser approach such as PhantomJS (http://phantomjs.org).