I'm trying to scrape a page like this one.
The site loads all of the information from its server and stores it in a JavaScript script block, so that clicking a button only switches which part is displayed. I was trying to simply request the page and pull all the data out of that script. The structure of the page is roughly this:
<!DOCTYPE html>
<html lang="en" xmlns:wb="http://open.weibo.com/wb">
<head>
<meta charset="utf-8">
<title>Historical Statistics of Kristiansund BK vs Molde on 2020/07/03 - ScoreBing</title>
#Several script tags over here....
</head>
<body class="vEn">
#Some stuff here...
#This is where the buttons that deploy the data are
<div class="panel-body">
<div id="live-filter-bar">
<div class="row MBTitle">
<div class="small-6 columns PL0">
<a href="javascript:set_type(1);" id="tabtypeid1" class="button tiny radius MB0 MRMini VM font-bold">All</a>
<a href="javascript:set_type(2);" id="tabtypeid2" class="button tiny radius action MB0 MRMini VM">This League</a>
<a href="javascript:" onClick="select(1)" id="tabid1" class="button tiny radius MB0 MRMini VM">All</a>
<a href="javascript:" onClick="select(2)" id="tabid2" class="button tiny radius action MB0 MRMini VM">HA</a>
<a href="javascript:" onClick="select(3)" id="tabid3" class="button tiny radius action MB0 MRMini VM">AH</a>
<a href="javascript:" onClick="select(4)" id="tabid4" class="button tiny radius action MB0 MRMini VM">HH</a>
<a href="javascript:" onClick="select(5)" id="tabid5" class="button tiny radius action MB0 MRMini VM">AA</a>
</div>
<div class="small-6 columns text-right PR0">
<a href="javascript:set_num(10);" id=td10 class="button tiny radius action MB0 MRMini VM">Last 10</a>
<a href="javascript:set_num(8);" id=td8 class="button tiny radius action MB0 MRMini VM">Last 8</a>
<a href="javascript:set_num(6);" id=td6 class="button tiny radius MB0 MRMini VM">Last 6</a>
<a href="javascript:set_num(4);" id=td4 class="button tiny radius action MB0 MRMini VM">Last 4</a>
</div>
</div>
</div>
<div id="history_table">
</div>
<div id="history1">
</div>
<div id="history2">
</div>
</div>
</body>
</html>
<script type="text/javascript">
var kind=1,num=6,typenum=1;
var race=[],league_bgcolor=[],league_i= 1;
var race_have_corner_handicap=1;
var home_id = [];
var guest_id = [];
home_id.push(1405); guest_id.push(4503);
var sclass='',leaue_id=198;
var tongji_info=[];
var half_goal_av='-',goal_av='-',half_corner_av='-',corner_av='-';
var tmp_host_name,tmp_guest_name,tmp_league_name;
tmp_host_name = "Mjondalen";
tmp_guest_name = "Kristiansund BK";
tmp_league_name = "Norway Tippeligaen";
race[0]=[746711,198,'20/06/29 12:01','903',1410,tmp_host_name,'957',1405,tmp_guest_name,'0.0','2.5','11 ',tmp_league_name,'2' ,'1','0','0','3',' 3','1','2','0.0','5.5','1.0',1];
tmp_host_name = "Molde";
tmp_guest_name = "Stabaek";
tmp_league_name = "Norway Tippeligaen";
race[1]=[746712,198,'20/06/29 12:00','661',4503,tmp_host_name,'1162',1396,tmp_guest_name,'-1.0','3.0','10 ',tmp_league_name,'2' ,'1','1','0','5',' 3','4','1','-0.5','4.5','1.0,1.5',1];
tmp_host_name = "Haugesund";
tmp_guest_name = "Kristiansund BK";
tmp_league_name = "Norway Tippeligaen";
race[2]=[746167,198,'20/06/25 12:00','673',1390,tmp_host_name,'957',1405,tmp_guest_name,'0.0,-0.5','2.5','10.5 ',tmp_league_name,'4' ,'1','0','1','8',' 3','2','2','0.0','5','1.0',1];
tmp_host_name = "IK Start";
tmp_guest_name = "Molde";
tmp_league_name = "Norway Tippeligaen";
race[3]=[746169,198,'20/06/25 12:00','667',1392,tmp_host_name,'661',4503,tmp_guest_name,'+1.0','3.0','10 ',tmp_league_name,'4' ,'3','1','2','6',' 8','2','3','+0.5','4.5','1.0,1.5',1];
tmp_host_name = "Kristiansund BK";
tmp_guest_name = "Aalesund";
tmp_league_name = "Norway Tippeligaen";
race[4]=[744697,198,'20/06/22 12:01','957',1405,tmp_host_name,'677',1321,tmp_guest_name,'0.0,-0.5','2.5','10.5 ',tmp_league_name,'4' ,'2','3','2','6',' 2','7','2','0.0','5','1.0',1];
tmp_host_name = "Molde";
tmp_guest_name = "Rosenborg";
tmp_league_name = "Norway Tippeligaen";
race[5]=[744698,198,'20/06/21 02:30','661',4503,tmp_host_name,'1161',2482,tmp_guest_name,'0.0,-0.5','2.5','10.5 ',tmp_league_name,'4' ,'0','0','0','9',' 4','1','0','0.0','5','1.0',1];
tmp_host_name = "Aalesund";
tmp_guest_name = "Molde";
tmp_league_name = "Norway Tippeligaen";
race[6]=[743531,198,'20/06/17 12:00','677',1321,tmp_host_name,'661',4503,tmp_guest_name,'0.0,+0.5','2.5,3.0','9.5 ',tmp_league_name,'8' ,'1','1','2','8',' 4','1','4','0.0,+0.5','4.5','1.0,1.5',1];
tmp_host_name = "Rosenborg";
tmp_guest_name = "Kristiansund BK";
tmp_league_name = "Norway Tippeligaen";
race[7]=[743533,198,'20/06/17 12:00','1161',2482,tmp_host_name,'957',1405,tmp_guest_name,'-1.0','2.5,3.0','11.5 ',tmp_league_name,'7' ,'4','0','0','8',' 9','0','0','0.0,-0.5','5.5','1.0',1];
The script tag goes on well past this snippet. I have two problems. First, when I call response = requests.get(url=url) and look at response.content, the output seems to stop at the closing html tag, so the script tag with all the data is not included. How do I get it with requests?
Second, how do I extract the data once I have it?
Well, it appears that requests already returns the script; it is simply a parser setting that should be adjusted in BeautifulSoup:
import re
import requests
from bs4 import BeautifulSoup
headers = {
'authority': 'www.scorebing.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-user': '?1',
'sec-fetch-dest': 'document',
'accept-language': 'en-US,en;q=0.9',
}
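response = requests.get('https://www.scorebing.com/match_history/747514', headers=headers)
# BeautifulSoup takes the encoding through from_encoding; there is no encoding keyword
soup = BeautifulSoup(response.content, 'html.parser', from_encoding='UTF-8')
# Find the script tag whose text contains a known variable name
script = soup.find('script', string=re.compile('race_have_corner_handicap'))
print(script)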
Output
<script type="text/javascript">
var is_en = 1;
var kind=1,num=6,typenum=1;
var race=[],league_bgcolor=[],league_i= 1;
var race_have_corner_handicap=1;
var home_id = [];
var guest_id = [];
home_id.push(1405); guest_id.push(4503);
...
</script>
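To check whether the data is missing from the response itself or only lost during parsing, the raw text can be searched without any HTML parser. What follows is a minimal sketch using only the standard library's re module, not the method from the answer above. The helper name find_script_containing is hypothetical, and sample_html is a shortened stand-in for response.text with the script placed after the closing html tag, as in the question.

```python
import re

# Matches each <script ...>...</script> block and captures its body.
SCRIPT_RE = re.compile(r"<script\b[^>]*>(.*?)</script>", re.DOTALL | re.IGNORECASE)


def find_script_containing(html, marker):
    """Return the body of the first script block that contains marker, or None."""
    for body in SCRIPT_RE.findall(html):
        if marker in body:
            return body
    return None


# Shortened stand-in for response.text (an assumption, not the real page).
sample_html = (
    "<!DOCTYPE html><html><head></head><body></body></html>\n"
    '<script type="text/javascript">\n'
    "var race_have_corner_handicap=1;\n"
    "</script>"
)

script_body = find_script_containing(sample_html, "race_have_corner_handicap")
print(script_body is not None)
```

If this finds the block in the real response text, the request was fine and only the parsing step needs attention. A regular expression is a blunt tool for HTML in general, so this is best treated as a diagnostic rather than a replacement for BeautifulSoup.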
The page looks to be updated by a script after loading.
You can get around this by using Selenium instead of requests:
from selenium import webdriver
from bs4 import BeautifulSoup
import re
# The firefox_profile argument is deprecated in Selenium 4; set preferences on the options object instead
options = webdriver.FirefoxOptions()
options.set_preference("browser.privatebrowsing.autostart", True)
driver = webdriver.Firefox(options=options)
driver.get("https://www.scorebing.com/match_history/747514")
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
# Find the script tag that contains specific text:
data = soup.find('script', string=re.compile('race_have_corner_handicap'))
print(data)
Output
<script type="text/javascript">
var is_en = 1;
var kind=1,num=6,typenum=1;
var race=[],league_bgcolor=[],league_i= 1;
var race_have_corner_handicap=1;
var home_id = [];
var guest_id = [];
home_id.push(1405); guest_id.push(4503);
...
</script>
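Neither answer covers the second part of the question, turning the script text into usable data. One possible approach is sketched below, assuming the script keeps the layout shown in the question: each tmp_... variable is assigned a double-quoted string on its own line, and each race[n]=[...]; statement also sits on one line. The function name parse_races is hypothetical. The sketch substitutes the current tmp_... values into each array literal and evaluates the result with ast.literal_eval, which only accepts plain Python literals.

```python
import ast
import re

# tmp_host_name = "Molde";
ASSIGN_RE = re.compile(r'^\s*(tmp_\w+)\s*=\s*"(.*)";\s*$')
# race[3]=[ ... ];
RACE_RE = re.compile(r'^\s*race\[(\d+)\]\s*=\s*(\[.*\]);\s*$')
TMP_NAME_RE = re.compile(r'\btmp_\w+\b')


def parse_races(script_text):
    """Return one Python list per race[n]=[...] statement, in source order."""
    tmp_vars = {}
    races = []
    for line in script_text.splitlines():
        assign = ASSIGN_RE.match(line)
        if assign:
            # Remember the latest value of each tmp_... variable.
            tmp_vars[assign.group(1)] = assign.group(2)
            continue
        race = RACE_RE.match(line)
        if race:
            # Swap variable names for quoted values, then parse the literal.
            literal = TMP_NAME_RE.sub(
                lambda m: repr(tmp_vars[m.group(0)]), race.group(2)
            )
            races.append(ast.literal_eval(literal))
    return races


# Two entries copied from the snippet in the question.
sample = '''
tmp_host_name = "Mjondalen";
tmp_guest_name = "Kristiansund BK";
tmp_league_name = "Norway Tippeligaen";
race[0]=[746711,198,'20/06/29 12:01','903',1410,tmp_host_name,'957',1405,tmp_guest_name,'0.0','2.5','11 ',tmp_league_name,'2' ,'1','0','0','3',' 3','1','2','0.0','5.5','1.0',1];
tmp_host_name = "Molde";
tmp_guest_name = "Stabaek";
tmp_league_name = "Norway Tippeligaen";
race[1]=[746712,198,'20/06/29 12:00','661',4503,tmp_host_name,'1162',1396,tmp_guest_name,'-1.0','3.0','10 ',tmp_league_name,'2' ,'1','1','0','5',' 3','4','1','-0.5','4.5','1.0,1.5',1];
'''

races = parse_races(sample)
for row in races:
    # Indexes 5, 8 and 12 hold the host, guest and league names.
    print(row[5], "vs", row[8], "-", row[12])
```

With either answer, the text to pass in would be the .string of the script tag that soup.find returns. The meaning of the remaining fields is not documented in the page source shown here, so they would have to be matched by hand against the tables the site renders.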