
Scraping <script> tag placed below </html> tag with python and requests

I'm trying to scrape a page like this one

The site loads all of its information from the server and stores it in a JavaScript block, so that clicking a button displays one part of the data or another. I was trying to request the page once and read all of the data from that script. The page is structured roughly like this:

<!DOCTYPE html>
<html lang="en" xmlns:wb="http://open.weibo.com/wb">
<head>
    <meta charset="utf-8">
        <title>Historical Statistics of Kristiansund BK vs Molde on 2020/07/03 - ScoreBing</title>

#Several script tags over here....
</head>

<body class="vEn">

#Some stuff here...

#This is where the buttons that deploy the data are
<div class="panel-body">
     <div id="live-filter-bar">
         <div class="row MBTitle">
               <div class="small-6 columns PL0">
                    <a href="javascript:set_type(1);" id="tabtypeid1" class="button tiny radius MB0 MRMini VM font-bold">All</a>
                    <a href="javascript:set_type(2);" id="tabtypeid2" class="button tiny radius action MB0 MRMini VM">This League</a>
                    <a href="javascript:" onClick="select(1)" id="tabid1" class="button tiny radius MB0 MRMini VM">All</a>
                    <a href="javascript:" onClick="select(2)" id="tabid2" class="button tiny radius action MB0 MRMini VM">HA</a>
                    <a href="javascript:" onClick="select(3)" id="tabid3" class="button tiny radius action MB0 MRMini VM">AH</a>
                    <a href="javascript:" onClick="select(4)" id="tabid4" class="button tiny radius action MB0 MRMini VM">HH</a>
                    <a href="javascript:" onClick="select(5)" id="tabid5" class="button tiny radius action MB0 MRMini VM">AA</a>
                </div>
                <div class="small-6 columns text-right PR0">
                    <a href="javascript:set_num(10);" id=td10 class="button tiny radius action MB0 MRMini VM">Last 10</a>
                    <a href="javascript:set_num(8);" id=td8 class="button tiny radius action MB0 MRMini VM">Last 8</a>
                    <a href="javascript:set_num(6);" id=td6 class="button tiny radius MB0 MRMini VM">Last 6</a>
                    <a href="javascript:set_num(4);" id=td4 class="button tiny radius action MB0 MRMini VM">Last 4</a>
                </div>
         </div>
     </div>
     <div id="history_table">

     </div>
     <div id="history1">

     </div>
     <div id="history2">

     </div>
</div>

</body>
</html>
<script type="text/javascript">

var kind=1,num=6,typenum=1;
var race=[],league_bgcolor=[],league_i= 1;
var race_have_corner_handicap=1;
var home_id = [];
var guest_id = [];
home_id.push(1405);    guest_id.push(4503);
var sclass='',leaue_id=198;
var tongji_info=[];
var half_goal_av='-',goal_av='-',half_corner_av='-',corner_av='-';
var tmp_host_name,tmp_guest_name,tmp_league_name;
        tmp_host_name = "Mjondalen";
tmp_guest_name = "Kristiansund BK";
tmp_league_name = "Norway Tippeligaen";
race[0]=[746711,198,'20/06/29 12:01','903',1410,tmp_host_name,'957',1405,tmp_guest_name,'0.0','2.5','11 ',tmp_league_name,'2' ,'1','0','0','3',' 3','1','2','0.0','5.5','1.0',1];
    tmp_host_name = "Molde";
tmp_guest_name = "Stabaek";
tmp_league_name = "Norway Tippeligaen";
race[1]=[746712,198,'20/06/29 12:00','661',4503,tmp_host_name,'1162',1396,tmp_guest_name,'-1.0','3.0','10 ',tmp_league_name,'2' ,'1','1','0','5',' 3','4','1','-0.5','4.5','1.0,1.5',1];
    tmp_host_name = "Haugesund";
tmp_guest_name = "Kristiansund BK";
tmp_league_name = "Norway Tippeligaen";
race[2]=[746167,198,'20/06/25 12:00','673',1390,tmp_host_name,'957',1405,tmp_guest_name,'0.0,-0.5','2.5','10.5 ',tmp_league_name,'4' ,'1','0','1','8',' 3','2','2','0.0','5','1.0',1];
    tmp_host_name = "IK Start";
tmp_guest_name = "Molde";
tmp_league_name = "Norway Tippeligaen";
race[3]=[746169,198,'20/06/25 12:00','667',1392,tmp_host_name,'661',4503,tmp_guest_name,'+1.0','3.0','10 ',tmp_league_name,'4' ,'3','1','2','6',' 8','2','3','+0.5','4.5','1.0,1.5',1];
    tmp_host_name = "Kristiansund BK";
tmp_guest_name = "Aalesund";
tmp_league_name = "Norway Tippeligaen";
race[4]=[744697,198,'20/06/22 12:01','957',1405,tmp_host_name,'677',1321,tmp_guest_name,'0.0,-0.5','2.5','10.5 ',tmp_league_name,'4' ,'2','3','2','6',' 2','7','2','0.0','5','1.0',1];
    tmp_host_name = "Molde";
tmp_guest_name = "Rosenborg";
tmp_league_name = "Norway Tippeligaen";
race[5]=[744698,198,'20/06/21 02:30','661',4503,tmp_host_name,'1161',2482,tmp_guest_name,'0.0,-0.5','2.5','10.5 ',tmp_league_name,'4' ,'0','0','0','9',' 4','1','0','0.0','5','1.0',1];
    tmp_host_name = "Aalesund";
tmp_guest_name = "Molde";
tmp_league_name = "Norway Tippeligaen";
race[6]=[743531,198,'20/06/17 12:00','677',1321,tmp_host_name,'661',4503,tmp_guest_name,'0.0,+0.5','2.5,3.0','9.5 ',tmp_league_name,'8' ,'1','1','2','8',' 4','1','4','0.0,+0.5','4.5','1.0,1.5',1];
    tmp_host_name = "Rosenborg";
tmp_guest_name = "Kristiansund BK";
tmp_league_name = "Norway Tippeligaen";
race[7]=[743533,198,'20/06/17 12:00','1161',2482,tmp_host_name,'957',1405,tmp_guest_name,'-1.0','2.5,3.0','11.5 ',tmp_league_name,'7' ,'4','0','0','8',' 9','0','0','0.0,-0.5','5.5','1.0',1];

The script tag continues well beyond this snippet. I have two problems. First, when I call response = requests.get(url=url) and inspect response.content, the output seems to stop at the closing </html> tag, so the script tag with all the data is not included. How do I get it with requests?

Second, once I have the script, how do I extract the data from it?

The script is present in the response; it appears to be only a matter of which parser BeautifulSoup is told to use:

import re

import requests
from bs4 import BeautifulSoup

headers = {
    'authority': 'www.scorebing.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-US,en;q=0.9',
    
}

response = requests.get('https://www.scorebing.com/match_history/747514', headers=headers)


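# html.parser keeps tags that appear after the closing </html> tag.
# For byte input the BeautifulSoup argument is from_encoding, not encoding.
soup = BeautifulSoup(response.content, 'html.parser', from_encoding='utf-8')
# Find the script tag whose text mentions a variable unique to the data block
# (string= is the current name of the older text= argument; this needs the re module).
script = soup.find('script', string=__import__('re').compile('race_have_corner_handicap'))
print(script)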

Output

<script type="text/javascript">
    var is_en = 1;
    var kind=1,num=6,typenum=1;
    var race=[],league_bgcolor=[],league_i= 1;
    var race_have_corner_handicap=1;
    var home_id = [];
    var guest_id = [];
    home_id.push(1405);    guest_id.push(4503);
...
</script>
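
The claim that a trailing script tag survives parsing can be checked without any network access. BeautifulSoup's 'html.parser' option builds on the standard library's html.parser.HTMLParser, so the sketch below feeds that class a cut-down page whose script tag sits after </html>. The ScriptCollector class and the PAGE string are made up for this illustration; they are not part of the site or of the answer above.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collects the text of every <script> tag, wherever it appears."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_data(self, data):
        # Script bodies arrive here as plain text.
        if self._in_script:
            self.scripts[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False


# Hypothetical miniature of the page: the script comes after </html>.
PAGE = """<!DOCTYPE html>
<html lang="en">
<head><title>demo</title></head>
<body class="vEn"><div id="history_table"></div></body>
</html>
<script type="text/javascript">
var race_have_corner_handicap=1;
</script>"""

collector = ScriptCollector()
collector.feed(PAGE)
collector.close()

found = [s for s in collector.scripts if "race_have_corner_handicap" in s]
print(len(found), "matching script tag(s) found after </html>")
```

If the same check against the real response.text finds nothing, the data is probably not in the raw HTML at all, and a browser-driven approach would be the next thing to try.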

The page appears to be updated by a script after it loads.

You can work around this by using Selenium instead of requests:

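import re

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# Selenium 4 takes preferences through Options; the firefox_profile= keyword
# used by older releases was deprecated and later removed.
options = Options()
options.set_preference("browser.privatebrowsing.autostart", True)

driver = webdriver.Firefox(options=options)
try:
    driver.get("https://www.scorebing.com/match_history/747514")
    # Name the parser explicitly so BeautifulSoup does not have to guess one.
    soup = BeautifulSoup(driver.page_source, "html.parser")
finally:
    # Always close the browser, even if loading or parsing fails.
    driver.quit()

# Find the script tag that contains specific text:
data = soup.find('script', string=re.compile('race_have_corner_handicap'))
print(data)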

Output

<script type="text/javascript">
    var is_en = 1;
    var kind=1,num=6,typenum=1;
    var race=[],league_bgcolor=[],league_i= 1;
    var race_have_corner_handicap=1;
    var home_id = [];
    var guest_id = [];
    home_id.push(1405);    guest_id.push(4503);
...
</script>
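
Neither answer covers the second part of the question: turning the script text into data. The following is a minimal sketch, assuming the script keeps the layout shown in the question, that is, tmp_* variables assigned with double-quoted strings followed by race[n]=[...] lines containing only numbers, single-quoted strings and those variables. The function name parse_race_rows and the SAMPLE text are illustrative, and the meaning of most columns is not documented in the question, so rows are returned as plain lists.

```python
import ast
import re

# e.g.  tmp_host_name = "Molde";
ASSIGN_RE = re.compile(r'^(tmp_\w+)\s*=\s*"([^"]*)"\s*;?$')
# e.g.  race[3]=[ ... ];
RACE_RE = re.compile(r'^race\[\d+\]\s*=\s*\[(.*)\]\s*;?$')
# Bare tmp_* identifiers used inside a race row.
IDENT_RE = re.compile(r'\btmp_\w+\b')


def parse_race_rows(script_text):
    """Return every race[n] row as a Python list, in order of appearance."""
    names = {}  # current value of each tmp_* variable
    rows = []
    for raw_line in script_text.splitlines():
        line = raw_line.strip()
        assign = ASSIGN_RE.match(line)
        if assign:
            names[assign.group(1)] = assign.group(2)
            continue
        race = RACE_RE.match(line)
        if race:
            # Swap each variable for its quoted value so the row becomes a
            # literal that ast.literal_eval can read without executing code.
            body = IDENT_RE.sub(
                lambda m: repr(names.get(m.group(0), "")), race.group(1)
            )
            rows.append(ast.literal_eval("[" + body + "]"))
    return rows


# Two rows copied from the snippet in the question.
SAMPLE = '''
var tmp_host_name,tmp_guest_name,tmp_league_name;
        tmp_host_name = "Mjondalen";
tmp_guest_name = "Kristiansund BK";
tmp_league_name = "Norway Tippeligaen";
race[0]=[746711,198,'20/06/29 12:01','903',1410,tmp_host_name,'957',1405,tmp_guest_name,'0.0','2.5','11 ',tmp_league_name,'2' ,'1','0','0','3',' 3','1','2','0.0','5.5','1.0',1];
    tmp_host_name = "Molde";
tmp_guest_name = "Stabaek";
tmp_league_name = "Norway Tippeligaen";
race[1]=[746712,198,'20/06/29 12:00','661',4503,tmp_host_name,'1162',1396,tmp_guest_name,'-1.0','3.0','10 ',tmp_league_name,'2' ,'1','1','0','5',' 3','4','1','-0.5','4.5','1.0,1.5',1];
'''

rows = parse_race_rows(SAMPLE)
for row in rows:
    # Positions 2, 5 and 8 hold the date, host name and guest name here.
    print(row[2], row[5], "vs", row[8])
```

With either answer above, the input would be the text of the script tag that soup.find returned, for example data.string. If the real script contains JavaScript that is not also a valid Python literal, this approach would need adjusting.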
