
How to get the data from a script tag on a website using Python?

I am trying to pull the table data from this website: https://understat.com/league/EPL. When I viewed the page source, I found that the table is stored in a <script> tag. I want to know how to extract the data from that script in a usable format.

I tried using the solution from a similar question (How to Get Script Tag Variables From a Website using Python):

    import requests
    import bs4
    import json

    url = 'https://understat.com/league/EPL'
    r = requests.get(url)

    bs = bs4.BeautifulSoup(r.text, "html.parser")
    scripts = bs.find_all('script')

    for s in scripts:
        if 'var datesData' in s.text:
            script = s.text
            print(script)

However, nothing gets printed; that is, the check for 'var datesData' never matches any of the scripts. Yet when I simply print(scripts), I get:

[<script>
            var THEME = localStorage.getItem("theme") || 'DARK';
            document.body.className = "theme-" + THEME.toLowerCase();
        </script>,
 <script>
    var datesData   = JSON.parse('\x5B\x7B\x22id\x22\x3A\x2211643\x22,\x22isResult\x22\x3Atrue,\x22h\x22\x3A\x7B\x22id\x22\x3A\x2287\x22,\x22title\x22\x3A\x22Liverpool\x22,\x22short_title\x22\x3A\x22LIV\x22\x7D,\x22a\x22\x3A\x7B\x22id\x22\x3A\x2279\x22,\x22title\x22\x3A\x22Norwich\x22,\x22short_title\x22\x3A\x22NOR...


and so on
]

As you can see, the second element of the list contains 'var datesData', but my code won't print it.

What I want is to get that second script from the list and extract the data inside JSON.parse() so I can eventually create a dataframe. One option is to copy that entire line from the page source and pass it to json.loads(), like this:

js = json.loads('\x5B\x7B\x22id\x22\x3A\x2211643\x22,\x22isResult\x22\x3Atrue,\x22h\x22\...')

which gives me an output of:

[{'id': '11643',
  'isResult': True,
  'h': {'id': '87', 'title': 'Liverpool', 'short_title': 'LIV'},
  'a': {'id': '79', 'title': 'Norwich', 'short_title': 'NOR'},
  'goals': {'h': '4', 'a': '1'},
  'xG': {'h': '2.23456', 'a': '0.842407'},
  'datetime': '2019-08-09 20:00:00',
  'forecast': {'w': '0.7377', 'd': '0.1732', 'l': '0.0891'}},
 {'id': '11644',
  'isResult': True,
  'h': {'id': '81', 'title': 'West Ham', 'short_title': 'WHU'},
  'a': {'id': '88', 'title': 'Manchester City', 'short_title': 'MCI'},
  'goals': {'h': '0', 'a': '5'},
  'xG': {'h': '1.2003', 'a': '3.18377'},
  'datetime': '2019-08-10 12:30:00',
  'forecast': {'w': '0.0452', 'd': '0.1166', 'l': '0.8382'}},
 {'id': '11645',
  'isResult': True,
...

However, it would be better to fetch the data from the website directly, so that the code picks up the changes that WILL happen to the data later.

TLDR: I want to read the data stored in a script tag in a readable format using Python.

Perhaps something like

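import ast
import json
import re
from pprint import pprint

import requests

# Capture the argument of JSON.parse(...) in the `var datesData = ...` assignment.
pattern = re.compile(r'\bvar\s+datesData\s*=\s*JSON\.parse\((.+?)\)')

url = 'https://understat.com/league/EPL'

r = requests.get(url)
r.raise_for_status()  # fail early on an HTTP error
s = r.text
m = pattern.search(s)
if m is None:
    raise ValueError('datesData not found in the page source')
# The captured text is a quoted string literal full of \xNN escapes.
data = m.group(1)
# literal_eval decodes the escapes into JSON text; json.loads then parses it.
o = json.loads(ast.literal_eval(data))
pprint(o[:3])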

which gives me

[{'a': {'id': '79', 'short_title': 'NOR', 'title': 'Norwich'},
  'datetime': '2019-08-09 20:00:00',
  'forecast': {'d': '0.1732', 'l': '0.0891', 'w': '0.7377'},
  'goals': {'a': '1', 'h': '4'},
  'h': {'id': '87', 'short_title': 'LIV', 'title': 'Liverpool'},
  'id': '11643',
  'isResult': True,
  'xG': {'a': '0.842407', 'h': '2.23456'}},
 {'a': {'id': '88', 'short_title': 'MCI', 'title': 'Manchester City'},
  'datetime': '2019-08-10 12:30:00',
  'forecast': {'d': '0.1166', 'l': '0.8382', 'w': '0.0452'},
  'goals': {'a': '5', 'h': '0'},
  'h': {'id': '81', 'short_title': 'WHU', 'title': 'West Ham'},
  'id': '11644',
  'isResult': True,
  'xG': {'a': '3.18377', 'h': '1.2003'}},
 {'a': {'id': '238', 'short_title': 'SHE', 'title': 'Sheffield United'},
  'datetime': '2019-08-10 15:00:00',
  'forecast': {'d': '0.3923', 'l': '0.3994', 'w': '0.2083'},
  'goals': {'a': '1', 'h': '1'},
  'h': {'id': '73', 'short_title': 'BOU', 'title': 'Bournemouth'},
  'id': '11645',
  'isResult': True,
  'xG': {'a': '1.59864', 'h': '1.34099'}}]
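
For what it's worth, here is why this seems to work: the argument of JSON.parse() in the page source is a quoted string literal, and its `\xNN` escapes use the same syntax as Python string literals. So ast.literal_eval() can decode it into plain JSON text, which json.loads() then parses. Below is a minimal offline sketch of the same idea wrapped in a function. The names `extract_json_var` and `SAMPLE_HTML` are hypothetical, the sample is an abbreviated, hand-closed version of the first record from the question (not the real page), and the slightly stricter pattern assumes the argument is always a single-quoted string.

```python
import ast
import json
import re

# Abbreviated stand-in for the page source (an assumption, not the real page).
# The raw string keeps the backslashes literal, as they appear in the HTML.
SAMPLE_HTML = r"""<script>
    var datesData   = JSON.parse('\x5B\x7B\x22id\x22\x3A\x2211643\x22,\x22isResult\x22\x3Atrue,\x22h\x22\x3A\x7B\x22id\x22\x3A\x2287\x22,\x22title\x22\x3A\x22Liverpool\x22,\x22short_title\x22\x3A\x22LIV\x22\x7D\x7D\x5D');
</script>"""


def extract_json_var(html, name):
    """Return the parsed value of `var <name> = JSON.parse('...')`, or None."""
    pattern = re.compile(
        r"\bvar\s+" + re.escape(name) + r"\s*=\s*JSON\.parse\(\s*('.*?')\s*\)"
    )
    m = pattern.search(html)
    if m is None:
        return None  # variable not present in this page
    # Step 1: decode the quoted literal (\x5B -> '[', \x22 -> '"', ...).
    json_text = ast.literal_eval(m.group(1))
    # Step 2: parse the resulting JSON text into Python objects.
    return json.loads(json_text)


result = extract_json_var(SAMPLE_HTML, "datesData")
print(result)
```

With a live page you would pass `r.text` instead of `SAMPLE_HTML`. Returning None for a missing variable is just one design choice; raising an exception would work equally well.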

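Since the end goal is a dataframe, the nested records probably need flattening first. Here is a rough sketch using only the standard library, assuming the records keep the shape shown in the output above. `flatten` is a hypothetical helper, and the two sample records are copied from the question's output.

```python
def flatten(record, parent="", sep="_"):
    """Flatten nested dicts, e.g. {'h': {'title': 'X'}} -> {'h_title': 'X'}."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested dicts, prefixing keys with the parent name.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Two records copied from the question's output, standing in for the parsed list.
matches = [
    {'id': '11643', 'isResult': True,
     'h': {'id': '87', 'title': 'Liverpool', 'short_title': 'LIV'},
     'a': {'id': '79', 'title': 'Norwich', 'short_title': 'NOR'},
     'goals': {'h': '4', 'a': '1'},
     'xG': {'h': '2.23456', 'a': '0.842407'},
     'datetime': '2019-08-09 20:00:00',
     'forecast': {'w': '0.7377', 'd': '0.1732', 'l': '0.0891'}},
    {'id': '11644', 'isResult': True,
     'h': {'id': '81', 'title': 'West Ham', 'short_title': 'WHU'},
     'a': {'id': '88', 'title': 'Manchester City', 'short_title': 'MCI'},
     'goals': {'h': '0', 'a': '5'},
     'xG': {'h': '1.2003', 'a': '3.18377'},
     'datetime': '2019-08-10 12:30:00',
     'forecast': {'w': '0.0452', 'd': '0.1166', 'l': '0.8382'}},
]

# One flat dict per match; each key becomes a column name.
rows = [flatten(match) for match in matches]
print(sorted(rows[0]))
```

If pandas is available, `pandas.DataFrame(rows)` should accept this list of flat dicts, and `pandas.json_normalize(matches, sep='_')` does a similar flattening in one call. Note that the values are still strings, so numeric columns such as the goals and xG would need converting.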