I am trying to pull the table data from this website: https://understat.com/league/EPL. When I viewed the page source, I found that the table data is stored in a <script> tag. I want to know how to extract the data from that script in a usable format.
I tried using the solution from a similar question ( How to Get Script Tag Variables From a Website using Python ):
import requests
import bs4
import json

url = 'https://understat.com/league/EPL'
r = requests.get(url)
bs = bs4.BeautifulSoup(r.text, "html.parser")
scripts = bs.find_all('script')
for s in scripts:
    if 'var datesData' in s.text:
        script = s.text
        print(script)
However, nothing gets printed; that is, the code never finds 'var datesData' in any of the scripts. Yet when I just print(scripts), I get:
[<script>
var THEME = localStorage.getItem("theme") || 'DARK';
document.body.className = "theme-" + THEME.toLowerCase();
</script>,
<script>
var datesData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x2211643\x22,\x22isResult\x22\x3Atrue,\x22h\x22\x3A\x7B\x22id\x22\x3A\x2287\x22,\x22title\x22\x3A\x22Liverpool\x22,\x22short_title\x22\x3A\x22LIV\x22\x7D,\x22a\x22\x3A\x7B\x22id\x22\x3A\x2279\x22,\x22title\x22\x3A\x22Norwich\x22,\x22short_title\x22\x3A\x22NOR...
and so on
]
As you can see, the second element of the list contains 'var datesData', but my code won't print it.
What I want is to get that second script from the list and extract the data inside JSON.parse() so I can eventually create a dataframe. One option is to copy that entire line from the page source and pass it to json.loads(), like this:
js = json.loads('\x5B\x7B\x22id\x22\x3A\x2211643\x22,\x22isResult\x22\x3Atrue,\x22h\x22\...')
which gives me an output of:
[{'id': '11643',
'isResult': True,
'h': {'id': '87', 'title': 'Liverpool', 'short_title': 'LIV'},
'a': {'id': '79', 'title': 'Norwich', 'short_title': 'NOR'},
'goals': {'h': '4', 'a': '1'},
'xG': {'h': '2.23456', 'a': '0.842407'},
'datetime': '2019-08-09 20:00:00',
'forecast': {'w': '0.7377', 'd': '0.1732', 'l': '0.0891'}},
{'id': '11644',
'isResult': True,
'h': {'id': '81', 'title': 'West Ham', 'short_title': 'WHU'},
'a': {'id': '88', 'title': 'Manchester City', 'short_title': 'MCI'},
'goals': {'h': '0', 'a': '5'},
'xG': {'h': '1.2003', 'a': '3.18377'},
'datetime': '2019-08-10 12:30:00',
'forecast': {'w': '0.0452', 'd': '0.1166', 'l': '0.8382'}},
{'id': '11645',
'isResult': True,
...
However, it would be better to fetch the data from the website directly, so that I pick up the changes that WILL happen to the data later.
TLDR: I want to read the data stored in a script tag in a readable format using Python
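Perhaps something like this, which searches the raw page text for the argument of JSON.parse(...) with a regular expression and then decodes it: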
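import ast
import json
import re
from pprint import pprint

import requests

# Capture the quoted argument of: var datesData = JSON.parse('...')
pattern = re.compile(r'\bvar\s+datesData\s*=\s*JSON\.parse\((.+?)\)')

url = 'https://understat.com/league/EPL'
r = requests.get(url)
s = r.text

m = pattern.search(s)
if m is None:
    raise ValueError('var datesData not found in the page')
data = m.group(1)  # still a quoted literal full of \xNN escapes

# literal_eval decodes the \xNN escapes; json.loads then parses the JSON text
o = json.loads(ast.literal_eval(data))
pprint(o[:3])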
which gives me
[{'a': {'id': '79', 'short_title': 'NOR', 'title': 'Norwich'},
'datetime': '2019-08-09 20:00:00',
'forecast': {'d': '0.1732', 'l': '0.0891', 'w': '0.7377'},
'goals': {'a': '1', 'h': '4'},
'h': {'id': '87', 'short_title': 'LIV', 'title': 'Liverpool'},
'id': '11643',
'isResult': True,
'xG': {'a': '0.842407', 'h': '2.23456'}},
{'a': {'id': '88', 'short_title': 'MCI', 'title': 'Manchester City'},
'datetime': '2019-08-10 12:30:00',
'forecast': {'d': '0.1166', 'l': '0.8382', 'w': '0.0452'},
'goals': {'a': '5', 'h': '0'},
'h': {'id': '81', 'short_title': 'WHU', 'title': 'West Ham'},
'id': '11644',
'isResult': True,
'xG': {'a': '3.18377', 'h': '1.2003'}},
{'a': {'id': '238', 'short_title': 'SHE', 'title': 'Sheffield United'},
'datetime': '2019-08-10 15:00:00',
'forecast': {'d': '0.3923', 'l': '0.3994', 'w': '0.2083'},
'goals': {'a': '1', 'h': '1'},
'h': {'id': '73', 'short_title': 'BOU', 'title': 'Bournemouth'},
'id': '11645',
'isResult': True,
'xG': {'a': '1.59864', 'h': '1.34099'}}]
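Since the end goal is a dataframe, here is a rough sketch of how the same idea could be wrapped in a reusable helper and the nested records flattened into one flat dict per match. The names extract_json_var and flatten, and the trimmed SAMPLE_HTML string (cut down from the first record above so the example runs offline), are made up for illustration; they are not part of understat or of any library. It assumes the page keeps the `var NAME = JSON.parse('...')` layout and that the payload only uses escapes that Python string literals also understand.

```python
import ast
import json
import re

# Trimmed, hypothetical stand-in for the real page (first record, two fields only).
SAMPLE_HTML = r"""<script>
var datesData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x2211643\x22,\x22goals\x22\x3A\x7B\x22h\x22\x3A\x224\x22,\x22a\x22\x3A\x221\x22\x7D\x7D\x5D');
</script>"""


def extract_json_var(html, name):
    """Return the decoded JSON passed to `var <name> = JSON.parse('...')`."""
    pattern = re.compile(
        rf"\bvar\s+{re.escape(name)}\s*=\s*JSON\.parse\((.+?)\)"
    )
    match = pattern.search(html)
    if match is None:
        raise ValueError(f"var {name} not found in the page")
    # The captured group is a quoted literal with \xNN escapes;
    # literal_eval turns it into plain JSON text for json.loads.
    return json.loads(ast.literal_eval(match.group(1)))


def flatten(record, prefix=""):
    """Flatten nested dicts: {'goals': {'h': '4'}} -> {'goals_h': '4'}."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


rows = [flatten(match) for match in extract_json_var(SAMPLE_HTML, "datesData")]
print(rows)
```

With the live page you would pass r.text instead of SAMPLE_HTML. A list of flat dicts like rows is the kind of input pandas.DataFrame(rows) accepts, if pandas is what you use for the dataframe. Note that the values are still strings, so numeric columns such as the goals would need converting afterwards.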