
I can't get my scraped understat data into JSON

I am trying to scrape data from https://understat.com/league/EPL.

I've read what other people have done, but I just can't put the last piece of the puzzle together. I've managed to decode the string, but I can't get it into a JSON object. Does anyone have an idea?

import requests
import json
import pandas as pd
import time
import lxml.html as lh
import codecs
from bs4 import BeautifulSoup

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = "https://understat.com/league/EPL"
page = requests.get(url)
soup = BeautifulSoup(page.content,'html.parser')

scripts = soup.find_all('script')

for script in scripts:
    if 'var' in script.text:
        encoded_string = script.text
        encoded_string = encoded_string.split("JSON.parse('", 1)
        encoded_string = encoded_string.rsplit("'),", 1)[0]

        jsonStr = codecs.getdecoder('unicode-escape')(encoded_string)[0]
        jsonObj = json.loads(jsonStr)
        print(jsonObj)

    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 4 (char 4)

Here is a sample of the JSON string data:

{"id":"9197","isResult":true,"h":{"id":"89","title":"Manchester United","short_title":"MUN"},"a":{"id":"75","title":"Leicester","short_title":"LEI"},"goals":{"h":"2","a":"1"},"xG":{"h":"1.5137","a":"1.73813"},"datetime":"2018-08-10 22:00:00","forecast":{"w":"0.2812","d":"0.3275","l":"0.3913"}},{"id":"9198","isResult":true,"h":{"id":"86","title":"Newcastle United","short_title":"NEW"},"a":{"id":"82","title":"Tottenham","short_title":"TOT"},"goals":{"h":"1","a":"2"},"xG":{"h":"0.974497","a":"2.58097"},"datetime":"2018-08-11 14:30:00","forecast":{"w":"0.08","d":"0.1479","l":"0.7721"}},{"id":"9199","isResult":true,"h":{"id":"90","title":"Watford","short_title":"WAT"},"a":{"id":"220","title":"Brighton","short_title":"BRI"},"goals":{"h":"2","a":"0"},"xG":{"h":"1.42372","a":"0.45504"},"datetime":"2018-08-11 17:00:00","forecast":{"w":"0.6438","d":"0.2574","l":"0.0988"}},

Try the following, which uses a different regex and then takes a substring to drop the surrounding quotes:

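import requests
import re
import json
import codecs

r = requests.get('https://understat.com/league/EPL')
# Capture the argument of the first JSON.parse(...) call, quotes included
p = re.compile(r'JSON\.parse\((.*)\);')
d = p.findall(r.text)[0]
# Decode the \xNN escape sequences into plain characters
json_str = codecs.getdecoder('unicode-escape')(d)[0]
# Strip the surrounding single quotes before parsing
data = json.loads(json_str[1:-1])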
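
To see what each step of the snippet above does without hitting the network, here is a minimal sketch that runs the same three steps on a hand-made string. The `page_text` value is a made-up stand-in for `r.text`; it assumes the page embeds its data as `JSON.parse('...')` with `\xNN` escapes, which is what the snippet above relies on.

```python
import codecs
import json
import re

# Hypothetical stand-in for r.text: one script line in the assumed page format.
page_text = r"var datesData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x229197\x22\x2C\x22isResult\x22\x3Atrue\x7D\x5D');"

# Step 1: capture everything between "JSON.parse(" and ");", quotes included.
captured = re.findall(r"JSON\.parse\((.*)\);", page_text)[0]

# Step 2: turn the \xNN escapes back into the characters they stand for.
decoded = codecs.getdecoder("unicode-escape")(captured)[0]

# Step 3: the decoded text still carries the JavaScript single quotes,
# so drop the first and last character before parsing.
data = json.loads(decoded[1:-1])
print(data)
```

Skipping step 3 leaves a leading `'` in the string, and `json.loads` then fails with an "Expecting value" error because a JSON document cannot start with a single quote.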

[Image: sample of print(data) output]
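
Note that `findall(...)[0]` keeps only the first `JSON.parse(...)` match. If the page carries several such blocks, a variation like the following could collect all of them by variable name. This is only a sketch: the function name `extract_embedded_json`, the sample HTML, and the variable names in it are made up for illustration, and it assumes each block has the form `var name = JSON.parse('...')` on one line with quotes inside the payload escaped.

```python
import json
import re

# Assumed block format: var <name> = JSON.parse('<escaped JSON>')
PATTERN = re.compile(r"var\s+(\w+)\s*=\s*JSON\.parse\('(.*?)'\)")


def extract_embedded_json(page_text):
    """Return {variable name: parsed JSON} for each JSON.parse('...') block."""
    result = {}
    for name, payload in PATTERN.findall(page_text):
        # Decode \xNN escapes; assumes the payload is plain ASCII plus escapes.
        decoded = payload.encode("utf-8").decode("unicode_escape")
        result[name] = json.loads(decoded)
    return result


# Made-up sample page with two embedded blocks (not real site output).
sample_page = r"""
<script>var datesData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x229197\x22\x7D\x5D');</script>
<script>var teamsData = JSON.parse('\x7B\x2289\x22\x3A\x7B\x22title\x22\x3A\x22Manchester United\x22\x7D\x7D');</script>
"""

parsed = extract_embedded_json(sample_page)
for name, value in parsed.items():
    print(name, value)
```

Because the regex captures only the text between the quotes, no `[1:-1]` slicing is needed here. With a live page you would pass `requests.get(url).text` instead of `sample_page`.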
