Extract data from a JavaScript variable (JSON.parse) using Python
I'm very new to Python and trying to scrape a table from a website, but I think the table data comes from a JavaScript variable populated with JSON.parse. The parse is not in a form I am used to, and I am unsure how to use it in Python.
The code is from this website; specifically, it is var playersData = JSON.parse('\x5B\x7B\x22id\x3A,... (roughly 250,000 characters) nestled in a script tag.
So far I have managed to scrape the website using bs4, find the specific script, and use re.search to isolate just the JSON.parse call. The search returns <re.Match object; span=(2, 259126), match="var playersData\t= JSON.parse('\\x5B\\x7B\\x22id\\>
I would then like to export the data somewhere else after loading the parsed JSON.
Here is my code so far:
import requests
from bs4 import BeautifulSoup
import json
import re
response = requests.get('https://understat.com/league/EPL/2018')
soup = BeautifulSoup(response.text, 'lxml')
playerscript = soup.find_all('script')[3].string
m = re.search("var playersData = (.*)", playerscript)
Thanks for any help.
You don't need BeautifulSoup. In Python, json.loads is the equivalent of JSON.parse, and you need to convert the escaped string first, using .decode('string_escape') in Python 2 or bytes('....', 'utf-8').decode('unicode_escape') in Python 3.
import requests
import json
import re
response = requests.get('https://understat.com/league/EPL/2018')
# Capture the escaped string literal passed to JSON.parse('...')
playersData = re.search(r"playersData\s+=\s+JSON\.parse\('([^']+)", response.text)
# Python 2.7:
# decoded_string = playersData.group(1).decode('string_escape')
# Python 3: turn the \xNN escapes into the characters they stand for
decoded_string = bytes(playersData.group(1), 'utf-8').decode('unicode_escape')
playerObj = json.loads(decoded_string)
print(playerObj[0]['player_name'])
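Since the question also asks about exporting the data afterwards, here is a minimal offline sketch of the same decode step followed by a CSV export. The sample_html string is a made-up stand-in for the real page (only player_name is a key taken from the code above), and extract_players and players_to_csv are hypothetical helper names, not part of any library. One caveat: unicode_escape treats unescaped non-ASCII characters as Latin-1, so this assumes the embedded string is pure ASCII plus escapes.

```python
import csv
import io
import json
import re

# Hypothetical stand-in for the page source; the real script is far longer.
sample_html = r"""<script>
var playersData = JSON.parse('\x5B\x7B\x22id\x22\x3A\x221\x22,\x22player_name\x22\x3A\x22Example Player\x22\x7D\x5D');
</script>"""


def extract_players(page_source):
    """Find the JSON.parse('...') argument and decode it into Python objects."""
    match = re.search(r"playersData\s*=\s*JSON\.parse\('([^']+)'\)", page_source)
    if match is None:
        raise ValueError("playersData not found in page source")
    # Turn the literal \xNN sequences into the characters they represent.
    decoded = match.group(1).encode("utf-8").decode("unicode_escape")
    return json.loads(decoded)


def players_to_csv(players):
    """Serialize a list of flat dicts to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(players[0].keys()))
    writer.writeheader()
    writer.writerows(players)
    return buffer.getvalue()


players = extract_players(sample_html)
csv_text = players_to_csv(players)
print(players[0]["player_name"])
print(csv_text)
```

To write a file instead of a string, the same DictWriter can be pointed at open('players.csv', 'w', newline=''), where the filename is only an example.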