Extract data from javascript variable JSON.parse using python

I am very new to Python and am trying to scrape a table from a website, but the table data appears to come from a JavaScript variable populated by JSON.parse. The argument passed to the parse is not in a form I am used to, and I am unsure how to use it in Python.

The code is from this website; specifically, it is var playersData = JSON.parse('\\x5B\\x7B\\x22id\\x3A,... (roughly 250,000 characters) nestled in a script tag.

So far I have managed to scrape the website using bs4, find the specific script, and use re.search to try to isolate just the JSON.parse call. The search returns <re.Match object; span=(2, 259126), match="var playersData\\t= JSON.parse('\\\\x5B\\\\x7B\\\\x22id\\>.

After loading the parsed JSON, I would then like to export the data somewhere else.

Here is my code so far:

import requests
from bs4 import BeautifulSoup
import json
import re

response = requests.get('https://understat.com/league/EPL/2018')
soup = BeautifulSoup(response.text, 'lxml')

playerscript = soup.find_all('script')[3].string
m = re.search("var playersData  = (.*)", playerscript)

Thanks for any help.

You don't need BeautifulSoup. In Python, json.loads is the counterpart of JSON.parse, but you first need to decode the escaped string: use .decode('string_escape') in Python 2, or bytes('....', 'utf-8').decode('unicode_escape') in Python 3.
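
To see the decoding step in isolation, without fetching the page, here is a minimal sketch. The `escaped` value is a made-up short sample written in the same `\xNN` style as the page's JSON.parse argument, not real data from the site, and the names `escaped`, `decoded`, and `players` are only illustrative.

```python
import json

# Hypothetical short sample in the same escaped form as the page's
# JSON.parse argument (the real string is far longer).
escaped = r'\x5B\x7B\x22id\x22\x3A\x221\x22,\x22player_name\x22\x3A\x22Sample Player\x22\x7D\x5D'

# Turn the literal \xNN sequences into the characters they stand for.
decoded = bytes(escaped, 'utf-8').decode('unicode_escape')
print(decoded)

# The decoded text is ordinary JSON, so json.loads can parse it.
players = json.loads(decoded)
print(players[0]['player_name'])
```

The raw string literal (`r'...'`) matters here: it keeps the backslashes as literal characters, which is how they arrive in the downloaded page source.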

import json
import re

import requests

response = requests.get('https://understat.com/league/EPL/2018')

# Capture the escaped string passed to JSON.parse.
# The raw string (r"...") keeps the regex escapes such as \s intact.
playersData = re.search(r"playersData\s+=\s+JSON\.parse\('([^']+)", response.text)

# Python 2.7:
# decoded_string = playersData.group(1).decode('string_escape')
# Python 3: turn the literal \xNN sequences into real characters.
decoded_string = bytes(playersData.group(1), 'utf-8').decode('unicode_escape')
playerObj = json.loads(decoded_string)

print(playerObj[0]['player_name'])
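
For the export part of the question, one possible approach is to write the parsed list out as CSV. This is only a sketch: the `players` list below is a hypothetical stand-in for the result of json.loads, the helper name `players_to_csv` is made up for illustration, and it assumes every record is a flat dict with the same keys.

```python
import csv
import io

# Hypothetical stand-in for the list returned by json.loads.
players = [
    {'id': '1', 'player_name': 'Sample Player A'},
    {'id': '2', 'player_name': 'Sample Player B'},
]


def players_to_csv(rows, out):
    """Write a list of flat dicts to a file-like object as CSV."""
    # Assumption: all rows share the keys of the first row.
    fieldnames = list(rows[0].keys())
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Write into an in-memory buffer so the example has no side effects.
buffer = io.StringIO()
players_to_csv(players, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open('players.csv', 'w', newline='', encoding='utf-8')` in place of the buffer.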
