Extract JSON from HTML Script tag with BeautifulSoup in Python
I have the following HTML. How can I extract the JSON assigned to the variable window.__INITIAL_STATE__?
<!DOCTYPE doctype html>
<html lang="en">
<script>
window.sessConf = "-2912474957111138742";
/* <sl:translate_json> */
window.__INITIAL_STATE__ = { /* Target JSON here with 12 million characters */};
/* </sl:translate_json> */
</script>
</html>
You can use the following Python code to extract the JavaScript code.
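from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
s = soup.find('script')
# Wrap the page's script so that Node prints window.__INITIAL_STATE__ as JSON
js = 'window = {};\n' + s.text.strip() + ';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));'
with open('temp.js', 'w') as f:
    f.write(js)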
The JS code is written to the file "temp.js". You can then call node to execute the JS file.
from subprocess import check_output
window_init_state = check_output(['node','temp.js'])
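If overwriting a fixed temp.js in the working directory is a concern, the same step can be sketched with a temporary file. This is a minimal sketch rather than the answer's own code: the helper names build_wrapper and run_in_node are hypothetical, and it assumes a node executable is on PATH (it returns None when none is found).

```python
import os
import shutil
import subprocess
import tempfile


def build_wrapper(script_text):
    """Wrap the page's script so Node prints window.__INITIAL_STATE__ as JSON."""
    return (
        'window = {};\n'
        + script_text.strip()
        + ';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));'
    )


def run_in_node(script_text):
    """Run the wrapped script with node; return stdout bytes, or None if node is missing."""
    if shutil.which('node') is None:
        return None
    # delete=False lets us close the file before node opens it (required on Windows)
    with tempfile.NamedTemporaryFile('w', suffix='.js', delete=False,
                                     encoding='utf-8') as f:
        f.write(build_wrapper(script_text))
        path = f.name
    try:
        return subprocess.check_output(['node', path])
    finally:
        os.remove(path)  # clean up the temporary script in every case


# Hypothetical stand-in for soup.find('script').text
out = run_in_node('window.__INITIAL_STATE__ = {"Hello": "World"};')
print(out)
```

The temporary file is removed even if node exits with an error, in which case check_output raises CalledProcessError.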
The Python variable window_init_state contains the JSON string of the JS object window.__INITIAL_STATE__, which you can parse in Python with JSONDecoder (for example via json.loads). A complete example:
from subprocess import check_output
import json, bs4

html = '''<!DOCTYPE doctype html>
<html lang="en">
<script> window.sessConf = "-2912474957111138742";
/* <sl:translate_json> */
window.__INITIAL_STATE__ = { 'Hello':'World'};
/* </sl:translate_json> */
</script>
</html>'''

soup = bs4.BeautifulSoup(html, 'html.parser')
# Wrap the page's script so that Node prints window.__INITIAL_STATE__ as JSON
with open('temp.js', 'w') as f:
    f.write('window = {};\n' +
            soup.find('script').text.strip() +
            ';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));')
window_init_state = check_output(['node', 'temp.js'])
print(json.loads(window_init_state))
Output:
{'Hello': 'World'}
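When the object literal in the page happens to be strict JSON (double-quoted keys and strings, no comments, no trailing commas), Node may not be needed at all. The sketch below is a different technique from the answer above: it uses json.JSONDecoder.raw_decode to parse one JSON value starting at a given index. The sample script_text is a made-up stand-in for soup.find('script').text; the single-quoted example above would not qualify and still needs the Node route.

```python
import json

# Assumed stand-in for the text of the <script> tag; note the double quotes
script_text = '''window.sessConf = "-2912474957111138742";
/* <sl:translate_json> */
window.__INITIAL_STATE__ = {"Hello": "World", "items": [1, 2, 3]};
/* </sl:translate_json> */'''

marker = 'window.__INITIAL_STATE__'
# Locate the first "{" that follows the assignment target
start = script_text.index('{', script_text.index(marker))
# raw_decode parses a single JSON value beginning at `start`
# and returns it together with the index where the value ended
state, end = json.JSONDecoder().raw_decode(script_text, start)
print(state)
```

If the literal is not valid JSON, raw_decode raises json.JSONDecodeError, which is a reasonable signal to fall back to executing the script with Node.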
gdlmx's code is correct and very helpful.
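from subprocess import check_output
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
s = soup.find('script')
js = 'window = {};\n' + s.text.strip() + ';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));'
with open('temp.js', 'w') as f:  # write step from gdlmx's answer
    f.write(js)
window_init_state = check_output(['node', 'temp.js'])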
type(window_init_state) will be <class 'bytes'>, because check_output returns bytes. You should therefore decode it with the following code:
jsonData= window_init_state.decode("utf-8")
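To make the bytes-to-dict step concrete, here is a small sketch. The bytes literal is an assumed stand-in for what check_output(['node', 'temp.js']) would return for the earlier Hello/World example, so the snippet runs without Node.

```python
import json

# Stand-in for the bytes returned by check_output(['node', 'temp.js'])
window_init_state = b'{"Hello":"World"}'

json_data = window_init_state.decode("utf-8")  # bytes -> str
state = json.loads(json_data)                  # str -> dict

# Since Python 3.6, json.loads also accepts UTF-8 encoded bytes directly
assert json.loads(window_init_state) == state
print(type(window_init_state).__name__, type(json_data).__name__, state)
```

Decoding explicitly is still useful when the string itself is needed, for instance to write it to a .json file.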