
How to parse JavaScript code in HTML source in Python?

I am trying to scrape some data that sits inside a script tag in an HTML source.

The situation: I can get to the appropriate <script></script> tag. But inside that tag there is one big string, which needs to be converted and then parsed so I can get the precise data that I need.

The problem is: I don't know how to do that, and I can't find a clear and satisfying answer.

My goal is to get this data: "xe7fd4c285496ab91", which is the identification number of the content, also called "contentId".

Here is the code:

import requests
import bs4
import re

url = 'https://www.khanacademy.org/computing/computer-programming/programming/drawing-basics/pt/making-drawings-with-code'
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text,'html.parser') # by the way I am not sure if this is the right way to parse the link

item = soup.find(string=re.compile('contentId')) # with this line I can get directly to the exact javascript tag that I need

print(item) # but as you can see, it's a pretty big string, and I need to parse it to get the desired data. But you can find that the desired data "xe7fd4c285496ab91" is in it.

I tried to use json.parse(), but it is not working:

import json
jsonparsed=json.parse(item)

I get this error:

AttributeError: 'NavigableString' object has no attribute 'json'

My question is: how can I get the desired data? Is there a function to convert the string into JavaScript so I can parse it? Or a way to convert this string into JSON?

(Keep in mind that I will do this on multiple links with similar HTML/JavaScript.)

You could apply a regex to the response text directly, without searching for the script tag first:

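import re
import requests

url = 'https://www.khanacademy.org/computing/computer-programming/programming/drawing-basics/pt/making-drawings-with-code'
r = requests.get(url)

# Capture the value that follows contentId":"
p = re.compile(r'contentId":"((?:(?!").)*)')
m = p.search(r.text)
# search() returns None when the page has no contentId, so guard before reading the group
print(m.group(1) if m else 'contentId not found')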
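
Since the question mentions running this on several similar pages, the same pattern could be wrapped in a small helper. This is only a sketch: `extract_content_id` and the sample HTML string are hypothetical, and it assumes each page embeds the value as `"contentId":"..."` with no whitespace around the colon.

```python
import re

# Same pattern as in the answer: the literal contentId":" followed by
# a captured run of characters that stops before the next double quote.
CONTENT_ID_RE = re.compile(r'contentId":"((?:(?!").)*)')


def extract_content_id(html):
    """Return the first contentId found in html, or None if there is none."""
    match = CONTENT_ID_RE.search(html)
    return match.group(1) if match else None


# Hypothetical sample standing in for response.text
sample_html = '<script>var data = {"contentId":"xe7fd4c285496ab91","other":"value"};</script>'
print(extract_content_id(sample_html))
```

For real pages, the helper would be called with `requests.get(url).text` for each URL; a `None` result signals a page where the pattern did not match.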

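Regex explanation: the pattern matches the literal text contentId":" and then captures every character up to, but not including, the next double quote.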
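
If the script text contains a JSON object, another option is to decode it with Python's `json` module. Note that the Python function is `json.loads`; `JSON.parse` is the JavaScript equivalent. The sketch below assumes the value lives in a JSON object literal somewhere inside the script text; `find_json_with_key` and the sample string are hypothetical. It relies on `json.JSONDecoder.raw_decode`, which decodes one JSON value starting at a given index and ignores any text after it.

```python
import json


def find_json_with_key(text, key):
    """Try each '{' in text as the start of a JSON object and return the
    first decoded dict that has key at its top level, or None."""
    decoder = json.JSONDecoder()
    start = text.find('{')
    while start != -1:
        try:
            # raw_decode returns (object, index where the JSON value ended)
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None  # this '{' did not start valid JSON
        if isinstance(obj, dict) and key in obj:
            return obj
        start = text.find('{', start + 1)
    return None


# Hypothetical script content standing in for str(item)
script_text = 'window.data = {"contentId": "xe7fd4c285496ab91", "title": "demo"};'
data = find_json_with_key(script_text, 'contentId')
print(data['contentId'])
```

This is slower than the regex on large pages because it retries at every opening brace, but it returns the whole object, which helps when more than one field is needed.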
