How to parse JavaScript code in an HTML source in Python?

Question

I am trying to scrape some data that sits inside a <script> tag in an HTML source.

The situation: I can get to the appropriate <script></script> tag. But inside that tag there is a big string, which needs to be converted and then parsed so I can get the precise data that I need.

The problem is: I don't know how to do that, and I can't find a clear and satisfying answer.

My goal is to get this data: "xe7fd4c285496ab91", which is the identification number of the content, also called "contentId".

Here is the code:

import requests
import bs4
import re

url = 'https://www.khanacademy.org/computing/computer-programming/programming/drawing-basics/pt/making-drawings-with-code'
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text,'html.parser') # by the way I am not sure if this is the right way to parse the link

item = soup.find(string=re.compile('contentId')) # with this line I can get directly to the exact javascript tag that I need

print(item) # but as you can see, it's a pretty big string, and I need to parse it to get the desired data. But you can find that the desired data "xe7fd4c285496ab91" is in it.

I tried to use json.parse() but it is not working:

import json
jsonparsed=json.parse(item)

I get this error:

AttributeError: 'NavigableString' object has no attribute 'json'

My question is: How can I get the desired data? Is there a function to convert the string into JavaScript so I can parse it? Or a way to convert this string into a JSON file?

(Keep in mind that I will do this on multiple links with similar HTML/JavaScript.)

Answer 1

You could just stick with a regex on the response text alone, without searching for the script tag:

import re
import requests

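url = 'https://www.khanacademy.org/computing/computer-programming/programming/drawing-basics/pt/making-drawings-with-code'
r = requests.get(url)
# Capture everything after contentId":" up to (not including) the next double quote
p = re.compile(r'contentId":"((?:(?!").)*)')
i = p.findall(r.text)[0]  # first match; raises IndexError if the page has no contentId
print(i)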
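
Since the question mentions running this on multiple links, the same idea can be wrapped in a small helper. The following is only a sketch under assumptions: SAMPLE_HTML is a made-up stand-in for the real page source (whose exact shape is not shown in the question), and the names extract_content_id and CONTENT_ID_RE are hypothetical. It also uses json.loads, which is the decoding function Python's json module actually provides (there is no json.parse in Python), to decode the captured string literal.

```python
import json
import re

# Made-up stand-in for the page source; the real script content is much
# larger and its exact structure is an assumption.
SAMPLE_HTML = (
    '<script>window.__DATA__ = {"title":"Making drawings with code",'
    '"contentId":"xe7fd4c285496ab91","kind":"Article"};</script>'
)

# Capture the JSON string literal (quotes included) that follows "contentId":
# the inner group accepts ordinary characters and backslash escapes.
CONTENT_ID_RE = re.compile(r'"contentId"\s*:\s*("(?:[^"\\]|\\.)*")')


def extract_content_id(page_text):
    """Return the first contentId value found in page_text, or None."""
    match = CONTENT_ID_RE.search(page_text)
    if match is None:
        return None
    # The captured text is a JSON string literal, so json.loads
    # strips the quotes and resolves any escape sequences.
    return json.loads(match.group(1))


print(extract_content_id(SAMPLE_HTML))
```

For several pages, one could call extract_content_id(requests.get(url).text) for each URL and skip the ones that return None, rather than indexing [0] into an empty list.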

Answer 1: accepted, 2019-05-04 19:37:16
