How do I parse JavaScript code inside an HTML source in Python?

Question

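I am trying to scrape some data that sits inside a JavaScript tag in an HTML source.

The situation: I can find the corresponding <script></script> tag. But inside that tag there is one very large string, which has to be converted and then parsed before I can get the exact data I need.

The problem: I don't know how to do that, and I could not find a clear and satisfying answer.

Here is the code:

My goal is to get this value: "xe7fd4c285496ab91". It is the identifier of the content, also known as the "contentId".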

import requests
import bs4
import re

url = 'https://www.khanacademy.org/computing/computer-programming/programming/drawing-basics/pt/making-drawings-with-code'
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text,'html.parser') # by the way I am not sure if this is the right way to parse the link

item = soup.find(string=re.compile('contentId')) # with this line I can get directly to the exact javascript tag that I need

print(item) # but as you can see, it's a pretty big string, and I need to parse it to get the desired data. But you can find that the desired data "xe7fd4c285496ab91" is in it.

I tried using json.parse(), but it did not work:

import json
jsonparsed=json.parse(item)

I got this error:

AttributeError: 'NavigableString' object has no attribute 'json'

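My question is: how do I get the data I need? Is there a function that converts the string to JavaScript so that I can parse it? Or a way to convert this string into a JSON file?

(Keep in mind that I will be doing this on multiple links with similar HTML/JavaScript.)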

Answer 1

You can just run a regular expression over the response text, without searching for the script tag first:

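import re
import requests

r = requests.get('https://www.khanacademy.org/computing/computer-programming/programming/drawing-basics/pt/making-drawings-with-code')
# Capture everything between contentId":" and the next double quote
p = re.compile(r'contentId":"((?:(?!").)*)')
# findall returns every captured value; take the first one
i = p.findall(r.text)[0]
print(i)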
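
Since the question mentions running this over several pages and asks about JSON, here is a minimal sketch of the same regex idea wrapped in a helper. Note that Python's json module has no parse() function; json.loads() is the counterpart of JavaScript's JSON.parse(). The helper name extract_content_id, the sample_html string, and the assumption that the key appears as "contentId":"..." with plain double quotes are mine, not taken from the actual page, so adjust the pattern if the real markup differs.

```python
import json
import re

# Same idea as the pattern above, but the capture group keeps the surrounding
# quotes and tolerates backslash escapes inside the value.
CONTENT_ID_RE = re.compile(r'"contentId"\s*:\s*("(?:[^"\\]|\\.)*")')


def extract_content_id(html):
    """Return the first contentId found in html, or None if there is none."""
    match = CONTENT_ID_RE.search(html)
    if match is None:
        return None
    # The captured group is a JSON string literal (quotes included), so
    # json.loads turns it into a plain Python str and resolves any escapes.
    return json.loads(match.group(1))


# Hypothetical stand-in for response.text, so the sketch runs offline.
sample_html = (
    '<html><body><script>'
    'window.__DATA__ = {"kind":"Scratchpad","contentId":"xe7fd4c285496ab91"};'
    '</script></body></html>'
)

print(extract_content_id(sample_html))
print(extract_content_id('<html><body>no data here</body></html>'))
```

To use it on real pages, pass requests.get(url).text for each URL. Returning None instead of indexing with [0] means a page without a contentId does not abort the whole loop with an IndexError.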

Solution 1
1 Accepted 2019-05-04 19:37:16
