How to parse JavaScript code in HTML source in Python?
I am trying to scrape some data from inside a JavaScript tag in an HTML source.
The situation: I can get to the appropriate <script></script> tag. But inside that tag there is a big string, which needs to be converted and then parsed so I can get the precise data that I need.
The problem is: I don't know how to do that, and I can't find a clear and satisfying answer.
My goal is to get this data: "xe7fd4c285496ab91", which is the identification number of the content, also called "contentId".
Here is the code:
import requests
import bs4
import re
url = 'https://www.khanacademy.org/computing/computer-programming/programming/drawing-basics/pt/making-drawings-with-code'
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text,'html.parser') # by the way I am not sure if this is the right way to parse the link
item = soup.find(string=re.compile('contentId')) # with this line I can get directly to the exact javascript tag that I need
print(item) # but as you can see, it's a pretty big string, and I need to parse it to get the desired data. But you can find that the desired data "xe7fd4c285496ab91" is in it.
I tried to use json.parse(), but it is not working:
import json
jsonparsed=json.parse(item)
I get this error:
AttributeError: 'NavigableString' object has no attribute 'json'
My question is: How can I get the desired data? Is there a function to convert the string into JavaScript so I can parse it? Or a way to convert this string into a JSON file?
(Keep in mind that I will do this on multiple links with similar HTML/JavaScript.)
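On the JSON route: Python's json module has no parse() function; the counterpart of JavaScript's JSON.parse() is json.loads(), which expects a string containing only JSON. Script text usually wraps the JSON in JavaScript, so one possible approach is to decode from each "{" with json.JSONDecoder.raw_decode() and search the result for the key. The following is a minimal sketch under the assumption that the script embeds a plain JSON object literal; the sample string and the helper names find_key and extract_json_value are hypothetical, not taken from the page.

```python
import json


def find_key(node, key):
    """Return the first value stored under `key` anywhere in parsed JSON."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def extract_json_value(script_text, key):
    """Try to decode a JSON object at each '{' and look for `key` in it."""
    decoder = json.JSONDecoder()
    pos = script_text.find('{')
    while pos != -1:
        try:
            # raw_decode parses one JSON value and ignores trailing text
            obj, _ = decoder.raw_decode(script_text, pos)
        except json.JSONDecodeError:
            pass  # not valid JSON at this position, try the next brace
        else:
            found = find_key(obj, key)
            if found is not None:
                return found
        pos = script_text.find('{', pos + 1)
    return None


# Made-up script text standing in for str(item) from the question
sample = 'window.__DATA__ = {"page": {"contentId": "xe7fd4c285496ab91"}};'
print(extract_json_value(sample, "contentId"))
```

With the code from the question, this would be called as extract_json_value(str(item), "contentId"). If the embedded data is a JavaScript object literal rather than strict JSON (unquoted keys, single quotes), decoding will fail and the function returns None.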
You could just stick with a regex on the response text alone, without searching for the script tag first:
import re
import requests
r = requests.get('https://www.khanacademy.org/computing/computer-programming/programming/drawing-basics/pt/making-drawings-with-code')
p = re.compile(r'contentId":"((?:(?!").)*)')
i = p.findall(r.text)[0]
print(i)
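Since the same extraction has to run on several pages, the regex can be wrapped in a small function that returns None instead of raising IndexError when a page has no match. This is a sketch only: it uses a slightly different pattern from the one above (a negated character class, with optional whitespace around the colon), the helper name extract_content_id is hypothetical, and the sample HTML is made up so the example runs offline.

```python
import re

# Capture everything between the quotes that follow "contentId":
CONTENT_ID_RE = re.compile(r'"contentId"\s*:\s*"([^"]*)"')


def extract_content_id(html):
    """Return the first contentId found in the page text, or None."""
    match = CONTENT_ID_RE.search(html)
    return match.group(1) if match else None


# Made-up page fragment standing in for r.text
sample_html = '<script>var data = {"contentId":"xe7fd4c285496ab91","kind":"x"};</script>'
print(extract_content_id(sample_html))
print(extract_content_id("<p>no id here</p>"))
```

For real pages, the function would be called on requests.get(url).text for each URL in the list.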