How to extract a specific script element from HTML using Beautiful Soup
I am using BS4 to extract info from a football stats page. Here is how I have started:
from bs4 import BeautifulSoup as bs
import requests
res = requests.get(url)
soup = bs(res.content, 'lxml')
scripts = soup.find_all('script')
scripts = [script for script in scripts]
This successfully returns all script elements as a list.
I need to extract a specific script element, specifically one which begins as follows:
<script>
var teamsData = JSON.parse('\x7B\x2271\x22\x3A\x7B\x22id\x22\x3A\x2271\x22,\x22title\x22\x3A\x22Aston\x20Villa\x22,\x22history\x22\x3A\x5B\x5D\x7D,\x2272\x22\x3A\x7B\x22id\x22\x3A\x2272\x22...
</script>
I have tried various iterations of the following code, but the output always prints as blank:
for script in scripts:
    if 'teamsData' in script.text:
        print(script)
I could always resort to simply using 'print(scripts[2])', but I wanted to know why my initial efforts failed.
Thanks!
Apparently, .text can come back as an empty string for script tags (this depends on the Beautiful Soup version). You can, however, get the contents of the tag from .children:
from bs4 import BeautifulSoup
from io import StringIO
html = """
<script>
let a = "Hello";
</script>
"""
b = StringIO(html)
soup = BeautifulSoup(b, 'lxml')
for e in soup.find_all('script'):
    print(repr(e.text))
    print(repr(''.join(e.children)))
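Applied to the loop in the question, the check could be made against the tag's content instead of .text. The following is only a sketch: the sample HTML and the helper name find_script_containing are made up for illustration, and it relies on .string, which gives the tag's single child string, or None when the tag has no content (for example a script loaded via src).

```python
from bs4 import BeautifulSoup

# Hypothetical sample page with three script tags; only the last one is wanted.
html = """
<script src="app.js"></script>
<script>var other = 1;</script>
<script>var teamsData = JSON.parse('{}');</script>
"""

soup = BeautifulSoup(html, "html.parser")


def find_script_containing(soup, needle):
    """Return the first <script> tag whose content contains needle, else None."""
    for script in soup.find_all("script"):
        content = script.string  # None for empty tags such as <script src=...>
        if content and needle in content:
            return script
    return None


target = find_script_containing(soup, "teamsData")
print(target.string)
```

Guarding against None matters here, because testing 'teamsData' in None would raise a TypeError on the first empty script tag.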
You can use .string to access the <script> string:
import re
import json
from bs4 import BeautifulSoup
html_doc = '''<script>
var teamsData = JSON.parse('\x7B\x2271\x22\x3A\x7B\x22id\x22\x3A\x2271\x22,\x22title\x22\x3A\x22Aston\x20Villa\x22,\x22history\x22\x3A\x5B\x5D\x7D,\x2272\x22\x3A\x7B\x22id\x22\x3A\x2272\x22\x7D\x7D');
</script>'''
soup = BeautifulSoup(html_doc, 'html.parser')
script_string = soup.find('script').string
print(script_string)
Prints:
var teamsData = JSON.parse('{"71":{"id":"71","title":"Aston Villa","history":[]},"72":{"id":"72"}}');
To parse the JSON data, you can use the re / json modules. For example:
data = re.search(r"JSON\.parse\('(.*?)'\);", script_string).group(1)
data = json.loads(data)
for k, v in data.items():
    print(k, v)
Prints:
71 {'id': '71', 'title': 'Aston Villa', 'history': []}
72 {'id': '72'}
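One caveat: in the example above the \x7B-style escapes sit inside an ordinary Python string literal, so Python itself turns them into {, " and so on before Beautiful Soup sees them. If the fetched page contains those sequences literally, as the snippet in the question suggests, json.loads would likely fail on the extracted text until the escapes are decoded. Below is a minimal sketch of one way to handle that, assuming the escapes are all of the two-digit \xNN form; the raw string stands in for the script content and is shortened for illustration.

```python
import json
import re

# Raw string: the backslashes stay literal, as they would in fetched page source.
script_string = (
    r"var teamsData = JSON.parse('\x7B\x2271\x22\x3A\x7B\x22id\x22\x3A\x2271\x22,"
    r"\x22title\x22\x3A\x22Aston\x20Villa\x22\x7D\x7D');"
)

# Pull out the argument passed to JSON.parse('...').
match = re.search(r"JSON\.parse\('(.*?)'\)", script_string)
escaped = match.group(1)

# Replace each \xNN sequence with the character it encodes.
decoded = re.sub(
    r"\\x([0-9A-Fa-f]{2})",
    lambda m: chr(int(m.group(1), 16)),
    escaped,
)

data = json.loads(decoded)
print(data)
```

Substituting only the \xNN sequences leaves any other characters in the string untouched, which is why this sketch uses re.sub rather than a blanket escape-decoding step.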