Extract array values from JavaScript with Beautiful Soup
I'm trying to build a scraper in Python that gets a variable from JavaScript code within the HTML of a webpage. This variable changes over time. Here is the JavaScript code; I need the first number of the yValues variable:
jQuery(document).ready(function() {
    var draw = true;
    if ("Biblioteca di Ingegneria" == "") {
        draw = false;
    }
    if (draw) {
        var yValues = [
            "28",
            "100"
        ];
        var Titolo = "Biblioteca di Ingegneria";
        var sottoTitolo = "Posti Totali: 128";
        var barColors = [
            "#167d21",
            "#ed2135"
        ];
        var xValues = [
            "Liberi (28)",
            "Occupati (100)"
        ];
        new Chart("InOutChart", {
            type: "pie",
            data: {
                labels: xValues,
                datasets: [
                    {
                        backgroundColor: barColors,
                        data: yValues
                    }
                ]
            },
            options: {
                plugins: {
                    title: {
                        display: true,
                        text: Titolo,
                        font: {
                            size: 25,
                            style: 'normal',
                            lineHeight: 1.2
                        },
                        // padding: {
                        //     top: 10,
                        //     bottom: 30
                        // }
                    },
                    subtitle: {
                        display: true,
                        text: sottoTitolo,
                        font: {
                            size: 20,
                            style: 'normal',
                            lineHeight: 1.2
                        },
                        padding: {
                            bottom: 30
                        }
                    },
                    legend: {
                        display: true,
                        position: "bottom",
                        labels: {
                            font: {
                                size: 20,
                                style: 'normal',
                                lineHeight: 1.2
                            }
                        }
                    }
                },
                responsive: true,
                maintainAspectRatio: false,
                scales: {
                    yAxes: [
                        {
                            display: true,
                            ticks: {
                                beginAtZero: true
                            }
                        }
                    ]
                }
            }
        });
    }
});
This is the best I could do:
from bs4 import BeautifulSoup
import requests
# Make a GET request to the URL of the web page.
base_url = 'https://qrbiblio.unipi.it/Home/Chart?IdCat=a96d84ba-46e8-47a1-b947-ab98a8746d6f'
response = requests.get(base_url)
# Parse the HTML content of the page.
soup = BeautifulSoup(response.text, "html.parser")
# Find all the `<script>` elements on the page.
scripts = soup.find_all("script")
# Get the 8th `<script>` element.
script8 = scripts[7]
# Transform the 8th `<script>` into a string.
script8_txt = "".join(script8)
# Get the useful string from the 8th `<script>`.
usefull_txt = script8_txt[248:251]
# Get the int from the string.
pl = int("".join(filter(str.isdigit, usefull_txt)))
print(pl)
This works, but I want to automatically parse the JavaScript code to find the variable and get its value, because as you can see I manually checked the position of the characters that I needed. I'm looking for a better solution because I'm planning to use this code for other similar webpages, but the position of the variable changes every time. One last point: I want to put this Python code in an Alexa skill, so I don't know if the Selenium package will work well.
Try this:
import ast
import requests
from bs4 import BeautifulSoup
base_url = 'https://qrbiblio.unipi.it/Home/Chart?IdCat=a96d84ba-46e8-47a1-b947-ab98a8746d6f'
response = requests.get(base_url)
script = (
    BeautifulSoup(response.text, "html.parser")
    .find_all("script")[7]
    .string
)
numbers = ast.literal_eval(
    script.strip().split("var yValues = ")[1].split(";")[0]
)
print(numbers)
print(numbers[0])
Output:
['130', '0']
130
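The snippet above still depends on the target being the 8th `<script>` element, which may not hold on the other pages mentioned in the question. A possible variation, sketched here with only the standard library, searches the whole HTML text for the `var yValues = [...]` assignment with a regular expression instead of relying on a script index. The helper name `extract_js_array` and the `sample_html` string are hypothetical and stand in for `response.text`; the sketch also assumes the array is flat, uses double-quoted strings, and has no trailing comma, so that it is valid JSON.

```python
import json
import re


def extract_js_array(html, var_name):
    """Return the flat JS array assigned to `var_name` as a Python list.

    Hypothetical helper: assumes a `var <name> = [...];` assignment whose
    array literal is also valid JSON (double quotes, no trailing comma).
    """
    pattern = re.compile(
        r"var\s+" + re.escape(var_name) + r"\s*=\s*(\[.*?\])\s*;",
        re.DOTALL,  # let the array span several lines
    )
    match = pattern.search(html)
    if match is None:
        raise ValueError(f"variable {var_name!r} not found")
    return json.loads(match.group(1))


# Stand-in for response.text, trimmed from the JavaScript in the question.
sample_html = """
<script>
jQuery(document).ready(function() {
    var yValues = [
        "28",
        "100"
    ];
    var xValues = [
        "Liberi (28)",
        "Occupati (100)"
    ];
});
</script>
"""

y_values = extract_js_array(sample_html, "yValues")
free_seats = int(y_values[0])
print(y_values)
print(free_seats)
```

With the sample string this prints `['28', '100']` and then `28`. Because nothing here needs a browser, it should also avoid the Selenium dependency that the question was concerned about for an Alexa skill.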