Extract array values from JavaScript with Beautiful Soup

I'm trying to build a scraper in Python that gets a variable from JavaScript code embedded in the HTML of a webpage. This variable changes over time. Here is the JavaScript code; I need the first number of the yValues variable:

jQuery(document).ready(function() {
  var draw = true;
  
  if ("Biblioteca di Ingegneria" == "") {
    draw = false;
  }
  
  if (draw) {
    var yValues = [
        "28",
        "100"
      ];
    var Titolo = "Biblioteca di Ingegneria";
    var sottoTitolo = "Posti Totali: 128";
    var barColors = [
        "#167d21",
        "#ed2135"
      ];
    var xValues = [
        "Liberi (28)",
        "Occupati (100)"
      ];
    
    new Chart("InOutChart", {
      type: "pie",
      data: {
        labels: xValues,
        datasets: [
          {
            backgroundColor: barColors,
            data: yValues
          }
        ]
      },
      options: {
        plugins: {
          title: {
            display: true,
            text: Titolo,
            font: {
              size: 25,
              style: 'normal',
              lineHeight: 1.2
            },
            // padding: {
            //   top: 10,
            //   bottom: 30
            // }
          },
          subtitle: {
            display: true,
            text: sottoTitolo,
            font: {
              size: 20,
              style: 'normal',
              lineHeight: 1.2
            },
            padding: {
              bottom: 30
            }
          },
          legend: {
            display: true,
            position: "bottom",
            labels: {
              font: {
                size: 20,
                style: 'normal',
                lineHeight: 1.2
              }
            }
          }
        },
        responsive: true,
        maintainAspectRatio: false,
        scales: {
          yAxes: [
            {
              display: true,
              ticks: {
                beginAtZero: true
              }
            }
          ]
        }
      }
    });
  }
});

This is the best I could do:

from bs4 import BeautifulSoup
import requests

# Make a GET request to the URL of the web page.
base_url = 'https://qrbiblio.unipi.it/Home/Chart?IdCat=a96d84ba-46e8-47a1-b947-ab98a8746d6f'
response = requests.get(base_url)

# Parse the HTML content of the page.
soup = BeautifulSoup(response.text, "html.parser")

# Find all the `<script>` elements on the page.
scripts = soup.find_all("script")

# Get the 8th `<script>` element.
script8 = scripts[7]

# Transform the 8th `<script>` into a string.
script8_txt = "".join(script8)

# Get the useful string from the 8th `<script>`.
useful_txt = script8_txt[248:251]

# Get the int from the string.
pl = int("".join(filter(str.isdigit, useful_txt)))

print(pl)

This works, but I want to parse the JavaScript code automatically to find the variable and get its value, because, as you can see, I manually checked the position of the characters I needed. I'm looking for a better solution because I plan to use this code for other similar webpages, where the position of the variable is different each time. One last detail: I want to put this Python code in an Alexa skill, so I don't know whether the Selenium package will work well.

Try this:

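import ast

import requests
from bs4 import BeautifulSoup

base_url = 'https://qrbiblio.unipi.it/Home/Chart?IdCat=a96d84ba-46e8-47a1-b947-ab98a8746d6f'
response = requests.get(base_url)

# Take the text of the 8th <script> element, which holds the chart code.
script = (
    BeautifulSoup(response.text, "html.parser")
    .find_all("script")[7]
    .string
)
# Cut out the array literal between "var yValues = " and the next ";",
# then evaluate it safely as a Python list of strings.
numbers = ast.literal_eval(
    script.strip().split("var yValues = ")[1].split(";")[0]
)
print(numbers)
print(numbers[0])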

Output:

['130', '0']
130
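
Since the question mentions reusing this on similar pages where the position of the variable changes, here is a minimal sketch that searches for the variable by name instead of relying on a script index or character offsets. The helper name extract_js_array and the sample string are illustrative and not part of the original page. It assumes the array is flat, contains no "]" inside its strings, and is written in JSON-compatible form (double-quoted strings, no trailing comma), as in the snippet from the question.

```python
import json
import re


def extract_js_array(source, name):
    """Return the array literal assigned to `var <name>` in `source` as a list.

    Hypothetical helper: assumes a flat, JSON-compatible array literal.
    """
    pattern = re.compile(
        r"var\s+" + re.escape(name) + r"\s*=\s*(\[.*?\])",
        re.DOTALL,  # let "." span the line breaks inside the array
    )
    match = pattern.search(source)
    if match is None:
        raise ValueError(f"variable {name!r} not found")
    return json.loads(match.group(1))


# Shortened copy of the JavaScript from the question, used as sample input.
sample = """
jQuery(document).ready(function() {
  if (draw) {
    var yValues = [
        "28",
        "100"
      ];
    var xValues = [
        "Liberi (28)",
        "Occupati (100)"
      ];
  }
});
"""

values = extract_js_array(sample, "yValues")
free_seats = int(values[0])
print(values)
print(free_seats)
```

In the real scraper, the first argument could be response.text or the .string of a single script element; because the search is by variable name, the index of the script element no longer matters.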
