
python: can't get specific data from html using BeautifulSoup

I am trying to use BeautifulSoup and urllib to pull the given percentage from a particular webpage: https://app.safespace.io/api/display/live-occupancy/86fb9e11?view=percent. I am very new to stuff like this. Here is my spaghetti code:

import urllib.request

from bs4 import BeautifulSoup

contentSource = urllib.request.urlopen('https://app.safespace.io/api/display/live-occupancy/86fb9e11?view=percent')
read_content = contentSource.read()

soup = BeautifulSoup(read_content, 'html.parser')

try1 = soup.find("span", {"id": "occupancyPct"})

print(try1)

On the original webpage, when I "inspect element" the percentage, it actually shows up in the HTML.

However, my code's printed output is <span class="text-xl" id="occupancyPct" style="margin-bottom: auto;"></span>

Note that my code's output does NOT show the percentage, unlike the actual page's HTML. What am I doing wrong?

I will also accept "You are stupid because X, and you should do Y instead", or some variation of that.

The problem is that the percentage isn't a static field; it is generated with JavaScript. As far as I know, this type of web scraping only retrieves the source code as the server sends it, before any JavaScript has run, so the field stays blank. Instead of the Chrome inspector, look at the raw page source: the field is unfortunately empty there.

Here is the JavaScript code that fills the percentage field:

var setters = {
          bgClass: (function() {
            var bgSetter = getNumericClassSetter(elms.bg);

            return function(occupants) {
              // get percent, floored to the nearest 5 percent
              var pct = Number(`${ Number(`${occupants}e+2`) / Number(`${maxCapacity}e+2`) }e+2`);
              var floor = pct - (pct % 5);

              if (floor >= 100) {
                bgSetter(105);
              }
              else {
                bgSetter(floor);
              }
            };
          })(),
          occupancyPct: (function(occupants) {
            elms.occupancyPct.innerText = Math.min(100, Math.floor((occupants / maxCapacity) * 100)) + '%';
          }),
        };

As far as I can see, the percentage is calculated from the given variables. Could a solution be to calculate the percentage in your own code instead?
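If the two inputs can be obtained some other way, the calculation itself is easy to reproduce. The following is a minimal sketch that ports only the `occupancyPct` setter from the JavaScript above; the function name `occupancy_pct` and the sample numbers are illustrative assumptions, and the source does not show where `occupants` and `maxCapacity` actually come from.

```python
import math


def occupancy_pct(occupants: int, max_capacity: int) -> str:
    """Mirror the page's occupancyPct setter: floor to a whole percent, cap at 100."""
    pct = math.floor((occupants / max_capacity) * 100)
    return f"{min(100, pct)}%"


# Hypothetical values, purely for illustration
print(occupancy_pct(10, 40))  # 25%
print(occupancy_pct(60, 50))  # 100% (capped, like Math.min(100, ...))
```

Getting real values for the two arguments would still require finding out how the page receives them, which this sketch does not attempt.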

It is not showing the percentage because it is calculated later through JavaScript. The HTML you are getting is the initial one, which does not contain the percentage yet.
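This can be reproduced without any network access. In the sketch below, `RAW_HTML` is a hypothetical stand-in for what the server sends: only the `span` is taken from the question's printed output, and the `script` tag is a placeholder. BeautifulSoup parses the markup but never runs scripts, so the span stays empty.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the initial server response (not the real page)
RAW_HTML = """
<html><body>
<span class="text-xl" id="occupancyPct" style="margin-bottom: auto;"></span>
<script>/* in a browser, the page's JavaScript would fill the span */</script>
</body></html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")
span = soup.find("span", {"id": "occupancyPct"})

# The element is found, but it has no text because no script was executed
print(repr(span.get_text()))  # ''
```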

The answer is simple: you must use Selenium.

Why?

You need a browser so that the JavaScript code is executed. The percentage you are looking for will then be in the page source, and all you have to do is find a way to extract it.

The page is loaded dynamically, so a plain HTTP client such as requests (or urllib) will not render it. We can use Selenium as an alternative to scrape the page.

Install it with: pip install selenium

Download the ChromeDriver that matches your Chrome version.

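from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

URL = "https://app.safespace.io/api/display/live-occupancy/86fb9e11?view=percent"

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(r"c:\path\to\chromedriver.exe"))
driver.get(URL)
# Give the page's JavaScript time to fill in the percentage
sleep(5)

# Parse the rendered HTML rather than the raw server response
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.find("span", {"id": "occupancyPct"}).text)

driver.quit()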

