Unable to extract text from a specific span tag using Python's Beautiful Soup

Question

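I am currently scraping this website to build a dataset of cars, and I have worked out a formula for looping through every page of the site while scraping. However, I cannot extract the text I need to make this work.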

The snippet below is the tag I am trying to scrape. I need to get the number of vehicles listed on the site.

<span class="d-none d-sm-inline">166 Vehicles</span>

[Image: the website element I am trying to scrape]

Below is the code I am using to scrape that element:

# Packages
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests
    
print("Started web scrape...")
    
limit = 10
start = 0 #increment by limit
website = requests.get(f'https://www.sosubaru.com/new-inventory/index.htm?start={start}')
soup = BeautifulSoup(website.text, 'html.parser')
    
inventory_count = soup.select("span.d-none.d-sm-inline")[0].string
    
print(inventory_count)

This code returns the following:

Started OR_GP_Roe_Motors web scrape...
Traceback (most recent call last):
  File "c:/mypath...", line 16, in <module>
    inventory_count = soup.select("span.d-none.d-sm-inline")[0].string
IndexError: list index out of range

I then checked why I was getting that error by printing everything that soup.select returns:

inventory_count = soup.select("span.d-none.d-sm-inline")
print(inventory_count)

This returns:

Started web scrape...
[]

Why is it giving me an empty list?

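I then had it print every span tag on the page to see whether the one I want exists. It printed many span tags, but not the one I am looking for. Why can't I find it with Beautiful Soup? Is it the parser I am using? I tried "lxml" as the parser, but that did not change anything. Does this have something to do with the fact that the website is an html xmls document?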

I have scraped several websites before and have not run into a problem like this until now.

Answer 1

The data and the tag you want do not appear in the page's HTML source, which means they are added by JavaScript after the page loads. You can either use Selenium to get the page source after it has been rendered, or use requests_html, which has an API similar to BeautifulSoup's and has the option to render a page's JavaScript before you scrape it.

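from requests_html import HTMLSession

# URL taken from the question
url = 'https://www.sosubaru.com/new-inventory/index.htm?start=0'

s = HTMLSession()
r = s.get(url)
r.html.render()  # run the page's JavaScript (downloads Chromium on first use)
# Search the rendered HTML via r.html.find; swap in whatever selector you need
spans = r.html.find('span.d-none.d-sm-inline')
print(spans[0].text if spans else 'span not found')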
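
For the Selenium route mentioned above, a minimal sketch might look like the following. It assumes Chrome and a matching driver are available on the machine. The helper names (`fetch_rendered_html`, `parse_vehicle_count`, `page_starts`) and the 15-second timeout are made up for illustration. The page size of 10 is taken from the `limit` variable in the question and has not been checked against the site.

```python
import re


def fetch_rendered_html(url, css_selector, timeout=15):
    """Load `url` in a real browser and return the HTML after JavaScript has run."""
    # Imported here so the helpers below stay usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        driver.get(url)
        # Wait until the JavaScript-inserted element is present in the DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()


def parse_vehicle_count(label):
    """Turn a label such as '166 Vehicles' into the integer 166 (None if no number)."""
    match = re.search(r"\d[\d,]*", label or "")
    return int(match.group().replace(",", "")) if match else None


def page_starts(total, limit=10):
    """Values for the `start` query parameter, one per results page."""
    return list(range(0, total, limit))
```

The string returned by `fetch_rendered_html` can be passed to `BeautifulSoup(html, 'html.parser')` and queried with the same `select` call as in the question. Once the span text is available, `parse_vehicle_count` turns it into an integer and `page_starts` yields the `start` values to loop over.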
