Scraping w/ Selenium returns "None"

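I'm trying to use Selenium to scrape company profile pages from Capterra. Capterra loads profile pages in batches of 25. My code is able to get the first 5, but then returns "None" for the other 20 on the page.

Code: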

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.firefox import GeckoDriverManager


driver = webdriver.Firefox()
driver.get("https://www.capterra.com/waste-management-software/")

page = bs(driver.page_source, 'html.parser')
# Hits "Show More" button
driver.find_element(By.XPATH, "//*[contains(text(), 'Show More')]").click()
# Grabs Company portfolio page links
plinks = [div.a for div in page.findAll("div", attrs={"class" : "nb-mb-0"})]

for link in plinks:
    print(link)

driver.close()

Output:

<a class="nb-thumbnail nb-relative nb-thumbnail-medium nb-thumbnail-interactive" href="/p/81310/AMCS/"><img alt="" class="nb-max-h-full" loading="lazy" src="https://gdm-catalog-fmapi-prod.imgix.net/ProductLogo/946474e4-bd54-451d-bbaf-9c5602b2f399.png?auto=compress%2Cformat&amp;w=180&amp;h=180"/></a>
<a class="nb-thumbnail nb-relative nb-thumbnail-medium nb-thumbnail-interactive" href="/p/103755/HazMat-T-T/"><img alt="" class="nb-max-h-full" loading="lazy" src="https://gdm-catalog-fmapi-prod.imgix.net/ProductLogo/838db9d8-c251-4d78-aa69-a9cd745ef6b9.png?auto=compress%2Cformat&amp;w=180&amp;h=180"/></a>
<a class="nb-thumbnail nb-relative nb-thumbnail-medium nb-thumbnail-interactive" href="/p/79230/WAM-Hauler-Easy-Bill-Route/"><img alt="" class="nb-max-h-full" loading="lazy" src="https://gdm-catalog-fmapi-prod.imgix.net/ProductLogo/0820f6ea-9d9d-4062-987b-a3fcf25f2813.png?auto=compress%2Cformat&amp;w=180&amp;h=180"/></a>
<a class="nb-thumbnail nb-relative nb-thumbnail-medium nb-thumbnail-interactive" href="/p/152697/Waste-Management-Software/"><img alt="" class="nb-max-h-full" loading="lazy" src="https://gdm-catalog-fmapi-prod.imgix.net/ProductLogo/64597b5d-84e5-464c-ae60-84a1c5ad4976.png?auto=compress%2Cformat&amp;w=180&amp;h=180"/></a>
<a class="nb-thumbnail nb-relative nb-thumbnail-medium nb-thumbnail-interactive" href="/p/177472/Via-Analytics/"><img alt="" class="nb-max-h-full" loading="lazy" src="https://gdm-catalog-fmapi-prod.imgix.net/ProductLogo/c20bf8d6-88cc-49d5-8424-b724ba734d4a.png?auto=compress%2Cformat&amp;w=180&amp;h=180"/></a>
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None

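The output I actually need is the hrefs containing "/p/". After that, the script should click the "Show More" button on the page, collect the next 25 links, click the button again, and so on.

Thanks!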

You don't need Selenium. The site exposes an API that you can query directly: a single request returns all 125 objects you need.

import requests

# Browser-like request headers
headers = {
    'accept': '*/*',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8,es;q=0.7,ru;q=0.6',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}
# htmlName is the category slug from the page URL
params = {'htmlName': 'waste-management-software', 'countryCode': 'ES'}
base_url = "https://www.capterra.com/p/"
response = requests.get('https://www.capterra.com/directoryPage/rest/v1/category', params=params, headers=headers)
# Parse the JSON body (avoids rebinding the name of the json module)
data = response.json()

products = data["pageData"]["categoryData"]["products"]
print("Total elements: " + str(len(products)))
for product in products:
    print("Name: " + product["product_name"])
    # Profile page link in the "/p/<id>/<slug>/" form the question asks for
    print("URL: " + base_url + str(product["product_id"]) + "/" + product["product_slug"] + "/")
    print("Product url: " + product["product_url"])
    print("Image: " + product["logo_filepath"])
    print("Rating: " + str(product["rating"]))
    print()

Output:

Total elements: 125
Name: FAMA
URL: https://www.capterra.com/p/86768/FAMA/
Product url: https://info.gartnerdigitalmarkets.com/fama-es-gdm-lp
Image: https://gdm-catalog-fmapi-prod.imgix.net/ProductLogo/7a7a8467-9a2c-40d9-8488-7d6c3c0dec52.jpeg
Rating: 3.6

Name: Quentic
URL: https://www.capterra.com/p/127188/Quentic/
Product url: https://go.quentic.com/hazardous-materials-management-software
Image: https://gdm-catalog-fmapi-prod.imgix.net/ProductLogo/ba5e26a7-375d-4423-a1f2-68a27d5318c5.png
Rating: 4.8
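
Since the question only asks for the "/p/" profile links, the URL-building step can be pulled out into a small helper. This is a minimal sketch, not part of the original answer: the function name `profile_urls` and the `sample_products` list are hypothetical, and the sample records assume that `product_id` and `product_slug` have the values implied by the two URLs printed above. The third record is invented only to show the skip logic.

```python
def profile_urls(products, base_url="https://www.capterra.com/p/"):
    """Build profile URLs of the form <base_url><product_id>/<product_slug>/."""
    urls = []
    for product in products:
        product_id = product.get("product_id")
        slug = product.get("product_slug")
        # Skip records missing either field instead of emitting a broken link.
        if product_id is None or not slug:
            continue
        urls.append(f"{base_url}{product_id}/{slug}/")
    return urls


# Hypothetical sample records, shaped like the entries of the "products" list
# (only the two fields used here are included).
sample_products = [
    {"product_id": 86768, "product_slug": "FAMA"},
    {"product_id": 127188, "product_slug": "Quentic"},
    {"product_id": 1, "product_slug": None},  # invented incomplete record
]

for url in profile_urls(sample_products):
    print(url)
```

With a live response, the same helper could be called on the `products` list parsed from the API instead of on the sample data.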
