
How to scrape the product names from the website through Selenium?

I am trying to scrape this page: https://redmart.com/fresh-produce/fresh-vegetables but the problem I am facing is that it only returns some of the elements. The code I am using is below:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium import webdriver

# Start the WebDriver and load the page
wd = webdriver.Chrome(executable_path=r"C:\Chrome\chromedriver.exe")
wd.get('https://redmart.com/fresh-produce/fresh-vegetables')

# Wait for the dynamically loaded elements to show up
WebDriverWait(wd, 300).until(
    EC.visibility_of_element_located((By.CLASS_NAME, "productDescriptionAndPrice")))

# And grab the page HTML source
html_page = wd.page_source
wd.quit()

# Now you can use html_page as you like
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_page, 'lxml')
print(soup)

I need to use Selenium because the raw source is useless: the page is generated by JavaScript. If you open the page, it has about 60 rows of products (about 360 products in total). Running this code only gives me 6 rows of products; it stops at Yellow Onions.

Thanks!

Given your question and the website https://redmart.com/fresh-produce/fresh-vegetables, Selenium can easily scrape all the product names. As you mentioned, there are about 360 products in total, but only about 35 come from this particular category, so here is a solution for you:

  • Code block:

     from selenium import webdriver
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.support import expected_conditions as EC

     item_names = []
     options = webdriver.ChromeOptions()
     options.add_argument("start-maximized")
     options.add_argument('disable-infobars')
     driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\\Utility\\BrowserDrivers\\chromedriver.exe')
     driver.get("https://redmart.com/fresh-produce/fresh-vegetables")
     titles = WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='productDescriptionAndPrice']//h4/a")))
     for title in titles:
         item_names.append(title.text)
     try:
         driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
         titles = WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='productDescriptionAndPrice']//h4/a")))
         for title in titles:
             item_names.append(title.text)
     except:
         pass
     for item_name in item_names:
         print(item_name)
     driver.quit()
  • Console output:

     Eco Leaf Baby Spinach Fresh Vegetable
     Eco Leaf Kale Fresh Vegetable
     Sustenir Agriculture Almighty Arugula
     Sustenir Fresh Toscano Black Kale
     Sustenir Fresh Kinky Green Curly Kale
     ThyGrace Honey Cherry Tomato
     Australian Broccoli
     Sustenir Agriculture Italian Basil
     GIVVO Japanese Cucumbers
     YUVVO Red Onions
     Australian Cauliflower
     YUVVO Spring Onion
     GIVVO Old Ginger
     GIVVO Cherry Grape Tomatoes
     YUVVO Holland Potato
     ThyGrace Traffic Light Capsicum Bell Peppers
     GIVVO Whole Garlic
     GIVVO Celery
     Eco Leaf Baby Spinach Fresh Vegetable
     Eco Leaf Kale Fresh Vegetable
     Sustenir Agriculture Almighty Arugula
     Sustenir Fresh Toscano Black Kale
     Sustenir Fresh Kinky Green Curly Kale
     ThyGrace Honey Cherry Tomato
     Australian Broccoli
     Sustenir Agriculture Italian Basil
     GIVVO Japanese Cucumbers
     YUVVO Red Onions
     Australian Cauliflower
     YUVVO Spring Onion
     GIVVO Old Ginger
     GIVVO Cherry Grape Tomatoes
     YUVVO Holland Potato
     ThyGrace Traffic Light Capsicum Bell Peppers
     GIVVO Whole Garlic
     GIVVO Celery

Note: you can construct a more robust XPATH or CSS-SELECTOR to include more products and extract the relevant product names.
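As a sketch of that note: the XPath in the answer assumes each product link is nested as a `div` with class `productDescriptionAndPrice`, then `h4`, then `a`. The same extraction can be tried offline against a small, hypothetical HTML fragment using only the standard library (the product names below are illustrative, not scraped):

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment mirroring the structure the answer's
# XPath targets: //div[@class='productDescriptionAndPrice']//h4/a
fragment = """
<root>
  <div class="productDescriptionAndPrice"><h4><a>Australian Broccoli</a></h4></div>
  <div class="productDescriptionAndPrice"><h4><a>YUVVO Red Onions</a></h4></div>
  <div class="somethingElse"><h4><a>Not a product</a></h4></div>
</root>
"""

root = ET.fromstring(fragment)
# ElementTree supports a limited XPath subset, including [@attrib='value']
names = [
    a.text
    for div in root.findall(".//div[@class='productDescriptionAndPrice']")
    for a in div.findall(".//h4/a")
]
print(names)  # ['Australian Broccoli', 'YUVVO Red Onions']
```

The third `div` is deliberately excluded by the class predicate, which is the behaviour the Selenium XPath relies on.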

Here is some working Java code. The test waits until more than 30 elements are present.

@Test
public void test1() {
    List<WebElement> found = new WebDriverWait(driver, 300).until(wd -> {
        List<WebElement> elements = driver.findElements(By.className("productDescriptionAndPrice"));
        if(elements.size() > 30)
            return elements ;
        ((JavascriptExecutor) driver).executeScript("window.scrollTo(0, document.body.offsetHeight)");
        return null;
    });
    for (WebElement e : found) {
        System.out.println(e.getText());
    }
}

Hi DebanjanB, thanks for your help. I spent a whole day trying this. The real problem is getting the complete product list into the source. If everything were in the source, I think it could be extracted. I believe the source changes as you scroll down, and maybe that is why we can only extract 36 items.

With that in mind, my tentative solution is below. It is not perfect, because I have to do further processing later to remove duplicates. If you have other ideas, or can optimise it further, I would appreciate it.

The general idea is to scroll down, grab the source, and keep appending it to build one big, overlapping source. I got 1400+ products this way for a 360-product page, which is why I say it is a bad solution.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium import webdriver
import time
from bs4 import BeautifulSoup

# Start the WebDriver and load the page
wd = webdriver.Chrome(executable_path=r"C:\Chrome\chromedriver.exe")
wd.delete_all_cookies()
wd.set_page_load_timeout(30)

wd.get('https://redmart.com/fresh-produce/fresh-vegetables#toggle=all')
time.sleep(5)

html_page = wd.page_source
soup = BeautifulSoup(html_page, 'lxml')

# Record the initial height so the break condition below has a baseline
last_height = wd.execute_script("return document.body.scrollHeight")

while True:
    wd.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(3)
    html_page = wd.page_source
    soup2 = BeautifulSoup(html_page, 'lxml')

    # Copy a static list of nodes first; append() moves elements out of
    # soup2's tree, which would otherwise mutate it mid-iteration
    for element in list(soup2.body.children):
        soup.body.append(element)
    time.sleep(2)

    #break condition
    new_height = wd.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
wd.quit()

results = soup.findAll('div', attrs={'class': 'productDescriptionAndPrice'})
len(results)
results[0] # tallies with the first product
results[-1] # tallies with the last
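Since the stitched-together source contains heavy overlap, the duplicates mentioned above can be removed afterwards while keeping first-seen order. A minimal sketch, with a hypothetical list standing in for the names extracted from `results`:

```python
# Hypothetical scraped names containing repeats, as described above
item_names = ["Australian Broccoli", "YUVVO Red Onions",
              "Australian Broccoli", "GIVVO Celery", "YUVVO Red Onions"]

# dict.fromkeys keeps first-seen insertion order (Python 3.7+) and drops repeats
unique_names = list(dict.fromkeys(item_names))
print(unique_names)  # ['Australian Broccoli', 'YUVVO Red Onions', 'GIVVO Celery']
```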

Honestly, I am very disappointed with this solution. Thanks, and please keep the suggestions coming!
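One further idea in the same spirit: instead of appending whole page sources and de-duplicating later, the names could be extracted after each scroll and merged into an ordered set as you go. A sketch of that accumulation logic, with the per-scroll snapshots faked as plain lists (on the real page they would come from the `productDescriptionAndPrice` divs after each scroll):

```python
# Hypothetical per-scroll snapshots of the visible product names;
# consecutive views overlap, just like the real infinite-scroll page
snapshots = [
    ["Eco Leaf Baby Spinach", "Australian Broccoli"],
    ["Australian Broccoli", "YUVVO Red Onions"],   # overlaps the previous view
    ["YUVVO Red Onions", "GIVVO Celery"],
]

seen = {}
for snapshot in snapshots:
    for name in snapshot:
        seen.setdefault(name, True)   # first occurrence wins; order preserved

all_names = list(seen)
print(all_names)
```

This keeps memory proportional to the number of distinct products rather than the number of scrolls, and no post-processing pass is needed.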

Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you need to repost, please cite this site's URL or the original source. For any questions, contact: yoyou2525@163.com.

 