
Incomplete Scraping on Webpage using Selenium Scrolling

I am trying to scrape product data from a website that renders its HTML with JavaScript. I used Selenium with a routine that scrolls to the end of the page and waits for it to load, but I still only get the first 8 products on the site.

Here is my code:

import time
from bs4 import BeautifulSoup as BS
from selenium import webdriver

wd = webdriver.Chrome()
url = 'https://www.faire.com/retailer/r_9vkjixqbpq/category/Beauty%20&%20Wellness/subcategory/Bath%20&%20Body?filters=sorting%3Afeatured'
wd.get(url)

last_height = wd.execute_script("return document.body.scrollHeight")
time.sleep(3)

while True:
    # Scroll down to the bottom
    wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait for the page to load
    time.sleep(20)

    # Calculate the new scroll height and compare it with the last scroll height
    new_height = wd.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

html = wd.page_source
soup = BS(html, 'lxml')
listings = soup.find('div', {'class': 'MarketplaceProductList__Wrapper-sc-3mfb9g-0 fWFKvm'}).findAll('div', {'class': 'MarketplaceProductList__TileWrapper-sc-3mfb9g-2 fQbbTY'})

for item in listings:
    product_name = item.find('span', {'class': 'FallbackHandler__ContentParentWrapper-sk18il-2 jHwpkw'}).get_text(strip=True)
    print(product_name)

How can I extract the information for every product on the page? Thanks!

The problem is that you have to load part of the page and then reach its end in order to get the full page source. So I split the page into three parts and scrolled through 2/3 of it. You could also scroll all the way to the end of the page, but why waste the time and memory.

Almost all of the scraping can be done with Selenium itself, since it gives cleaner output than BS. But if you use the re module together with BS, you can make that output look just as good (I'll leave the choice to you)!
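As a rough sketch of that re-plus-BS cleanup idea (the sample span texts below are made up for illustration; on the real page they would come from BS's find_all('span')):

```python
import re

# Hypothetical span texts, standing in for what soup.find_all('span') returns
span_texts = [
    "Lavender Bath Bomb\nMSRP $12.00",
    "Follow",
    "Rose Body Scrub\nMSRP $18.50",
]

products = []
for text in span_texts:
    # Pull a (name, price) pair out of any span that mentions an MSRP price
    match = re.search(r'(?P<name>.+?)\s*MSRP \$(?P<price>[\d.]+)', text, re.S)
    if match:
        products.append((match.group('name').strip(), float(match.group('price'))))

print(products)
```

Spans without a price (follow buttons, labels, etc.) simply fail the regex and are skipped.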

Using Selenium

from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By

wd = webdriver.Chrome()
url = 'https://www.faire.com/retailer/r_9vkjixqbpq/category/Beauty%20&%20Wellness/subcategory/Bath%20&%20Body?filters=sorting%3Afeatured'
wd.get(url)
sleep(3)
height = wd.execute_script("return document.documentElement.scrollHeight;")

# First scroll, to load the first 1/3 of the page
wd.execute_script('window.scrollTo(0, arguments[0]);', height / 3)
sleep(3)

# Scroll to the next 1/3 of the page
wd.execute_script('window.scrollTo(0, arguments[0]);', 2 * height / 3)
sleep(3)

# By now all the scripts will have loaded

# The span tags hold all the details you'll need
details = [i.text for i in wd.find_elements(By.TAG_NAME, 'span') if len(i.text) > 0 and '\n' in i.text]
wd.quit()

for i in details:
    print(i)
    print()

Output

(screenshot of the Selenium output)

Using BeautifulSoup

from time import sleep
from bs4 import BeautifulSoup as BS
from selenium import webdriver

wd = webdriver.Chrome()
url = 'https://www.faire.com/retailer/r_9vkjixqbpq/category/Beauty%20&%20Wellness/subcategory/Bath%20&%20Body?filters=sorting%3Afeatured'
wd.get(url)
sleep(3)
height = wd.execute_script("return document.documentElement.scrollHeight;")

# First scroll, to load the first 1/3 of the page
wd.execute_script('window.scrollTo(0, arguments[0]);', height / 3)
sleep(3)

# Scroll to the next 1/3 of the page
wd.execute_script('window.scrollTo(0, arguments[0]);', 2 * height / 3)
sleep(3)

# By now all the scripts will have loaded

soup = BS(wd.page_source, 'lxml')
wd.quit()

# The span tags hold all the details you'll need
details = [i.text for i in soup.find_all('span') if len(i.text) > 20 and 'MSRP $' in i.text]

# Remove duplicates while keeping order
details = list(dict.fromkeys(details))

for i in details:
    print(i)
    print()
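One note on the list(dict.fromkeys(details)) line above: it drops duplicates while preserving first-seen order, because dict keys are unique and (in Python 3.7+) keep insertion order; a plain set() would lose the ordering. A minimal example:

```python
details = ['Soap\nMSRP $5.00', 'Lotion\nMSRP $9.00', 'Soap\nMSRP $5.00']

# dict.fromkeys keeps only the first occurrence of each value,
# in the order the values were first seen
deduped = list(dict.fromkeys(details))
print(deduped)
```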

Output

(screenshot of the BS output)
