
Incomplete Scraping on Webpage using Selenium Scrolling

I am trying to scrape product data from a website that renders its HTML with JavaScript. I used Selenium with a routine that scrolls to the end of the page and waits for it to load, but I still only get the first 8 products on the site.

Here is my code:

import time
from bs4 import BeautifulSoup as BS
from selenium import webdriver

wd = webdriver.Chrome()
url = 'https://www.faire.com/retailer/r_9vkjixqbpq/category/Beauty%20&%20Wellness/subcategory/Bath%20&%20Body?filters=sorting%3Afeatured'
wd.get(url)

last_height = wd.execute_script("return document.body.scrollHeight")
time.sleep(3)

while True:
    # Scroll down to the bottom
    wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait for the page to load
    time.sleep(20)

    # Calculate the new scroll height and compare it with the last scroll height
    new_height = wd.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

html = wd.page_source
soup = BS(html, 'lxml')
listings = soup.find('div', {'class': 'MarketplaceProductList__Wrapper-sc-3mfb9g-0 fWFKvm'}).findAll('div', {'class': 'MarketplaceProductList__TileWrapper-sc-3mfb9g-2 fQbbTY'})

for item in listings:
    product_name = item.find('span', {'class': 'FallbackHandler__ContentParentWrapper-sk18il-2 jHwpkw'}).get_text(strip=True)
    print(product_name)

How can I extract the information for every product on the page? Thanks!

The problem is that you have to load part of the page and then reach its end in order to get the full page source. So I split the page into three parts and scrolled through 2/3 of it. You could also scroll all the way to the end of the page, but why waste the time and memory.

Almost all of the scraping can be done with Selenium itself, since it gives cleaner output than BS. But if you use the re module together with BS, you can make that output look just as good (I'll leave the choice to you)!
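As a rough sketch of that re-plus-BS cleanup idea (the sample span texts below are made up for illustration; on the real page they would come from BS's find_all('span')):

```python
import re

# Hypothetical span texts, standing in for what soup.find_all('span') returns
span_texts = [
    "Lavender Bath Bomb\nMSRP $12.00",
    "Follow",
    "Rose Body Scrub\nMSRP $18.50",
]

products = []
for text in span_texts:
    # Pull a (name, price) pair out of any span that mentions an MSRP price
    match = re.search(r'(?P<name>.+?)\s*MSRP \$(?P<price>[\d.]+)', text, re.S)
    if match:
        products.append((match.group('name').strip(), float(match.group('price'))))

print(products)
```

Spans without a price (follow buttons, labels, etc.) simply fail the regex and are skipped.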

Using Selenium

from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By

wd = webdriver.Chrome()
url = 'https://www.faire.com/retailer/r_9vkjixqbpq/category/Beauty%20&%20Wellness/subcategory/Bath%20&%20Body?filters=sorting%3Afeatured'
wd.get(url)
sleep(3)
height = wd.execute_script("return document.documentElement.scrollHeight;")

# First scroll, to load the first 1/3 of the page
wd.execute_script('window.scrollTo(0, arguments[0]);', height / 3)
sleep(3)

# Scroll to the next 1/3 of the page
wd.execute_script('window.scrollTo(0, arguments[0]);', 2 * height / 3)
sleep(3)

# By now all the scripts will have loaded

# The span tags hold all the details you'll need
details = [i.text for i in wd.find_elements(By.TAG_NAME, 'span') if len(i.text) > 0 and '\n' in i.text]
wd.quit()

for i in details:
    print(i)
    print()

Output

(screenshot of the Selenium output)

Using BeautifulSoup

from time import sleep
from bs4 import BeautifulSoup as BS
from selenium import webdriver

wd = webdriver.Chrome()
url = 'https://www.faire.com/retailer/r_9vkjixqbpq/category/Beauty%20&%20Wellness/subcategory/Bath%20&%20Body?filters=sorting%3Afeatured'
wd.get(url)
sleep(3)
height = wd.execute_script("return document.documentElement.scrollHeight;")

# First scroll, to load the first 1/3 of the page
wd.execute_script('window.scrollTo(0, arguments[0]);', height / 3)
sleep(3)

# Scroll to the next 1/3 of the page
wd.execute_script('window.scrollTo(0, arguments[0]);', 2 * height / 3)
sleep(3)

# By now all the scripts will have loaded

soup = BS(wd.page_source, 'lxml')
wd.quit()

# The span tags hold all the details you'll need
details = [i.text for i in soup.find_all('span') if len(i.text) > 20 and 'MSRP $' in i.text]

# Remove duplicates while keeping order
details = list(dict.fromkeys(details))

for i in details:
    print(i)
    print()
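One note on the list(dict.fromkeys(details)) line above: it drops duplicates while preserving first-seen order, because dict keys are unique and (in Python 3.7+) keep insertion order; a plain set() would lose the ordering. A minimal example:

```python
details = ['Soap\nMSRP $5.00', 'Lotion\nMSRP $9.00', 'Soap\nMSRP $5.00']

# dict.fromkeys keeps only the first occurrence of each value,
# in the order the values were first seen
deduped = list(dict.fromkeys(details))
print(deduped)
```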

Output

(screenshot of the BS output)
