Scraping webpage with a table rendered using javascript utilizing Selenium Webdriver
Incomplete Scraping on Webpage using Selenium Scrolling
I am trying to scrape product data from a website whose HTML is rendered with JavaScript. I used Selenium with a routine that scrolls to the end of the page and waits for it to reload, but I can still only scrape the first 8 products on the site.
Here is my code:
import time
from bs4 import BeautifulSoup as BS
from selenium import webdriver

wd = webdriver.Chrome()  # or any other WebDriver

url = 'https://www.faire.com/retailer/r_9vkjixqbpq/category/Beauty%20&%20Wellness/subcategory/Bath%20&%20Body?filters=sorting%3Afeatured'
wd.get(url)
last_height = wd.execute_script("return document.documentElement.scrollHeight")
time.sleep(3)

while True:
    # Scroll down to bottom
    wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(20)
    # Calculate new scroll height and compare with last scroll height
    new_height = wd.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

html = wd.page_source
soup = BS(html, 'lxml')
listings = soup.find('div', {'class': 'MarketplaceProductList__Wrapper-sc-3mfb9g-0 fWFKvm'}).findAll('div', {'class': 'MarketplaceProductList__TileWrapper-sc-3mfb9g-2 fQbbTY'})
for item in listings:
    product_name = item.find('span', {"class": "FallbackHandler__ContentParentWrapper-sk18il-2 jHwpkw"}).get_text(strip=True)
    print(product_name)
How can I extract the information for every product on the page? Thanks!
The problem is that you have to let the page load in parts and reach the end of the page before the whole page source is available. So I split the page into three parts and only scrolled through 2/3 of it. You could also scroll all the way to the end, but why waste the time and memory?
Almost all of the scraping can be done with Selenium itself, since it gives cleaner output compared to BS. But if you use the re module together with BS, you can make that output look just as nice (I leave the choice to you)!
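As a rough illustration of that re-plus-BS cleanup (the sample span text below is made up, not taken from the live page), a regex can pull the name and price out of a combined tile string:

```python
import re

# Hypothetical span text in the shape a product tile might produce:
# name, brand, and an "MSRP $..." price joined by newlines.
sample = "Lavender Bath Bomb\nAcme Botanicals\nMSRP $12.50"

# Capture the first line as the name and the digits after "MSRP $" as the price.
match = re.search(r"^(?P<name>[^\n]+).*MSRP \$(?P<price>[\d.]+)", sample, re.S)
if match:
    print(match.group("name"))   # product name
    print(match.group("price"))  # price as a string
```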
使用 Selenium
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By

wd = webdriver.Chrome()  # or any other WebDriver

url = 'https://www.faire.com/retailer/r_9vkjixqbpq/category/Beauty%20&%20Wellness/subcategory/Bath%20&%20Body?filters=sorting%3Afeatured'
wd.get(url)
sleep(3)

height = wd.execute_script("return document.documentElement.scrollHeight;")
# first scroll to load the first 1/3 of the page
wd.execute_script('window.scrollTo(0, arguments[0]);', height / 3)
sleep(3)
# scroll to the next 1/3 of the page
wd.execute_script('window.scrollTo(0, arguments[0]);', height / 3 + height / 3)
sleep(3)
# by now all the scripts should have loaded

# the span tags hold all the details you'll need
# (find_elements_by_tag_name in Selenium 3 and earlier)
details = [i.text for i in wd.find_elements(By.TAG_NAME, 'span') if len(i.text) > 0 and '\n' in i.text]
wd.quit()

for i in details:
    print(i)
    print()
Output
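The list-comprehension filter above keeps only span texts that are non-empty and multi-line, which is what distinguishes the full product tiles from stray labels. A stand-alone check of that condition with made-up span texts:

```python
# Made-up span texts: only non-empty, multi-line entries survive the filter.
span_texts = ["", "Add to cart", "Lavender Bath Bomb\nMSRP $12.50", "Sea Salt Scrub\nMSRP $9.00"]

details = [t for t in span_texts if len(t) > 0 and '\n' in t]
print(details)  # only the two product tiles remain
```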
Using Beautiful Soup
from time import sleep
from bs4 import BeautifulSoup as BS
from selenium import webdriver

wd = webdriver.Chrome()  # or any other WebDriver

url = 'https://www.faire.com/retailer/r_9vkjixqbpq/category/Beauty%20&%20Wellness/subcategory/Bath%20&%20Body?filters=sorting%3Afeatured'
wd.get(url)
sleep(3)

height = wd.execute_script("return document.documentElement.scrollHeight;")
# first scroll to load the first 1/3 of the page
wd.execute_script('window.scrollTo(0, arguments[0]);', height / 3)
sleep(3)
# scroll to the next 1/3 of the page
wd.execute_script('window.scrollTo(0, arguments[0]);', height / 3 + height / 3)
sleep(3)
# by now all the scripts should have loaded

soup = BS(wd.page_source, 'lxml')
wd.quit()

# the span tags hold all the details you'll need
details = [i.text for i in soup.find_all('span') if len(i.text) > 20 and 'MSRP $' in i.text]
# removing duplicates while keeping order
details = list(dict.fromkeys(details))

for i in details:
    print(i)
    print()
Output
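The dict.fromkeys() trick used above removes duplicates while preserving first-seen order (dict keys are unique, and dicts keep insertion order in Python 3.7+), unlike set(), which may reorder the items. A minimal stand-alone example with made-up product strings:

```python
# Duplicate product strings in scrape order.
details = ["Bath Bomb", "Sea Salt Scrub", "Bath Bomb", "Body Butter", "Sea Salt Scrub"]

# dict keys are unique and keep insertion order, so this dedupes in order.
deduped = list(dict.fromkeys(details))
print(deduped)  # ['Bath Bomb', 'Sea Salt Scrub', 'Body Butter']
```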
Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you need to repost, please credit this site or link to the original. For any questions, contact yoyou2525@163.com.