
Web Scraping issue on page load with infinite scroll

I have to scrape an e-commerce website that loads 45 products on the first page and loads an additional 45 products each time you scroll to the end of the page.

I am using Python with the Selenium WebDriver to scrape this page.

The Ajax call seems to replace the container on every subsequent reload, so I am not able to extract all the data once all the products have loaded.

Attaching the code for your reference. Please guide me on how to scrape all the products.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import pandas

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get("https://www.ajio.com/women-jackets-coats/c/830316012")   # load the category page
assert 'Ajio' in driver.title
content = driver.find_elements_by_class_name('item')   # products rendered so far
totalitems = int(driver.find_element_by_class_name('length').text.strip(' Items Found').replace(',', ''))   # e.g. "45,595 Items Found"

loop_count = int((totalitems - len(content)) / len(content))   # additional loads needed to reach totalitems

print(loop_count)

data=[]
row=['Brand','Description','Offer_Price','Original_Price','Discount']
data.append(row)

for i in range(1,loop_count):
    content = driver.find_elements_by_class_name('item')   # re-read the item list after each load
    print(i)
    print(len(content))

    for item in content:
        row=[]
        row.append(item.find_element_by_class_name('brand').text.strip())
        row.append(item.find_element_by_class_name('name').text.strip())
        row.append(item.find_element_by_class_name('price').text.strip().strip('Rs. '))
        try:
            row.append(item.find_element_by_class_name('orginal-price').text.strip('Rs. '))
        except NoSuchElementException as exception:
            row.append(item.find_element_by_class_name('price').text.strip('Rs. '))

        try:
            row.append(item.find_element_by_class_name('discount').text.strip())
        except NoSuchElementException as exception:
            row.append("No Discount")

        data.append(row)
    # Scroll near the bottom to trigger the next Ajax load, then wait for the loading spinner.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight-850);")
    try:
        myElem = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CLASS_NAME, 'loader')))
    except TimeoutException:
        print("Loading took too much time!")



df = pandas.DataFrame(data)
df.to_csv(r"C:\Ajio.csv", sep=',',index=False, header=False, mode='w')   #mode='a' for append

It sounds like the problem you're having is an inconsistency in the data you're scraping across subsequent reloads/scrolls.

One solution would be to keep a data structure outside the scraping loop that records the items you've seen so far. As the page reloads/scrolls, check whether each item already exists in that structure; if it doesn't, add it, and keep going until you can be sure you've hit every item on the page.
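
For example, here is a minimal sketch of that idea, reusing the item/brand/name class names from your code. The brand+name pair is only an illustrative uniqueness key; a product ID or the product URL would be more robust.

seen = set()   # keys of products already recorded
data = []

def collect_visible_items(driver):
    """Record any currently rendered products that we have not seen before."""
    for item in driver.find_elements_by_class_name('item'):
        brand = item.find_element_by_class_name('brand').text.strip()
        name = item.find_element_by_class_name('name').text.strip()
        key = (brand, name)          # illustrative uniqueness key
        if key in seen:              # already collected on an earlier pass
            continue
        seen.add(key)
        data.append([brand, name])

# Call collect_visible_items(driver) after every scroll; stop scrolling once
# len(seen) stops growing between passes.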

Good Luck!

This is not an answer to the question as asked, however I have resolved the issue. Instead of using Selenium here, I used Requests to call the API that returns JSON, and then read the JSON. That solved my purpose for web scraping. Reading the JSON was much faster than reading the website with Selenium. I passed the page number as a parameter to the JSON API to load the next page's data.
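
Roughly, that approach looks like the sketch below. The endpoint URL, query parameters and JSON key names are only placeholders for illustration; the real values have to be copied from the request the site makes, visible in the browser's network tab.

import requests

# URL, parameter and key names here are assumptions for illustration only;
# copy the real ones from the network tab of the browser's developer tools.
API_URL = "https://www.ajio.com/api/category/830316012"   # hypothetical endpoint

products = []
page = 0
while True:
    resp = requests.get(API_URL, params={"currentPage": page, "pageSize": 45})
    resp.raise_for_status()
    items = resp.json().get("products", [])   # JSON key assumed for illustration
    if not items:                              # no more pages, stop
        break
    products.extend(items)
    page += 1

print(len(products), "products collected")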
