I am trying to scrape some data from a website. This website has a 'load more products' button. I'm using:
driver.find_element_by_xpath("""//*[@id="showmoreresult"]""").click()
to click the button, and this loops for a set number of iterations.
The problem I'm running into is that once those iterations have completed, I want to extract text from the webpage using:
posts = driver.find_elements_by_class_name("hotProductDetails")
However, this seems to crash Chrome, and so I can't get any data out. What I'd like to do is populate posts with the new products that have loaded after each iteration.
After 'Load More' has been clicked, I want to grab the text from the 50 products that have just loaded, append it to my list, and continue.
I can run the line posts = driver.find_elements_by_class_name("hotProductDetails") within each iteration, but it grabs every element on the page every time, which really slows the process down.
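For illustration, the shape of what I'm after inside the loop is something like this sketch (seen is a hypothetical counter of how many products I've already processed, so only the new batch gets its text read):

    # Sketch only: re-find all elements, but only read text from the new ones
    posts = driver.find_elements_by_class_name("hotProductDetails")
    for post in posts[seen:]:  # skip products already collected
        postlist.append(post.text)
    seen = len(posts)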
Is there any way of achieving this in Selenium, or am I limited by this library?
This is the full script:
import csv
import time
from selenium import webdriver
import pandas as pd

url = '...'  # placeholder: the target URL was not included in the post

def CeXScrape():
    print('Loading Chrome...')
    chromepath = r"C:\Users\leonK\Documents\Python Scripts\chromedriver.exe"
    driver = webdriver.Chrome(chromepath)
    driver.get(url)

    print('Prepping Webpage...')
    time.sleep(2)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Click 'Load More' up to 1000 times; give up after two consecutive failures
    y = 0
    BreakClause = ExceptCheck = False
    while y < 1000 and BreakClause == False:
        y += 1
        time.sleep(0.5)
        try:
            driver.find_element_by_xpath("""//*[@id="showmoreresult"]""").click()
            ExceptCheck = False
            print('Load Count', y, '...')
        except:
            if ExceptCheck: BreakClause = True
            else: ExceptCheck = True
            print('Load Count', y, '...Lag...')
            time.sleep(2)
            continue

    # This is the step that crashes Chrome once everything has loaded
    print('Grabbing Elements...')
    posts = driver.find_elements_by_class_name("hotProductDetails")
    cats = driver.find_elements_by_class_name("superCatLink")

    print('Generating lists...')
    catlist = []
    postlist = []

    for cat in cats: catlist.append(cat.text)
    print('Categories Complete...')

    for post in posts: postlist.append(post.text)
    print('Products Complete...')

    return postlist, catlist

prods, cats = CeXScrape()

print('Extracting Lists...')

cat = []
subcat = []
prodname = []
sellprice = []
buycash = []
buyvoucher = []

for c in cats:
    cat.append(c.split('/')[0])
    subcat.append(c.split('/')[1])

for p in prods:
    prodname.append(p.split('\n')[0])
    sellprice.append(p.split('\n')[2])
    if 'WeBuy' in p:
        buycash.append(p.split('\n')[4])
        buyvoucher.append(p.split('\n')[6])
    else:
        buycash.append('NaN')
        buyvoucher.append('NaN')

print('Generating Dataframe...')

df = pd.DataFrame(
    {'Category' : cat,
     'Sub Category' : subcat,
     'Product Name' : prodname,
     'Sell Price' : sellprice,
     'Cash Buy Price' : buycash,
     'Voucher Buy Price' : buyvoucher})

print('Writing to csv...')
df.to_csv('Data.csv', sep=',', encoding='utf-8')
print('Completed!')
Use an XPath and limit the products you get. If you get 50 products each time, then use something like the one below:

"(//div[@class='hotProductDetails'])[position() > {} and position() <= {}]".format((page - 1) * 50, page * 50)

This will give you 50 products each time, and you increase the page number to get the next batch. Doing it all in one go will crash it anyway.
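Here is a minimal sketch of how that could slot into the click loop (names like page, batch_xpath, and postlist are my own; it assumes the page appends 50 products per click and keeps the earlier ones in the DOM, and it drops the question's lag/retry handling for brevity):

# Sketch: read each 50-product batch as it loads, instead of everything at the end
postlist = []
page = 1  # the first batch is already on the page before any click
while True:
    batch_xpath = ("(//div[@class='hotProductDetails'])"
                   "[position() > {} and position() <= {}]").format((page - 1) * 50, page * 50)
    for post in driver.find_elements_by_xpath(batch_xpath):
        postlist.append(post.text)  # read text now, while the batch is small
    try:
        driver.find_element_by_xpath('//*[@id="showmoreresult"]').click()
    except:
        break  # button gone or not clickable: nothing more to load
    time.sleep(0.5)  # crude wait for the next batch to render
    page += 1

Reading the text batch by batch keeps each extraction small, which avoids the huge one-shot grab at the end that was crashing Chrome.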