
Python Selenium Scraping Crashes, Can I Find Elements For Part of The Web Page?

I am trying to scrape some data from a website. This website has a 'load more products' button. I'm using:

driver.find_element_by_xpath("""//*[@id="showmoreresult"]""").click()

to hit the button, and this loops for a set number of iterations.
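
If the button is slow to reappear between clicks, a sturdier version of that click uses Selenium's explicit waits (WebDriverWait and expected_conditions are standard Selenium helpers); a minimal sketch, assuming the same button XPath:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the button to become clickable, then click it
WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.XPATH, '//*[@id="showmoreresult"]'))
).click()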

The problem I'm running into is that once those iterations have completed, I want to extract text from the webpage using:

posts = driver.find_elements_by_class_name("hotProductDetails")

However, this seems to crash Chrome, and thus I can get no data out. What I'd like to do is populate posts with the new products that have loaded after each iteration.

After 'Load More' has been clicked, I want to grab the text from the 50 products that have just loaded, append it to my list, and continue.

I can run the line posts = driver.find_elements_by_class_name("hotProductDetails") within each iteration, but it grabs every element on the page every time, which really slows down the process.
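
Part of that slowdown is that each .text read is a separate WebDriver round-trip. One workaround, not from the original post, is to pull all the text in a single JavaScript call; a minimal sketch, reusing the hotProductDetails class name from above:

# fetch every product's text in one round-trip via the browser itself
texts = driver.execute_script(
    "return Array.from(document.querySelectorAll('.hotProductDetails'))"
    "       .map(function (el) { return el.innerText; });")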

Is there any way of achieving this in Selenium, or am I limited by this library?

This is the full script:

import csv
import time
from selenium import webdriver
import pandas as pd

def CeXScrape():
    print('Loading Chrome...')
    chromepath = r"C:\Users\leonK\Documents\Python Scripts\chromedriver.exe"
    driver = webdriver.Chrome(chromepath)

    driver.get(url)  # 'url' is the product listing page (not defined in the post)

    print('Prepping Webpage...')
    time.sleep(2)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Click 'load more' up to 1000 times; stop after two consecutive failed clicks.
    y = 0
    BreakClause = ExceptCheck = False
    while y < 1000 and not BreakClause:
        y += 1
        time.sleep(0.5)
        try:
            driver.find_element_by_xpath("""//*[@id="showmoreresult"]""").click()
            ExceptCheck = False  # a successful click resets the failure flag
            print('Load Count', y, '...')
        except Exception:
            if ExceptCheck: BreakClause = True  # second failure in a row: give up
            else: ExceptCheck = True            # first failure: wait and retry
            print('Load Count', y, '...Lag...')
            time.sleep(2)
            continue

    # Grabbing every element from the fully loaded page in one go is the
    # step that crashes Chrome.
    print('Grabbing Elements...')
    posts = driver.find_elements_by_class_name("hotProductDetails")
    cats = driver.find_elements_by_class_name("superCatLink")

    print('Generating lists...')
    catlist = []
    postlist = []
    for cat in cats: catlist.append(cat.text)
    print('Categories Complete...')
    for post in posts: postlist.append(post.text)
    print('Products Complete...')
    return postlist, catlist

prods, cats = CeXScrape()

print('Extracting Lists...')

cat = []
subcat = []
prodname = []
sellprice = []
buycash = []
buyvoucher = []

for c in cats:
    parts = c.split('/')  # links look like 'Category/SubCategory'
    cat.append(parts[0])
    subcat.append(parts[1])

for p in prods:
    parts = p.split('\n')  # split once instead of re-splitting for each field
    prodname.append(parts[0])
    sellprice.append(parts[2])
    if 'WeBuy' in p:
        buycash.append(parts[4])
        buyvoucher.append(parts[6])
    else:
        buycash.append('NaN')
        buyvoucher.append('NaN')

print('Generating Dataframe...')

df = pd.DataFrame(
        {'Category' : cat,
         'Sub Category' : subcat,
         'Product Name' : prodname,
         'Sell Price' : sellprice,
         'Cash Buy Price' : buycash,
         'Voucher Buy Price' : buyvoucher})

print('Writing to csv...')

df.to_csv('Data.csv', sep=',', encoding='utf-8')

print('Completed!')

Use an XPath and limit the products you get. If you get 50 products on each load, then use something like the expression below:

"(//div[@class='hotProductDetails'])[position() > {} and position() <= {}])".format ((page -1 ) * 50, page * 50)

This will give you 50 products every time, and you increase the page number to get the next batch. Doing it all in one go will crash it anyway.
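
Wired into the question's loop, a minimal sketch (it reuses the driver, class name and button XPath from the question's script, and assumes each click loads a batch of exactly 50 products):

# after reading each batch of 50, click 'load more' to fetch the next one;
# 'driver' and 'time' are set up as in the question's script
postlist = []
page = 1
while True:
    # select only products (page - 1) * 50 + 1 .. page * 50 -- the latest batch
    xpath = ("(//div[@class='hotProductDetails'])"
             "[position() > {} and position() <= {}]").format((page - 1) * 50, page * 50)
    for post in driver.find_elements_by_xpath(xpath):
        postlist.append(post.text)
    try:
        driver.find_element_by_xpath('//*[@id="showmoreresult"]').click()
        time.sleep(0.5)  # give the next batch a moment to render
        page += 1
    except Exception:
        break  # button gone: every batch has been read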
