
Improve Web Scraping for Elements in a Container Using Selenium

I am using Firefox, and my code works fine, except that it's very slow. I prevent images from loading, just to speed things up a little:

import time

from selenium import webdriver
from selenium.common.exceptions import (
    ElementClickInterceptedException,
    NoSuchElementException,
)

firefox_profile = webdriver.FirefoxProfile()
firefox_profile.set_preference('permissions.default.image', 2)
firefox_profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false')
firefox_profile.set_preference("browser.privatebrowsing.autostart", True)
driver = webdriver.Firefox(firefox_profile=firefox_profile)

but the performance is still slow. I have tried going headless, but unfortunately it did not work: I get NoSuchElement errors. So is there any way to speed up Selenium web scraping? I can't use Scrapy, because this is a dynamic scrape: I need to click the "next" button repeatedly until no clickable button remains, and I also need to click pop-up buttons.

Here is a snippet of the code:

a = []
b = []
c = []
d = []
e = []
f = []
while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        time.sleep(2)
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        time.sleep(2)
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for j in B:
            b.append(j.text)
        time.sleep(3)
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for k in C:
            c.append(k.text)
        time.sleep(3)
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for l in D:
            d.append(l.text)
        time.sleep(3)
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for m in E:
            e.append(m.text)

    try:
        time.sleep(2)
        next_button = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next_button.click()
        time.sleep(2)
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException):  # "as e" here would clobber the e list above
        break

Here is an edited version, but the speed does not improve.

========================================================================
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH,'.//*[contains(@class,"ui_bubble_rating bubble_")]')))
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH,'.//*[contains(@class,"recommend-titleInline noRatings")]')))
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for i in B:
            b.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH,'.//*[contains(@class,"noQuotes")]')))
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for i in C:
            c.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH,'.//*[contains(@class,"ratingDate")]')))
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for i in D:
            d.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH,'.//*[contains(@class,"partial_entry")]')))
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for i in E:
            e.append(i.text)

    try:
        #time.sleep(2)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH,'.//*[contains(@class,"nav next taLnk ui_button primary")]')))
        next_button = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next_button.click()
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH,'.//*[contains(@class,"taLnk ulBlueLinks")]')))
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException):
        break

For dynamic webpages (pages rendered or augmented using JavaScript), I suggest you use scrapy-splash.

Not that you can't use Selenium, but for scraping purposes scrapy-splash is a better fit.

Also, if you have to use Selenium for scraping, a good idea is to use the headless option. You can also use Chrome; in some benchmarks I ran a while back, headless Chrome was sometimes faster than headless Firefox.

Also, rather than sleep, it is better to use WebDriverWait with an expected condition, since it waits only as long as necessary, whereas a thread sleep makes you wait for the full specified time.
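The difference is easy to demonstrate without a browser: below is a plain-Python sketch of the poll-until-ready strategy WebDriverWait uses, with a stub condition standing in for an expected condition:

```python
import time

def wait_until(condition, timeout=10.0, poll=0.1):
    """Poll `condition` until it returns truthy or `timeout` elapses
    (the same strategy WebDriverWait uses under the hood)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError('condition not met within %.1fs' % timeout)

# A stub "page" that becomes ready after ~0.3 seconds:
ready_at = time.monotonic() + 0.3
start = time.monotonic()
wait_until(lambda: time.monotonic() >= ready_at, timeout=5)
elapsed = time.monotonic() - start
# Returns almost immediately after readiness, instead of always
# burning the full 5 seconds the way time.sleep(5) would.
```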

Edit: I'm adding this as an edit while trying to answer @QHarr, as the answer is pretty long.

This is just a suggestion to evaluate scrapy-splash.

I gravitate towards Scrapy because of the whole ecosystem built around scraping: middleware, proxies, deployment, scheduling, scaling. So basically, if you are doing some serious scraping, Scrapy might be the better starting point. That suggestion comes with a caveat.

As for speed, I can't give an objective answer, as I have never benchmarked Scrapy against Selenium on a project of any size.

But I would assume you will get more or less comparable times on a serial run if you are doing the same things, since in most cases the time is spent waiting for responses.

If you are scraping any considerable number of items, the speed-up generally comes from parallelising the requests, and, where rendering is not necessary, from falling back to plain HTTP requests and responses rather than rendering the page in a user agent.

Also, anecdotally, some in-page actions can be performed with the underlying HTTP request/response directly. So if time is a priority, you should look to get as much as possible done via plain HTTP requests and responses.
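As a sketch of the parallel, plain-HTTP approach (`fetch_page` is a hypothetical stand-in for a real call such as `requests.get(url).text`):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_page(url):
    """Stand-in for a real HTTP call such as requests.get(url).text."""
    time.sleep(0.2)  # simulate network latency
    return f'<html>{url}</html>'

urls = [f'https://example.com/reviews?page={n}' for n in range(8)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(fetch_page, urls))
elapsed = time.monotonic() - start
# The 8 requests overlap, so the total time is close to one request's
# latency rather than 8 * 0.2s as it would be on a serial run.
```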
