How to scrape urls on a paginated site with Selenium / Python

Question

I want to get some link urls from a paginated site. I am following some tutorials as I am not very familiar with Selenium (or Python).

Anyway, with the loop below I am able to get the first url from each page, but there are 10 urls per page I need to get:

browser = webdriver.Firefox()
browser.get("http://www.scba.gov.ar/jurisprudencia/Navbar.asp?Busca=Fallos+Completos&SearchString=Inconstitucionalidad")
time.sleep(5)

x = 0
while (x < 5):
    print(browser.find_element_by_xpath('//a[contains(text(),"Completo")]')).get_attribute("href")
    browser.find_element_by_xpath("//td[2]/a").click() # Click on next button
    time.sleep(5)
    x += 1

To get all the urls per page I tried using find_elements_by_xpath() instead, but that function returns a list, and I get an error message saying the list doesn't have the attribute get_attribute.

If I remove the get_attribute part, I do get the 10 items per page, but not in a url format. I get a list for each page with this format:

[<selenium.webdriver.remote.webelement.WebElement object at 0x7f3621cc6dd0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621cc6d90>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621cc6f90>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621cc6f50>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621cc6ed0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621c62210>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621c6a110>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621c6a690>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621c75950>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621c75990>]

So, what is the correct way to build a loop that gets the urls, then goes to the next page, and so on?

Any help is appreciated.

Answer 1

Here is the complete idea and the implementation:

grab the maximum page count from the paragraph at the bottom of the page
extract links from the current page
loop from the next page to the maximum page
in the loop, click the next page link and extract the links

Notes:

instead of time.sleep(), it is much better to wait explicitly for the desired element
to extract the maximum number of pages (1910 in this case), I'm using the regular expression \d+ de (\d+), where \d+ matches one or more digits and the capturing group (\d+) holds the page count
to get the href attribute from multiple elements, you just need to loop over them and call get_attribute() on each element (done with a list comprehension below)
I am not completely sure which links you want to grab, but I'm assuming they are the links to the files at the bottom of every block on a page
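To make the regular expression note concrete, here is a minimal sketch of the page-count extraction on its own, without a browser. The helper name parse_max_pages and the sample text "Hoja 1 de 1910" are assumptions for illustration; the real wording of the paragraph on the site may differ.

```python
import re


def parse_max_pages(text):
    """Return the number after 'de' in text like '<current> de <total>', or None."""
    match = re.search(r"\d+ de (\d+)", text)
    # re.search() returns None when the pattern is not found,
    # so check before reading the capturing group.
    if match is None:
        return None
    return int(match.group(1))


# Hypothetical sample text; not copied from the actual site.
print(parse_max_pages("Hoja 1 de 1910"))  # 1910
```

Returning None when nothing matches lets the caller decide what to do if the page layout changes, rather than failing with an AttributeError on .group().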

Code:

import re

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def extract_data(browser):
    links = browser.find_elements_by_xpath('//i[@class="RecordStats"]/a')
    return [link.get_attribute('href') for link in links]


browser = webdriver.Firefox()
browser.get("http://www.scba.gov.ar/jurisprudencia/Navbar.asp?Busca=Fallos+Completos&SearchString=Inconstitucionalidad")

# get max pages
element = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, "//p[@class='c'][last()]")))
max_pages = int(re.search(r'\d+ de (\d+)', element.text, re.UNICODE).group(1))

# extract from the current (1) page
print "Page 1"
print extract_data(browser)

# loop over the rest of the pages
for page in xrange(2, max_pages + 1):
    print "Page %d" % page

    # click the "next page" link (click() returns nothing, so there is no value to keep)
    browser.find_element_by_xpath("//table[last()]//td[last()]/a").click()

    print extract_data(browser)
    print "-----"

Prints:

Page 1
[u'http://www.scba.gov.ar/falloscompl/scba/2007/03-16/iniciales.doc', u'http://www.scba.gov.ar/falloscompl/scba/inter/2005/05-26/iniciales.doc', u'http://www.scba.gov.ar/falloscompl/scba/inter/2012/10-31/inicialesb.doc', u'http://www.scba.gov.ar/falloscompl/scba/inter/2006/11-08/i68854.doc', u'http://www.scba.gov.ar/falloscompl/scba/inter/2010/12-15/inicialesrp.doc', u'http://www.scba.gov.ar/falloscompl/scba/2012/07-04/a70660.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/11-24/a69656.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/11-24/a69691.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/11-24/a69693.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/11-24/a69772.doc']
Page 2
[u'http://www.scba.gov.ar/falloscompl/scba/2010/11-24/a69877.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-14/a68974.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a68978.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a68979.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a68982.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a68983.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a69181.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a69588.doc', u'http://www.scba.gov.ar/falloscompl/scba/2004/12-09/p72338.doc', u'http://www.scba.gov.ar/falloscompl/scba/2006/08-16/iniciales.doc']
-----
Page 3
[u'http://www.scba.gov.ar/falloscompl/scba/inter/2010/12-15/rp108872.doc', u'http://www.scba.gov.ar/falloscompl/scba/inter/2007/02-14/i69014-2.doc', u'http://www.scba.gov.ar/falloscompl/scba/2011/05-04/a68445.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a68976.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a68977.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a68981.doc', u'http://www.scba.gov.ar/falloscompl/scba/2004/12-10/iniciales.doc', u'http://www.scba.gov.ar/falloscompl/scba/2014/11-20/iniciales.doc', u'http://www.scba.gov.ar/falloscompl/scba/inter/2013/08-21/a72539.doc', u'http://www.scba.gov.ar/falloscompl/scba/2004/06-23/iniciales.doc']
-----
...
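As a rough illustration of the "explicit wait" idea from the notes, the sketch below polls a condition until it becomes true or a timeout expires, instead of sleeping for a fixed time. This is not Selenium's implementation (in the answer's code that job is done by WebDriverWait); wait_until and ready_on_third_call are hypothetical names used only for this example.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.time() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.time() >= deadline:
            raise RuntimeError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll)


# Stand-in for "the page has finished loading": becomes true on the third check.
calls = []


def ready_on_third_call():
    calls.append(1)
    return len(calls) >= 3


wait_until(ready_on_third_call, timeout=2.0, poll=0.01)
print(len(calls))  # 3
```

The advantage over time.sleep(5) is that the wait ends as soon as the condition holds, and a page that never loads produces an error instead of silently returning stale data.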

Answer 2

find_elements_by_xpath returns a list of web elements, and a list does not have a get_attribute method. You need to call get_attribute on each individual element in that list:

import time

from selenium import webdriver

browser = webdriver.Firefox()
browser.get("http://www.scba.gov.ar/jurisprudencia/Navbar.asp?Busca=Fallos+Completos&SearchString=Inconstitucionalidad")
time.sleep(5)

# find_elements (plural) returns a list with every matching link
elements = browser.find_elements_by_xpath('//a[contains(text(),"Completo")]')
for element in elements:
    print(element.get_attribute("href"))
browser.find_element_by_xpath("//td[2]/a").click()  # Click on next button
time.sleep(5)
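The same point can be seen without a browser. In the sketch below, FakeLink is a hypothetical stand-in for a Selenium WebElement (it only mimics get_attribute), and hrefs is an illustrative helper name; the example.com urls are placeholders.

```python
class FakeLink(object):
    """Hypothetical stand-in for a WebElement; only get_attribute() is mimicked."""

    def __init__(self, href):
        self._attrs = {"href": href}

    def get_attribute(self, name):
        return self._attrs.get(name)


def hrefs(elements):
    """Return the href of every element in the list."""
    return [element.get_attribute("href") for element in elements]


links = [FakeLink("http://example.com/a.doc"), FakeLink("http://example.com/b.doc")]

# The list itself has no get_attribute(); only its items do.
print(hasattr(links, "get_attribute"))  # False
print(hrefs(links))
```

With a real driver, the list returned by find_elements_by_xpath would take the place of links.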

2 answers

solution1
5 ACCEPTED 2015-01-07 01:57:50

solution2
1 2015-01-07 01:58:31
