How to scrape urls on a paginated site with Selenium / Python

I want to get some link urls from a paginated site. I am following some tutorials as I am not very familiar with Selenium (or Python).

With the loop below I can get the first URL from each page, but I need all 10 URLs on each page:

browser = webdriver.Firefox()
browser.get("http://www.scba.gov.ar/jurisprudencia/Navbar.asp?Busca=Fallos+Completos&SearchString=Inconstitucionalidad")
time.sleep(5)

x = 0
while (x < 5):
    print(browser.find_element_by_xpath('//a[contains(text(),"Completo")]')).get_attribute("href")
    browser.find_element_by_xpath("//td[2]/a").click() # Click on next button
    time.sleep(5)
    x += 1

To get all the URLs on a page I tried using find_elements_by_xpath() instead, but that function returns a list, and I get an error message saying that a list object has no attribute get_attribute.

If I remove the get attribute part, I do get the 10 lines per page, but not in a url format. I get a list for each page with this format:

[<selenium.webdriver.remote.webelement.WebElement object at 0x7f3621cc6dd0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621cc6d90>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621cc6f90>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621cc6f50>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621cc6ed0>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621c62210>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621c6a110>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621c6a690>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621c75950>, <selenium.webdriver.remote.webelement.WebElement object at 0x7f3621c75990>]

So, what is the correct way to build a loop that gets all the URLs from a page, then goes to the next page, and so on?

Any help is appreciated.

Here is the complete idea and the implementation:

  • grab the total page count from the paragraph at the bottom of the page
  • extract links from the current page
  • loop from the next page to the maximum page
  • in the loop, click the next page link and extract the links
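
The steps above can be sketched independently of Selenium. This is only a minimal illustration: scrape_all_pages, extract_links and go_to_next_page are hypothetical names, and FakeSite is an in-memory stand-in that simulates a paginated site. In real use the two callables would wrap the browser calls.

```python
def scrape_all_pages(extract_links, go_to_next_page, max_pages):
    """Collect links page by page.

    extract_links and go_to_next_page are hypothetical callables standing in
    for the browser operations (reading hrefs, clicking the next-page link).
    """
    pages = [extract_links()]              # links from page 1
    for _ in range(2, max_pages + 1):      # pages 2..max_pages
        go_to_next_page()
        pages.append(extract_links())
    return pages


class FakeSite(object):
    """In-memory stand-in for a paginated site (demo only)."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def extract_links(self):
        return list(self.pages[self.index])

    def go_to_next_page(self):
        self.index += 1


site = FakeSite([["a.doc", "b.doc"], ["c.doc"], ["d.doc", "e.doc"]])
result = scrape_all_pages(site.extract_links, site.go_to_next_page, max_pages=3)
print(result)
```

The first page is read before the loop because no click is needed to reach it; every later page needs exactly one click before its links are read.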

Notes:

  • instead of using time.sleep(), it is much better to explicitly wait for the desired element
  • to extract the total number of pages (1910 in this case), I'm using the regular expression \d+ de (\d+), where \d+ matches one or more digits and the capturing group (\d+) captures the page count
  • to get the href attribute from multiple elements, loop over them and call get_attribute() on each element (done with a list comprehension below)
  • I am not completely sure which links you want to grab, but I'm assuming they are the links to the files at the bottom of every block on a page
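
The page-count extraction from the notes can be tried on its own, without a browser. This is a minimal sketch: parse_max_pages is a hypothetical helper name, and the sample text "1 de 1910" is an assumed example of what the paragraph contains (only the "N de M" shape and the value 1910 come from the notes).

```python
import re


def parse_max_pages(text):
    """Return the total page count from text shaped like 'N de M'.

    parse_max_pages is a hypothetical helper; the 'N de M' shape is taken
    from the regular expression described in the notes.
    """
    match = re.search(r'\d+ de (\d+)', text)
    if match is None:
        raise ValueError("no 'N de M' page counter found in %r" % text)
    return int(match.group(1))


# Assumed sample text; the real paragraph on the site may differ.
sample = "1 de 1910"
print(parse_max_pages(sample))
```

Any flag such as re.UNICODE belongs in the re.search() call. If it is passed as the second argument to int(), it is interpreted as a numeric base.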

Code:

import re

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def extract_data(browser):
    links = browser.find_elements_by_xpath('//i[@class="RecordStats"]/a')
    return [link.get_attribute('href') for link in links]


browser = webdriver.Firefox()
browser.get("http://www.scba.gov.ar/jurisprudencia/Navbar.asp?Busca=Fallos+Completos&SearchString=Inconstitucionalidad")

# get max pages
element = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, "//p[@class='c'][last()]")))
max_pages = int(re.search(r'\d+ de (\d+)', element.text, re.UNICODE).group(1))

# extract from the current (1) page
print "Page 1"
print extract_data(browser)

# loop over the rest of the pages
for page in xrange(2, max_pages + 1):
    print "Page %d" % page

    # click the "next page" link (click() returns None, so there is nothing to store)
    browser.find_element_by_xpath("//table[last()]//td[last()]/a").click()

    print extract_data(browser)
    print "-----"

Prints:

Page 1
[u'http://www.scba.gov.ar/falloscompl/scba/2007/03-16/iniciales.doc', u'http://www.scba.gov.ar/falloscompl/scba/inter/2005/05-26/iniciales.doc', u'http://www.scba.gov.ar/falloscompl/scba/inter/2012/10-31/inicialesb.doc', u'http://www.scba.gov.ar/falloscompl/scba/inter/2006/11-08/i68854.doc', u'http://www.scba.gov.ar/falloscompl/scba/inter/2010/12-15/inicialesrp.doc', u'http://www.scba.gov.ar/falloscompl/scba/2012/07-04/a70660.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/11-24/a69656.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/11-24/a69691.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/11-24/a69693.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/11-24/a69772.doc']
Page 2
[u'http://www.scba.gov.ar/falloscompl/scba/2010/11-24/a69877.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-14/a68974.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a68978.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a68979.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a68982.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a68983.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a69181.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a69588.doc', u'http://www.scba.gov.ar/falloscompl/scba/2004/12-09/p72338.doc', u'http://www.scba.gov.ar/falloscompl/scba/2006/08-16/iniciales.doc']
-----
Page 3
[u'http://www.scba.gov.ar/falloscompl/scba/inter/2010/12-15/rp108872.doc', u'http://www.scba.gov.ar/falloscompl/scba/inter/2007/02-14/i69014-2.doc', u'http://www.scba.gov.ar/falloscompl/scba/2011/05-04/a68445.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a68976.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a68977.doc', u'http://www.scba.gov.ar/falloscompl/scba/2010/07-07/a68981.doc', u'http://www.scba.gov.ar/falloscompl/scba/2004/12-10/iniciales.doc', u'http://www.scba.gov.ar/falloscompl/scba/2014/11-20/iniciales.doc', u'http://www.scba.gov.ar/falloscompl/scba/inter/2013/08-21/a72539.doc', u'http://www.scba.gov.ar/falloscompl/scba/2004/06-23/iniciales.doc']
-----
...

find_elements_by_xpath returns a list of WebElement objects, and a list has no get_attribute method. You need to call get_attribute on each individual element in that list:

import time

from selenium import webdriver

browser = webdriver.Firefox()
browser.get("http://www.scba.gov.ar/jurisprudencia/Navbar.asp?Busca=Fallos+Completos&SearchString=Inconstitucionalidad")
time.sleep(5)

# find_elements_by_xpath (plural) returns a list of all matching elements
elements = browser.find_elements_by_xpath('//a[contains(text(),"Completo")]')
for element in elements:
    print(element.get_attribute("href"))
browser.find_element_by_xpath("//td[2]/a").click()  # click on the next button
time.sleep(5)
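
As an alternative to fixed time.sleep(5) calls, the code can poll for a condition and continue as soon as it holds, which is the idea behind Selenium's WebDriverWait. The following is a simplified sketch of that idea in plain Python, not Selenium's implementation; wait_until and ready are hypothetical names. With Selenium itself, WebDriverWait(browser, 10).until(...) would normally be used instead.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Simplified, hypothetical stand-in for the idea behind WebDriverWait;
    this is not Selenium's implementation.
    """
    deadline = time.time() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.time() >= deadline:
            raise RuntimeError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)


# Demo: a condition that becomes true on the third check.
calls = []


def ready():
    calls.append(1)
    return len(calls) >= 3


print(wait_until(ready, timeout=2.0, poll_interval=0.01))
```

Unlike a fixed sleep, the wait ends as soon as the condition is met and fails loudly if it never is.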
