
Web scraping using Selenium in Python

I am trying to scrape data from the Sunshine List website (http://www.sunshinelist.ca/) using the BeautifulSoup library and the Selenium package (in order to handle the 'Next' button on the page). I know there are several related posts, but I just can't work out where and how I should explicitly ask the driver to wait.

Error: StaleElementReferenceException: Message: The element reference is stale: either the element is no longer attached to the DOM or the page has been refreshed

This is the code I have written:

import numpy as np
import pandas as pd
import requests
import re
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

ffx_bin = FirefoxBinary(r'C:\Users\BhagatM\AppData\Local\Mozilla Firefox\firefox.exe')
ffx_caps = DesiredCapabilities.FIREFOX
ffx_caps['marionette'] = True
driver = webdriver.Firefox(capabilities=ffx_caps,firefox_binary=ffx_bin)
driver.get("http://www.sunshinelist.ca/")
driver.maximize_window()

tablewotags1=[]

while True:
    divs = driver.find_element_by_id('datatable-disclosures')
    divs1=divs.find_elements_by_tag_name('tbody')

    for d1 in divs1:
        div2=d1.find_elements_by_tag_name('tr')
        for d2 in div2:
            tablewotags1.append(d2.text)

    try:
        driver.find_element_by_link_text('Next →').click()
    except NoSuchElementException:
        break

year1=tablewotags1[0::10]
name1=tablewotags1[3::10]
position1=tablewotags1[4::10]
employer1=tablewotags1[1::10]  


df1=pd.DataFrame({'Year':year1,'Name':name1,'Position':position1,'Employer':employer1})
df1.to_csv('Sunshine List-1.csv', index=False)
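
From the related posts I have read, the usual suggestion seems to be an explicit wait around the 'Next →' click, something like the sketch below, but I am not sure where it belongs in my loop:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 10 seconds for the 'Next →' link to become clickable before clicking it.
wait = WebDriverWait(driver, 10)
next_link = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'Next →')))
next_link.click()

# Re-locate the table afterwards, since the old reference goes stale
# once the page content is replaced.
divs = wait.until(EC.presence_of_element_located((By.ID, 'datatable-disclosures')))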

I think you just need to point to the correct Firefox binary. Also, which version of Firefox are you using? It looks like it's one of the newer versions; if that's the case, this should do:

ffx_bin = FirefoxBinary(r'pathtoyourfirefox')
ffx_caps = DesiredCapabilities.FIREFOX
ffx_caps['marionette'] = True
driver = webdriver.Firefox(capabilities=ffx_caps,firefox_binary=ffx_bin)
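
(On newer Selenium releases, FirefoxBinary and DesiredCapabilities are being phased out in favour of Options; a rough equivalent, assuming geckodriver is on your PATH, would be:)

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

opts = Options()
opts.binary_location = r'pathtoyourfirefox'   # same placeholder path as above
driver = webdriver.Firefox(options=opts)      # geckodriver must be on PATH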

Cheers

EDIT: To answer your new query ("why is it not writing the CSV?"), you should do it like this:

import csv   # You are missing this import
ls_general_list = []

def csv_for_me(list_to_csv):
    # pathtocsv is the path to your output file
    with open(pathtocsv, 'a', newline='') as csvfile:
        sw = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        for line in list_to_csv:
            sw.writerow(line)   # one CSV row per tuple

Then, in your code, replace this line: df1=pd.DataFrame({'Year':year1,'Name':name1,'Position':position1,'Employer':employer1})

with this one: ls_general_list.extend(zip(year1, name1, position1, employer1))

and then write the file like this: csv_for_me(ls_general_list)
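
Put together, the tail end of the script would look roughly like this (a sketch; pathtocsv is still a placeholder you have to point at your output file):

pathtocsv = 'Sunshine List-1.csv'   # placeholder; use whatever output path you want

ls_general_list = []
ls_general_list.extend(zip(year1, name1, position1, employer1))   # one tuple per person

csv_for_me(ls_general_list)   # each tuple becomes one row in the CSV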

Please accept the answer if it's satisfactory; you now have a CSV.
