
Webscraping using Selenium in Python

I am trying to scrape data from the Sunshine List website ( http://www.sunshinelist.ca/ ) using the BeautifulSoup library and the Selenium package (in order to deal with the 'Next' button on the webpage). I know there are several related posts, but I just can't identify where and how I should explicitly ask the driver to wait.

Error: StaleElementReferenceException: Message: The element reference is stale; either the element is no longer attached to the DOM or the page has been refreshed

This is the code I have written:

import numpy as np
import pandas as pd
import requests
import re
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

ffx_bin = FirefoxBinary(r'C:\Users\BhagatM\AppData\Local\Mozilla Firefox\firefox.exe')
ffx_caps = DesiredCapabilities.FIREFOX
ffx_caps['marionette'] = True
driver = webdriver.Firefox(capabilities=ffx_caps,firefox_binary=ffx_bin)
driver.get("http://www.sunshinelist.ca/")
driver.maximize_window()

tablewotags1 = []

# Collect the text of every table row on the current page, then click 'Next' until it is gone.
while True:
    divs = driver.find_element_by_id('datatable-disclosures')
    divs1 = divs.find_elements_by_tag_name('tbody')

    for d1 in divs1:
        div2 = d1.find_elements_by_tag_name('tr')
        for d2 in div2:
            tablewotags1.append(d2.text)

    try:
        driver.find_element_by_link_text('Next →').click()
    except NoSuchElementException:
        break

# Each record spans 10 lines of row text; slice out the fields of interest.
year1 = tablewotags1[0::10]
name1 = tablewotags1[3::10]
position1 = tablewotags1[4::10]
employer1 = tablewotags1[1::10]

df1 = pd.DataFrame({'Year': year1, 'Name': name1, 'Position': position1, 'Employer': employer1})
df1.to_csv('Sunshine List-1.csv', index=False)
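
For reference, one place an explicit wait could go is around the click on 'Next': keep a handle on the current table, click, wait for that handle to go stale, and then wait for the table to be present again before re-reading it. This is only a minimal sketch that reuses the WebDriverWait, expected_conditions (EC) and By imports already in the script; the 10-second timeout is an assumption:

old_table = driver.find_element_by_id('datatable-disclosures')
driver.find_element_by_link_text('Next →').click()
# Wait for the old table element to detach, then for the refreshed table to appear.
WebDriverWait(driver, 10).until(EC.staleness_of(old_table))
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'datatable-disclosures'))
)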

I think you just need to point to the correct Firefox binary. Also, which version of Firefox are you using? It looks like one of the newer versions; if that's the case, this should do it.

ffx_bin = FirefoxBinary(r'pathtoyourfirefox')
ffx_caps = DesiredCapabilities.FIREFOX
ffx_caps['marionette'] = True
driver = webdriver.Firefox(capabilities=ffx_caps,firefox_binary=ffx_bin)

Cheers

EDIT: To answer your new query, "why is it not writing the CSV", you could do it like this:

import csv   # You are missing this import
ls_general_list = []

def csv_for_me(list_to_csv):
    # 'pathtocsv' is a placeholder for wherever you want the file written.
    with open(pathtocsv, 'a', newline='') as csvfile:
        sw = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        for row in list_to_csv:
            sw.writerow(row)

Then replace this line in your code, df=pd.DataFrame({'Year':year,'Name':name,'Position':position,'Employer':employer})

with this one, ls_general_list.append((year, name, position, employer)) (the fields in Year, Name, Position, Employer order, so csv.writer writes them as one row),

and then call csv_for_me(ls_general_list)
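
A minimal end-to-end sketch of that flow; the output path 'sunshine.csv' and the sample row values are assumptions, not from the original answer:

pathtocsv = 'sunshine.csv'   # assumed output path (left as a placeholder in the answer)

# Hypothetical sample row in (year, name, position, employer) order:
ls_general_list.append(('2016', 'Jane Doe', 'Professor', 'Example University'))
csv_for_me(ls_general_list)  # appends each collected row to the CSV file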

Please accept the answer if it's satisfactory; you should now have a CSV.
