
Scraping with bs4 and selenium, each loop returns the same data

I'm pretty new to web scraping and am trying to scrape backdated data from timeanddate.com and write it to a CSV file. I'm using Selenium to load the data table for each date. My code:

from bs4 import BeautifulSoup
from selenium import webdriver
import csv

def getData (url, month, year):
  driver = webdriver.Chrome('C:/Users/adam/Desktop/chromedriver.exe') 
  driver.get(url)
  Data = []
  soup = BeautifulSoup(driver.page_source, "lxml")
  for i in driver.find_element_by_id("wt-his-select").find_elements_by_tag_name("option"):
    i.click()
    table = soup.find('table', attrs={'id':'wt-his'})
    for tr in table.find('tbody').find_all('tr'):
       dict = {}
       dict['time'] = tr.find('th').text.strip()
       all_td = tr.find_all('td')
       dict['humidity'] = all_td[5].text
       Data.append(dict)

    fileName = "output_month="+month+"_year="+year+".csv"
    keys = Data[0].keys()
    with open(fileName, 'w') as result:
      dictWriter = csv.DictWriter(result, keys)
      dictWriter.writeheader()
      dictWriter.writerows(Data)

year_num = int(input("Enter your year to collect data from: "))
month_num = 1
year = str(year_num)
for i in range (0,12):
  month = str(month_num)
  url = "https://www.timeanddate.com/weather/usa/new-york/historic?month="+month+"&year="+year
  data = getData(url, month, year)
  print (data)
  month_num += 1

The table I'm trying to scrape data from is weather data and I want to get the humidity data from each day in the month.

The program cycles through the months, but the output for every date is the data for Mon, 1 Jan. Although the date changes in the browser, the same data is appended to the file each time (current output) rather than the data for each new day (desired output). I can't work out why it does this, and any help fixing it would be much appreciated.

The problem is that you parse the page source only once, before the loop, even though the page changes with each date selection. However, moving the parsing inside the for loop is not enough on its own: you also need to wait until the new table has loaded before re-parsing.
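Both points can be illustrated with a toy example that uses no browser at all. `FakePage` below is a hypothetical stand-in (it is not a Selenium class) for a page whose HTML changes a short while after a selection is made, and `wait_until` is a simplified polling helper written for this sketch. A snapshot taken before the selection never changes, while polling until the expected text appears and then reading again returns the fresh content.

```python
import time


class FakePage:
    """Hypothetical stand-in for a browser page; not a Selenium class."""

    def __init__(self, load_delay=0.2):
        self.load_delay = load_delay
        self._current = "Mon, 1 Jan"
        self._pending = None
        self._ready_at = 0.0

    def select(self, date_label):
        # Simulate a click: the new content shows up only after load_delay.
        self._pending = date_label
        self._ready_at = time.monotonic() + self.load_delay

    @property
    def page_source(self):
        # Swap in the pending content once the simulated load has finished.
        if self._pending is not None and time.monotonic() >= self._ready_at:
            self._current, self._pending = self._pending, None
        return f"<div id='title'>{self._current}</div>"


def wait_until(condition, timeout=2.0, poll=0.01):
    """Poll condition() until it is truthy or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll)
    return False


page = FakePage()
stale_snapshot = page.page_source  # taken once, before any selection

page.select("Tue, 2 Jan")
too_early = page.page_source       # read right after the click, most likely before the update
loaded = wait_until(lambda: "Tue, 2 Jan" in page.page_source)
fresh = page.page_source           # read again after waiting
```

Here `stale_snapshot` plays the role of the `soup` object built before the loop in the question: no amount of clicking afterwards changes what it contains.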

Below is my solution. There are two things to note:

  1. I am using WebDriverWait and expected_conditions, both of which ship with Selenium.
  2. I prefer finding elements by CSS selectors, which greatly simplifies the syntax. This awesome game can help you learn them.
# Necessary imports
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def getData (url, month, year):
  driver = webdriver.Chrome('C:/Users/adam/Desktop/chromedriver.exe')
  driver.get(url)
  wait = WebDriverWait(driver, 5)
  Data = []
  for opt in driver.find_elements_by_css_selector("#wt-his-select option"):
    opt.click()
    # wait until the table title changes to the selected date
    wait.until(EC.text_to_be_present_in_element((By.ID, 'wt-his-title'), opt.text))
    # read the rows from the live page on every iteration, not from a stale snapshot
    for tr in driver.find_elements_by_css_selector('#wt-his tbody tr'):
      row = {}  # renamed from dict to avoid shadowing the built-in
      row['time'] = tr.find_element_by_tag_name('th').text.strip()
      # Note that I replaced 5 with 6 as nth-of-type starts indexing from 1
      row['humidity'] = tr.find_element_by_css_selector('td:nth-of-type(6)').text.strip()
      Data.append(row)
  # continue with csv handlers ...
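
The CSV step elided above could look something like the sketch below. It assumes `Data` is a list of dictionaries with `time` and `humidity` keys, as built in the loop. The helper name `write_rows`, the sample rows, and the temporary output path are made up for illustration.

```python
import csv
import os
import tempfile


def write_rows(rows, file_name):
    """Write a list of dicts to file_name as CSV; return the number of rows written."""
    if not rows:
        return 0  # nothing scraped, so there are no keys to build a header from
    # newline="" stops the csv module from emitting blank lines on Windows
    with open(file_name, "w", newline="", encoding="utf-8") as result:
        writer = csv.DictWriter(result, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Made-up sample rows in the same shape as Data
sample = [
    {"time": "00:51", "humidity": "60%"},
    {"time": "01:51", "humidity": "62%"},
]
out_path = os.path.join(tempfile.gettempdir(), "output_month=1_year=2018.csv")
written = write_rows(sample, out_path)
```

Calling a helper like this once, after the loop over the options has finished, avoids rewriting the file on every date selection.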


 