
Scraping with bs4 and selenium, each loop returns the same data

I'm pretty new to web scraping and am trying to scrape backdated data from timeanddate.com and output it to a CSV. I'm using Selenium to get the data table for each date. My code:

from bs4 import BeautifulSoup
from selenium import webdriver
import csv

def getData (url, month, year):
  driver = webdriver.Chrome('C:/Users/adam/Desktop/chromedriver.exe') 
  driver.get(url)
  Data = []
  soup = BeautifulSoup(driver.page_source, "lxml")
  for i in driver.find_element_by_id("wt-his-select").find_elements_by_tag_name("option"):
    i.click()
    table = soup.find('table', attrs={'id':'wt-his'})
    for tr in table.find('tbody').find_all('tr'):
       dict = {}
       dict['time'] = tr.find('th').text.strip()
       all_td = tr.find_all('td')
       dict['humidity'] = all_td[5].text
       Data.append(dict)

    fileName = "output_month="+month+"_year="+year+".csv"
    keys = Data[0].keys()
    with open(fileName, 'w') as result:
      dictWriter = csv.DictWriter(result, keys)
      dictWriter.writeheader()
      dictWriter.writerows(Data)

year_num = int(input("Enter your year to collect data from: "))
month_num = 1
year = str(year_num)
for i in range (0,12):
  month = str(month_num)
  url = "https://www.timeanddate.com/weather/usa/new-york/historic?month="+month+"&year="+year
  data = getData(url, month, year)
  print (data)
  month_num += 1

The table I'm trying to scrape holds weather data, and I want to get the humidity data for each day in the month.

The program cycles through the months, but the output is always the data for Mon, 1 Jan. Although the date changes in the browser, the same data is appended to the file each time (current output) rather than each new day's data being appended (desired output). I can't work out why it does this, and any help fixing it would be much appreciated.

The problem is that you parse the page only once, even though the page changes with each date selection. However, moving the parsing inside the for loop is not enough: you also need to wait until the new table has loaded before re-parsing.
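To make both points concrete without opening a browser, here is a minimal sketch that uses only the standard library. `FakePage` is a hypothetical stand-in for the Selenium driver (it is not part of Selenium): clicking a date changes its `page_source` only after a short delay, roughly the way the real table refreshes asynchronously. `wait_until` is a simplified imitation of the polling idea behind `WebDriverWait`, not its actual implementation.

```python
import time


class FakePage:
    """Hypothetical stand-in for a page whose table updates asynchronously."""

    def __init__(self, delay=0.05):
        self._delay = delay
        self._shown = "1 Jan"    # what the page currently displays
        self._pending = None     # (ready_time, new_value) set by a click

    def click(self, date):
        # The new content only becomes visible after `delay` seconds.
        self._pending = (time.monotonic() + self._delay, date)

    @property
    def page_source(self):
        if self._pending and time.monotonic() >= self._pending[0]:
            self._shown = self._pending[1]
            self._pending = None
        return f"<h2 id='wt-his-title'>{self._shown}</h2>"


def wait_until(condition, timeout=1.0, poll=0.01):
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met in time")


def scrape_stale(page, dates):
    snapshot = page.page_source    # read once, before any click
    results = []
    for date in dates:
        page.click(date)
        results.append(snapshot)   # always the first day's markup
    return results


def scrape_fresh(page, dates):
    results = []
    for date in dates:
        page.click(date)
        # Wait for the page to show the selected date, then re-read it.
        wait_until(lambda: date in page.page_source)
        results.append(page.page_source)
    return results
```

`scrape_stale` mirrors the question's code: every iteration reuses the snapshot taken before the first click. `scrape_fresh` re-reads the page only after the selected date has appeared, which is the same pattern a real Selenium wait would follow.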

Below is my solution. There are two things to note:

  1. I am making use of the WebDriverWait and expected_conditions helpers that are built into Selenium.
  2. I prefer finding by CSS selectors, which greatly simplifies the syntax. This awesome game can help you learn them.

# Necessary imports
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def getData (url, month, year):
  driver = webdriver.Chrome('C:/Users/adam/Desktop/chromedriver.exe') 
  driver.get(url)
  wait = WebDriverWait(driver, 5)  # wait up to 5 seconds for each condition
  Data = []
  for opt in driver.find_elements_by_css_selector("#wt-his-select option"):
    opt.click()
    # wait until the table title changes to selected date
    wait.until(EC.text_to_be_present_in_element((By.ID, 'wt-his-title'), opt.text))
    for tr in driver.find_elements_by_css_selector('#wt-his tbody tr'):
       row = {}
       row['time'] = tr.find_element_by_tag_name('th').text.strip()
       # nth-of-type is 1-based, so the question's all_td[5] becomes td:nth-of-type(6);
       # this is a CSS selector, so it needs find_element_by_css_selector
       row['humidity'] = tr.find_element_by_css_selector('td:nth-of-type(6)').text.strip()
       Data.append(row)
       # continue with csv handlers ...
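
The elided CSV step can reuse the question's `csv.DictWriter` approach. Below is a minimal sketch, assuming the rows for the whole month are collected first and written once after both loops have finished. `write_rows` is a hypothetical helper name, not something from the original code. Passing `newline=''` to `open` follows the `csv` module's recommendation and avoids blank lines between rows on Windows.

```python
import csv


def write_rows(rows, file_name):
    """Write a list of {'time': ..., 'humidity': ...} dicts to a CSV file."""
    if not rows:
        return  # nothing was scraped, so there is no header to derive
    with open(file_name, "w", newline="") as result:
        writer = csv.DictWriter(result, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Inside `getData`, this could be called once at function level after the loops, for example `write_rows(Data, fileName)` with `fileName` built as in the question.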
