Scraping with bs4 and selenium, each loop returns the same data
I'm pretty new to web scraping and am trying to scrape backdated data from timeanddate.com and output it to a csv. I'm using Selenium to get the data table for each date. My code:
from bs4 import BeautifulSoup
from selenium import webdriver
import csv

def getData (url, month, year):
    driver = webdriver.Chrome('C:/Users/adam/Desktop/chromedriver.exe')
    driver.get(url)
    Data = []
    soup = BeautifulSoup(driver.page_source, "lxml")
    for i in driver.find_element_by_id("wt-his-select").find_elements_by_tag_name("option"):
        i.click()
        table = soup.find('table', attrs={'id':'wt-his'})
        for tr in table.find('tbody').find_all('tr'):
            dict = {}
            dict['time'] = tr.find('th').text.strip()
            all_td = tr.find_all('td')
            dict['humidity'] = all_td[5].text
            Data.append(dict)

    fileName = "output_month="+month+"_year="+year+".csv"
    keys = Data[0].keys()
    with open(fileName, 'w') as result:
        dictWriter = csv.DictWriter(result, keys)
        dictWriter.writeheader()
        dictWriter.writerows(Data)

year_num = int(input("Enter your year to collect data from: "))
month_num = 1
year = str(year_num)
for i in range (0,12):
    month = str(month_num)
    url = "https://www.timeanddate.com/weather/usa/new-york/historic?month="+month+"&year="+year
    data = getData(url, month, year)
    print (data)
    month_num += 1
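The table I'm trying to scrape is the weather data, and I want to get the humidity data for each day of the month.
The program loops through the months, but the output is always the data for Monday, 1 January. Although the date changes in the browser, the same data is appended to the file each time (current output) instead of each new day's data being appended (desired output). I can't work out why it does this, and any help fixing it would be much appreciated.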
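The problem is that you parse the page only once, even though the page changes with every date selection. Moving the parsing inside the for loop is not enough on its own, though, because you also need to wait until the page has finished updating before you re-parse it.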
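To see why parsing once goes wrong, here is a minimal sketch that needs no browser. `FakePage` is a hypothetical stand-in for the driver (it is not part of Selenium), and the markup is simplified: its `page_source` changes whenever `select()` is called, much like the weather page changes when a date is picked. Text captured before the loop stays frozen, while text read inside the loop follows the page.

```python
import re


class FakePage:
    """Hypothetical stand-in for a browser page; not a Selenium class."""

    def __init__(self, days):
        self.days = days
        self.page_source = self._render(days[0])

    def _render(self, day):
        return f"<h2 id='wt-his-title'>{day}</h2>"

    def select(self, day):
        # Choosing a date rewrites the page, as the site's dropdown does.
        self.page_source = self._render(day)


def title_of(html):
    # Pull the text out of the title element of this simplified markup.
    return re.search(r"<h2 id='wt-his-title'>(.*?)</h2>", html).group(1)


days = ["1 Jan", "2 Jan", "3 Jan"]

# Parse once, before the loop: every iteration sees the first day.
page = FakePage(days)
snapshot = page.page_source
stale = []
for day in days:
    page.select(day)
    stale.append(title_of(snapshot))

# Re-read the source inside the loop: each iteration sees the selected day.
page = FakePage(days)
fresh = []
for day in days:
    page.select(day)
    fresh.append(title_of(page.page_source))

print(stale)  # ['1 Jan', '1 Jan', '1 Jan']
print(fresh)  # ['1 Jan', '2 Jan', '3 Jan']
```

On the real site the update happens asynchronously after the click, so re-reading alone can still pick up the old table. That is why an explicit wait is needed as well.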
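Below is my solution. There are two things to note, both visible in the code: it waits using Selenium's built-in WebDriverWait, and it reads the table through Selenium's own element lookups rather than BeautifulSoup.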
# Necessary imports
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

def getData (url, month, year):
    driver = webdriver.Chrome('C:/Users/adam/Desktop/chromedriver.exe')
    driver.get(url)
    wait = WebDriverWait(driver, 5)  # wait up to 5 seconds
    Data = []
    for opt in driver.find_elements_by_css_selector("#wt-his-select option"):
        opt.click()
        # wait until the table title changes to selected date
        wait.until(EC.text_to_be_present_in_element((By.ID, 'wt-his-title'), opt.text))
        for tr in driver.find_elements_by_css_selector('#wt-his tbody tr'):
            row = {}
            row['time'] = tr.find_element_by_tag_name('th').text.strip()
            # Note that I replaced 5 with 6 as nth-of-xxx starts indexing from 1
            # (this is a CSS selector, so it needs the css_selector lookup)
            row['humidity'] = tr.find_element_by_css_selector('td:nth-of-type(6)').text.strip()
            Data.append(row)
    # continue with csv handlers ...
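The CSV step is left out above. One way it might look, reusing the `csv.DictWriter` approach from the question, is sketched below. `write_rows` is a hypothetical helper name, and the sample rows and file name are made up for illustration. Passing `newline=''` to `open` follows the recommendation in the `csv` module documentation; without it, extra blank lines can appear between rows on Windows.

```python
import csv
import os
import tempfile


def write_rows(rows, file_name):
    """Write a list of dicts to a CSV file (hypothetical helper)."""
    if not rows:
        # Indexing rows[0] on an empty list would raise IndexError.
        return False
    with open(file_name, "w", newline="") as result:
        writer = csv.DictWriter(result, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return True


# Made-up sample rows in the same shape as the scraped data.
sample = [
    {"time": "00:51", "humidity": "85%"},
    {"time": "01:51", "humidity": "88%"},
]
path = os.path.join(tempfile.mkdtemp(), "output_month=1_year=2018.csv")
written = write_rows(sample, path)

# Read the file back to confirm the rows round-trip unchanged.
with open(path, newline="") as check:
    read_back = [dict(r) for r in csv.DictReader(check)]
```

Inside `getData`, the equivalent call would be `write_rows(Data, fileName)` once the loop over the options has finished.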