
Scraping Weather Data from multiple pages

I am new to Python.

I want to scrape weather data from the website http://www.estesparkweather.net/archive_reports.php?date=200901. I have to scrape all available attributes of the weather data for every day from 2009-01-01 to 2018-10-28, and I have to represent the scraped data as a pandas dataframe object.

Below are the specific details the dataframe should satisfy:

Expected column names (order does not matter):

 ['Average temperature (°F)', 'Average humidity (%)',
 'Average dewpoint (°F)', 'Average barometer (in)',
 'Average windspeed (mph)', 'Average gustspeed (mph)',
 'Average direction (°deg)', 'Rainfall for month (in)',
 'Rainfall for year (in)', 'Maximum rain per minute',
 'Maximum temperature (°F)', 'Minimum temperature (°F)',
 'Maximum humidity (%)', 'Minimum humidity (%)', 'Maximum pressure',
 'Minimum pressure', 'Maximum windspeed (mph)',
 'Maximum gust speed (mph)', 'Maximum heat index (°F)']

Each record in the dataframe corresponds to the weather details of a given day.
The index column is in date-time format (yyyy-mm-dd).
I need to perform the necessary data cleaning and type cast each attribute to the relevant data type.

After scraping, I need to save the dataframe as a pickle file named "dataframe.pk".

Below is the code I initially wrote to read the page using BeautifulSoup, but there are multiple pages for each month, and I am not sure how to loop the URL from January 2009 to October 2018 and get that content into the soup. Could someone please help?

import bs4
from bs4 import BeautifulSoup
import csv
import requests
import time
import pandas as pd
import urllib
import re
import pickle
import numpy as np

url = "http://www.estesparkweather.net/archive_reports.php?date=200901"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
print(type(soup))  # <class 'bs4.BeautifulSoup'>

# Get the title
title = soup.title
print(title)

# Print out the text
text = soup.get_text()
print(text)

# Print the first 10 rows for a sanity check
rows = soup.find_all('tr')
print(rows[:10])

To read the information for the 2009-01-01 to 2018-10-28 time range, you have to understand the URL pattern:

http://www.estesparkweather.net/archive_reports.php?date=YYYYMM

Example:

http://www.estesparkweather.net/archive_reports.php?date=201008

So you need to create a nested loop that reads the data for each year/month combination.

Something like:

URL_TEMPLATE = 'http://www.estesparkweather.net/archive_reports.php?date={}{:02d}'
for year in range(2009, 2019):   # 2009 through 2018 inclusive
    for month in range(1, 13):   # {:02d} zero-pads single-digit months as the URL requires
        url = URL_TEMPLATE.format(year, month)
        # TODO implement the actual scraping of a single page
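A slightly fuller sketch of that loop, assuming requests and BeautifulSoup as in the question's code (collecting one soup per monthly page into a dict is just an illustration, not the only way):

import requests
from bs4 import BeautifulSoup

URL_TEMPLATE = 'http://www.estesparkweather.net/archive_reports.php?date={}{:02d}'

# One BeautifulSoup object per monthly archive page, 2009-01 through 2018-10.
monthly_soups = {}
for year in range(2009, 2019):
    for month in range(1, 13):
        if (year, month) > (2018, 10):
            break  # stop after October 2018
        url = URL_TEMPLATE.format(year, month)
        page = requests.get(url)
        monthly_soups[(year, month)] = BeautifulSoup(page.content, 'html.parser')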

I just tried writing it from scratch using your original problem statement, and it worked fine for me:

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime
from tqdm import tqdm

range_date = pd.date_range(start='1/1/2009', end='11/01/2018', freq='M')
dates = [str(i)[:4] + str(i)[5:7] for i in range_date]

lst = []
index = []

for j in tqdm(range(len(dates))):
    url = "http://www.estesparkweather.net/archive_reports.php?date=" + dates[j]
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find_all('table')

    data_parse = [row.text.splitlines() for row in table]
    data_parse = data_parse[:-9]

    # Keep every third line, which holds the attribute values
    for k in range(len(data_parse)):
        data_parse[k] = data_parse[k][2:len(data_parse[k]):3]

    # Extract the numbers from each attribute line
    for l in range(len(data_parse)):
        str_l = ['.'.join(re.findall(r"\d+", str(data_parse[l][k].split()[:5])))
                 for k in range(len(data_parse[l]))]
        lst.append(str_l)
        index.append(dates[j] + str_l[0])

# Keep only full-date indices and complete 19-attribute records
d1_index = [index[i] for i in range(len(index)) if len(index[i]) > 6]
data = [lst[i][1:] for i in range(len(lst)) if len(lst[i][1:]) == 19]

d2_index = [datetime.strptime(str(d1_index[i]), '%Y%m%d').strftime('%Y-%m-%d')
            for i in range(len(d1_index))]

desired_df = pd.DataFrame(data, index=d2_index)

This should be the dataframe you want, and you can perform further operations on it from here.
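For instance, the remaining requirements (column names, numeric casting, datetime index, pickling) can be applied on top of desired_df. A minimal sketch; errors='coerce' is my assumption to survive any stray non-numeric cells, not part of the original answer:

columns = ['Average temperature (°F)', 'Average humidity (%)',
           'Average dewpoint (°F)', 'Average barometer (in)',
           'Average windspeed (mph)', 'Average gustspeed (mph)',
           'Average direction (°deg)', 'Rainfall for month (in)',
           'Rainfall for year (in)', 'Maximum rain per minute',
           'Maximum temperature (°F)', 'Minimum temperature (°F)',
           'Maximum humidity (%)', 'Minimum humidity (%)', 'Maximum pressure',
           'Minimum pressure', 'Maximum windspeed (mph)',
           'Maximum gust speed (mph)', 'Maximum heat index (°F)']
desired_df.columns = columns
desired_df = desired_df.apply(pd.to_numeric, errors='coerce')  # assumption: coerce bad cells to NaN
desired_df.index = pd.to_datetime(desired_df.index)
desired_df.to_pickle("dataframe.pk")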

** You will need to import the required modules (they are included at the top of the snippet above). ** This extracts data from 2009-01-01 through 2018-10-31; you may need to drop the last 3 records to get data up to 2018-10-28.

Below is what worked for me:

import bs4
from bs4 import BeautifulSoup
import csv
import requests
import time
import pandas as pd
import urllib
import re
import pickle
Dates_r = pd.date_range(start = '01/01/2009', end = '11/01/2018', freq = 'M')
dates = [str(i)[:4] + str(i)[5:7] for i in Dates_r]
print(dates[0:5])  # sanity check: ['200901', '200902', '200903', '200904', '200905']
df_list = []
index = []
for k in range(len(dates)):
    url = "http://www.estesparkweather.net/archive_reports.php?date=" + dates[k]
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = soup.find_all('table')
    raw_data = [row.text.splitlines() for row in table]
    raw_data = raw_data[:-9]
    for i in range(len(raw_data)):
        raw_data[i] = raw_data[i][2:len(raw_data[i]):3]
    for i in range(len(raw_data)):
        c = ['.'.join(re.findall(r"\d+", str(raw_data[i][j].split()[:5]))) for j in range(len(raw_data[i]))]
        if len(c):
            df_list.append(c)
            index.append(dates[k] + c[0])

# Build the final index and data once, after all pages are parsed
f_index = [index[i] for i in range(len(index)) if len(index[i]) > 6]
data = [df_list[i][1:] for i in range(len(df_list)) if len(df_list[i][1:]) == 19]
from datetime import datetime
final_index = [datetime.strptime(str(f_index[i]), '%Y%m%d').strftime('%Y-%m-%d') for i in range(len(f_index))]
columns =  ['Average temperature (°F)', 'Average humidity (%)',
 'Average dewpoint (°F)', 'Average barometer (in)',
 'Average windspeed (mph)', 'Average gustspeed (mph)',
 'Average direction (°deg)', 'Rainfall for month (in)',
 'Rainfall for year (in)', 'Maximum rain per minute',
 'Maximum temperature (°F)', 'Minimum temperature (°F)',
 'Maximum humidity (%)', 'Minimum humidity (%)', 'Maximum pressure',
 'Minimum pressure', 'Maximum windspeed (mph)',
 'Maximum gust speed (mph)', 'Maximum heat index (°F)']
# Drop the last 3 records so the data ends at 2018-10-28
final_index2 = final_index[:-3]
data2 = data[:-3]
desired_df = pd.DataFrame(data2, index=final_index2)
desired_df.columns = columns
df = desired_df.apply(pd.to_numeric)
df.index = pd.to_datetime(df.index)
with open("dataframe.pk", "wb") as file:
    pickle.dump(df, file)
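To sanity-check the saved file, it can be loaded straight back (a minimal sketch; pd.read_pickle reads the file written by pickle.dump above):

import pandas as pd

df_check = pd.read_pickle("dataframe.pk")
print(df_check.shape)   # roughly one row per day from 2009-01-01 to 2018-10-28, 19 columns
print(df_check.dtypes)  # every column should be numeric after pd.to_numeric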
