
Web scraping with Selenium + Python

The objective is to scrape historical weather data from http://www.weather.gov.sg/climate-historical-daily/

To obtain the data for a particular month, I first have to select the city name, month and year.

There are 63 cities, 12 months and 41 years.

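from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Selenium 4 style: find_elements(By.XPATH, ...) replaces find_elements_by_xpath(...)
city = [el.text for el in driver.find_elements(By.XPATH, "/html/body/div/div/div[3]/div[1]/div[1]/div/div/ul/li/a")]
len(city)
Out[182]: 63

month = [el.text for el in driver.find_elements(By.XPATH, '//*[@id="monthDiv"]/ul/li')]
year = [el.text for el in driver.find_elements(By.XPATH, '//*[@id="yearDiv"]/ul/li')]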

Click the Display button:

button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.ID, "display")))
button.click()

How can I select an option from each of these Bootstrap drop-down lists and copy the weather data out of the following table?

<table class="table table-calendar"><colgroup>
                <col width="10%">
                <col width="10%">
                <col width="10%">
                <col width="10%">
                <col width="10%">
                <col width="10%">
                <col width="10%">
                <col width="10%">
                <col width="10%">
                <col width="10%">
              </colgroup><thead><tr><th>Date</th><th>Daily Rainfall Total (mm)</th><th>Highest &nbsp;30-min Rainfall (mm)</th><th>Highest &nbsp;60-min Rainfall (mm)</th><th>Highest 120-min Rainfall (mm)</th><th>Mean Temperature (°C)</th><th>Maximum Temperature (°C)</th><th>Minimum Temperature (°C)</th><th>Mean Wind Speed (km/h)</th><th>Max Wind Speed (km/h)</th></tr></thead><tbody><tr><td>1 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">28.5</td><td align="center">30.4</td><td align="center">26.0</td><td align="center">12.3</td><td align="center">40.7</td></tr><tr><td>2 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">28.9</td><td align="center">31.7</td><td align="center">26.9</td><td align="center">10.3</td><td align="center">31.5</td></tr><tr><td>3 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">29.2</td><td align="center">31.7</td><td align="center">27.2</td><td align="center">12.0</td><td align="center">31.5</td></tr><tr><td>4 Aug</td><td align="center">4.8</td><td align="center">4.6</td><td align="center">4.8</td><td align="center">4.8</td><td align="center">27.9</td><td align="center">30.2</td><td align="center">24.1</td><td align="center">8.8</td><td align="center">44.4</td></tr><tr><td>5 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">28.8</td><td align="center">31.8</td><td align="center">26.7</td><td align="center">8.6</td><td align="center">25.9</td></tr><tr><td>6 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">29.2</td><td align="center">31.4</td><td align="center">27.6</td><td align="center">8.1</td><td 
align="center">27.8</td></tr><tr><td>7 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">29.4</td><td align="center">32.7</td><td align="center">27.3</td><td align="center">11.4</td><td align="center">29.6</td></tr><tr><td>8 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">29.7</td><td align="center">32.9</td><td align="center">27.6</td><td align="center">11.0</td><td align="center">27.8</td></tr><tr><td>9 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">29.6</td><td align="center">32.8</td><td align="center">27.7</td><td align="center">12.3</td><td align="center">31.5</td></tr><tr><td>10 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">29.7</td><td align="center">33.0</td><td align="center">27.8</td><td align="center">12.9</td><td align="center">33.3</td></tr><tr><td>11 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">29.5</td><td align="center">32.7</td><td align="center">28.2</td><td align="center">11.0</td><td align="center">31.5</td></tr><tr><td>12 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">27.9</td><td align="center">30.0</td><td align="center">26.8</td><td align="center">8.7</td><td align="center">31.5</td></tr><tr><td>13 Aug</td><td align="center">34.6</td><td align="center">22.2</td><td align="center">30.8</td><td align="center">33.4</td><td align="center">28.3</td><td align="center">32.2</td><td align="center">22.5</td><td align="center">6.4</td><td align="center">40.7</td></tr><tr><td>14 Aug</td><td align="center">13.8</td><td 
align="center">7.2</td><td align="center">12.2</td><td align="center">12.6</td><td align="center">25.9</td><td align="center">28.5</td><td align="center">23.4</td><td align="center">5.1</td><td align="center">35.2</td></tr><tr><td>15 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">28.0</td><td align="center">31.5</td><td align="center">24.6</td><td align="center">6.5</td><td align="center">25.9</td></tr><tr><td>16 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">28.0</td><td align="center">30.0</td><td align="center">26.4</td><td align="center">8.0</td><td align="center">27.8</td></tr><tr><td>17 Aug</td><td align="center">5.2</td><td align="center">4.0</td><td align="center">4.6</td><td align="center">4.6</td><td align="center">27.4</td><td align="center">31.4</td><td align="center">24.3</td><td align="center">6.2</td><td align="center">29.6</td></tr><tr><td>18 Aug</td><td align="center">2.0</td><td align="center">1.0</td><td align="center">1.0</td><td align="center">2.0</td><td align="center">27.1</td><td align="center">30.1</td><td align="center">25.3</td><td align="center">6.4</td><td align="center">48.2</td></tr><tr><td>19 Aug</td><td align="center">1.8</td><td align="center">1.4</td><td align="center">1.6</td><td align="center">1.8</td><td align="center">28.0</td><td align="center">31.3</td><td align="center">25.4</td><td align="center">5.7</td><td align="center">25.9</td></tr><tr><td>20 Aug</td><td align="center">2.2</td><td align="center">2.0</td><td align="center">2.0</td><td align="center">2.0</td><td align="center">28.1</td><td align="center">31.9</td><td align="center">25.5</td><td align="center">10.6</td><td align="center">37.0</td></tr><tr><td>21 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td 
align="center">29.6</td><td align="center">33.0</td><td align="center">27.7</td><td align="center">15.2</td><td align="center">31.5</td></tr><tr><td>22 Aug</td><td align="center">2.0</td><td align="center">1.4</td><td align="center">1.6</td><td align="center">1.6</td><td align="center">27.9</td><td align="center">32.1</td><td align="center">25.3</td><td align="center">9.3</td><td align="center">38.9</td></tr><tr><td>23 Aug</td><td align="center">24.4</td><td align="center">8.2</td><td align="center">11.2</td><td align="center">15.2</td><td align="center">25.6</td><td align="center">27.0</td><td align="center">23.0</td><td align="center">5.1</td><td align="center">48.2</td></tr><tr><td>24 Aug</td><td align="center">0.0</td><td align="center">0.2</td><td align="center">0.2</td><td align="center">0.2</td><td align="center">28.1</td><td align="center">32.4</td><td align="center">24.5</td><td align="center">9.0</td><td align="center">33.3</td></tr><tr><td>25 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">27.9</td><td align="center">31.9</td><td align="center">25.7</td><td align="center">8.6</td><td align="center">44.4</td></tr><tr><td>26 Aug</td><td align="center">4.6</td><td align="center">4.4</td><td align="center">4.6</td><td align="center">4.6</td><td align="center">27.0</td><td align="center">31.3</td><td align="center">24.0</td><td align="center">9.6</td><td align="center">51.9</td></tr><tr><td>27 Aug</td><td align="center">1.4</td><td align="center">1.4</td><td align="center">1.4</td><td align="center">1.4</td><td align="center">27.8</td><td align="center">30.4</td><td align="center">25.6</td><td align="center">8.4</td><td align="center">27.8</td></tr><tr><td>28 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">28.9</td><td align="center">32.3</td><td align="center">26.2</td><td 
align="center">9.6</td><td align="center">33.3</td></tr><tr><td>29 Aug</td><td align="center">6.6</td><td align="center">2.8</td><td align="center">3.4</td><td align="center">4.8</td><td align="center">27.2</td><td align="center">30.8</td><td align="center">25.1</td><td align="center">8.0</td><td align="center">-</td></tr><tr><td>30 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">28.6</td><td align="center">32.1</td><td align="center">26.4</td><td align="center">11.2</td><td align="center">35.2</td></tr><tr><td>31 Aug</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">0.0</td><td align="center">29.0</td><td align="center">32.2</td><td align="center">27.2</td><td align="center">11.7</td><td align="center">29.6</td></tr></tbody></table>

Here's a different approach.

Why not get all the .csv files for all the cities and all the dates? The link to each file is static and uses the code of the city from the drop-down menu. You can parse the page, grab the code, put it in the URL and download the .csv file. You also have to loop over all the years.

By the way, not all cities have data for the past 40 years.

import re
import requests
from bs4 import BeautifulSoup

# Headers sent with every request. Accept-Encoding is left to requests,
# which only advertises the encodings it is able to decode.
headers = {
    "User-Agent": "PostmanRuntime/7.26.5",
    "Accept": "*/*",
}

response = requests.get("http://www.weather.gov.sg/climate-historical-daily/", headers=headers)

# Each <li> of the city drop-down holds an <a> whose onclick attribute contains the city code (S followed by digits)
items = BeautifulSoup(response.text, "html.parser").find("ul", {"class": "dropdown-menu long-dropdown"}).find_all("li")
cities_and_codes = {
    t.find("a").getText(strip=True): re.search(r'(S\d+)', t.find("a")['onclick']).group(1)
    for t in items
}


def get_dates():
    # Yield (year, zero-padded month) pairs for 1980 through 2020
    for y in range(1980, 2021):
        for m in range(1, 13):
            yield y, f"{m:02d}"


files_url = "http://www.weather.gov.sg/files/dailydata/DAILYDATA_"
for city, code in cities_and_codes.items():
    for year, month in get_dates():
        csv_url = f"{files_url}{code}_{year}{month}.csv"
        response = requests.get(csv_url, headers=headers)
        if response.status_code == 200:
            print(f"Found data for {city} for {month}/{year}. Fetching {csv_url}")
            # Save as <City_Name>_DAILYDATA_<code>_<yyyymm>.csv in the current directory
            with open(f"{city.replace(' ', '_')}_{csv_url.split('/')[-1]}", "wb") as f:
                f.write(response.content)
        else:
            print(f"No data available for {city} for {month}/{year}...")

You can play around with this and fetch the files only for the cities you want, or for all of them, but that might take a while.
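To restrict the download to a few cities, the loop can be driven by a filtered list of URLs. This is a minimal sketch, assuming `cities_and_codes` maps city name to city code as in the script above. The helper name `csv_urls` and the code `S00` are made up for illustration.

```python
from itertools import product

FILES_URL = "http://www.weather.gov.sg/files/dailydata/DAILYDATA_"


def csv_urls(cities_and_codes, wanted, years, months=range(1, 13)):
    """Yield (city, url) pairs for the wanted cities only.

    `cities_and_codes` maps city name -> city code.
    Cities in `wanted` that are missing from the mapping are skipped.
    """
    for city in wanted:
        code = cities_and_codes.get(city)
        if code is None:
            continue
        for year, month in product(years, months):
            yield city, f"{FILES_URL}{code}_{year}{month:02d}.csv"


# Hypothetical mapping: "S00" is a placeholder, not a real city code.
sample = {"Bukit Timah": "S00"}
urls = list(csv_urls(sample, ["Bukit Timah"], years=range(2019, 2021)))
print(len(urls))   # 2 years x 12 months = 24
print(urls[0][1])
```

For scale, all 63 cities over 12 months and 41 years come to 63 × 12 × 41 = 30,996 requests, which is why the full download takes a while.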

City, Month and Year are not drop-downs. They are buttons, so they can be handled with simple click operations.

Try the code below to select the city, and use the same approach for Month and Year.

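from selenium.webdriver.common.by import By

# Selenium 4 style: find_element(By..., ...) replaces the find_element_by_* helpers
city_button = driver.find_element(By.ID, 'cityname')  # Locate the City button
city_button.click()                                   # Open the city list

bukit_timah = driver.find_element(By.XPATH, "//a[text()='Bukit Timah']")  # Locate 'Bukit Timah' in the list
bukit_timah.click()                                   # Select 'Bukit Timah'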
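The same two clicks can be wrapped in a small helper and reused for Month and Year. This is a sketch, not a tested recipe for this site. `select_option` is a made-up name, and only the `cityname` id comes from the code above. The ids of the Month and Year buttons, and whether their entries are also `<a>` links, are assumptions that need to be checked in the DOM. The locator strategies are passed as the plain strings "id" and "xpath", which are the values of Selenium's `By.ID` and `By.XPATH`.

```python
def select_option(driver, button_id, label):
    """Open the button-style list with id `button_id`, then click the entry `label`."""
    # "id" and "xpath" are the values of selenium's By.ID and By.XPATH.
    driver.find_element("id", button_id).click()  # open the list
    # Exact text match; a label containing a single quote would need escaping.
    driver.find_element("xpath", f"//a[text()='{label}']").click()


# Usage with a live driver:
# select_option(driver, "cityname", "Bukit Timah")
```

If an entry is not yet clickable right after the list opens, a `WebDriverWait` with `EC.element_to_be_clickable`, as used for the Display button in the question, can be placed before the second click.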

Please refer to the screenshot to understand the DOM.
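Once the Display button has been clicked, the rendered table can be read from `driver.page_source` and turned into rows. This is a minimal sketch using only the standard library's `html.parser`. The class name `TableRows` is illustrative, and the sample string is cut down from the table in the question to three columns and two rows.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self._in_cell:
            self.rows[-1][-1] += data


html = (
    '<table class="table table-calendar"><thead><tr><th>Date</th>'
    "<th>Daily Rainfall Total (mm)</th><th>Mean Temperature (°C)</th></tr></thead>"
    '<tbody><tr><td>1 Aug</td><td align="center">0.0</td><td align="center">28.5</td></tr>'
    '<tr><td>4 Aug</td><td align="center">4.8</td><td align="center">27.9</td></tr>'
    "</tbody></table>"
)

parser = TableRows()
parser.feed(html)
header, *data = parser.rows
print(header)
print(data)
```

With a live driver, `parser.feed(driver.page_source)` would take the place of the sample string. In that case the rows of any other table on the page are collected too. Missing values appear as "-" in the table, so they need handling before the cells are converted to numbers.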
