
How to download all rows of data from a website using BeautifulSoup

I want to get some information from a weather site: https://pogoda.interia.pl/archiwum-pogody-08-10-2019,cId,21295

The hour and the minutes, respectively:

<div class="entry-hour">
        <span><span class="hour">0</span><span class="minutes">00</span></span>
    </div>

The forecast temperature:

<span class="forecast-temp">9°C</span>

And the feels-like temperature:

<span class="forecast-feeltemp">Odczuwalna 4°C </span>

I'm stuck, because I don't know how to get all the rows and the rest of the data ;( Thanks in advance for your help...

Below is my pseudocode ;)

#!/usr/bin/python3
import pymysql.cursors
from time import sleep, gmtime, strftime
import datetime
import pytz
from selenium import webdriver
from bs4 import BeautifulSoup


options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')



browser = webdriver.Chrome(
        "/usr/bin/chromedriver",
        chrome_options=options)

browser.get("https://pogoda.interia.pl/archiwum-pogody-08-10-2019,cId,21295")
sleep(3)
source = browser.page_source # Get the entire page source from the browser
if browser is not None:
    browser.close()  # No need for the browser any more, so close it
soup = BeautifulSoup(source,'html.parser')
try:
    Tags = soup.select('.weather-forecast-hbh-list') # get the elements using css selectors    
    for tag in Tags: # loop through them 
        hour      = tag.find('div').find('span').text
        #minutes = ?
        #temp =?
        #feel_temp = ?
        print (hour + "\n")

except Exception as e:
    print(e)

One approach is to iterate over all the divs with the class weather-entry and then extract the text from each one, building up a table structure along the way.

For example:

import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

page = requests.get('https://pogoda.interia.pl/archiwum-pogody-08-10-2019,cId,21295').content
weather_entries = BeautifulSoup(page, "html.parser").find_all("div", {"class": "weather-entry"})


def extract_text(element, class_name):
    return element.find("div", class_=class_name).getText(strip=True)


div_classes = [
    "entry-hour",
    "entry-forecast",
    "entry-wind",
    "entry-precipitation",
    "entry-humidity",
]

table = [[extract_text(e, c) for c in div_classes] for e in weather_entries]
columns = ["Time:", "Forecast", "Wind", "Precipitation", "Humidity"]
print(tabulate(table, headers=columns, tablefmt="pretty"))

This outputs:

+-------+---------------------------------------+----------------------+---------------+----------+
| Time: |               Forecast                |         Wind         | Precipitation | Humidity |
+-------+---------------------------------------+----------------------+---------------+----------+
|  000  |     -2°COdczuwalna 0°CBezchmurnie     |   S4km/hMax 4 km/h   |               |   97%    |
|  100  |    -2°COdczuwalna -1°CBezchmurnie     |   S4km/hMax 7 km/h   |   Zachm:10%   |   98%    |
|  200  |    -2°COdczuwalna -1°CBezchmurnie     |  SSW4km/hMax 8 km/h  |               |   98%    |
|  300  |    -2°COdczuwalna -1°CBezchmurnie     |   S4km/hMax 7 km/h   |               |   98%    |
|  400  |     -2°COdczuwalna 1°CBezchmurnie     |   N0km/hMax 7 km/h   |               |   93%    |
|  500  |     -2°COdczuwalna 1°CBezchmurnie     |   N0km/hMax 6 km/h   |               |   99%    |
|  600  | -2°COdczuwalna -1°CZachmurzenie duże  |  SSW4km/hMax 6 km/h  |   Zachm:76%   |   92%    |
|  700  |  -1°COdczuwalna 3°CZachmurzenie duże  |   N0km/hMax 7 km/h   |   Zachm:76%   |   84%    |
|  800  |     -3°COdczuwalna -1°CPochmurno      |  SSW4km/hMax 8 km/h  |   Zachm:91%   |   99%    |
|  900  |      3°COdczuwalna 5°CPochmurno       |  SSW4km/hMax 8 km/h  |   Zachm:91%   |   79%    |
| 1000  |      5°COdczuwalna 4°CPochmurno       |  S11km/hMax 11 km/h  |   Zachm:91%   |   71%    |
| 1100  |      6°COdczuwalna 5°CPochmurno       | SSW11km/hMax 20 km/h |  Zachm:100%   |   65%    |
| 1200  |      9°COdczuwalna 7°CPochmurno       |  S15km/hMax 25 km/h  |  Zachm:100%   |   66%    |
| 1300  |   10°COdczuwalna 8°CPrzelotne opady   |  S15km/hMax 25 km/h  |  Zachm:100%   |   60%    |
| 1400  |      11°COdczuwalna 8°CPochmurno      |  S18km/hMax 24 km/h  |  Zachm:100%   |   55%    |
| 1500  |      10°COdczuwalna 6°CPochmurno      |  S22km/hMax 27 km/h  |   Zachm:91%   |   57%    |
| 1600  |      10°COdczuwalna 6°CPochmurno      |  S22km/hMax 31 km/h  |   Zachm:91%   |   60%    |
| 1700  |   12°COdczuwalna 8°CPrzelotne opady   |  S18km/hMax 32 km/h  |  Zachm:100%   |   53%    |
| 1800  | 9°COdczuwalna 4°CCzęściowo słonecznie |  S18km/hMax 33 km/h  |   Zachm:50%   |   66%    |
| 1900  |      8°COdczuwalna 4°CPochmurno       |  S15km/hMax 31 km/h  |  Zachm:100%   |   82%    |
| 2000  |      8°COdczuwalna 4°CPochmurno       |  S18km/hMax 22 km/h  |   Zachm:91%   |   82%    |
| 2100  |   9°COdczuwalna 5°CPrzelotne opady    | SSW18km/hMax 22 km/h |  Zachm:100%   |   78%    |
| 2200  |      8°COdczuwalna 4°CPochmurno       | SSW15km/hMax 28 km/h |  Zachm:100%   |   80%    |
| 2300  |   8°COdczuwalna 5°CPrzelotne opady    | SSW11km/hMax 25 km/h |   Zachm:91%   |   81%    |
+-------+---------------------------------------+----------------------+---------------+----------+

Obviously you will need to do some parsing of the text values, but this should get you started.
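
For instance, since getText(strip=True) squashes the child spans together, one option is to pull them apart again with a regular expression. This is only a sketch that assumes the "temp / Odczuwalna feel-temp / description" pattern visible in the Forecast column above:

import re

def parse_forecast(text):
    # e.g. "-2°COdczuwalna 0°CBezchmurnie" -> (-2, 0, 'Bezchmurnie')
    match = re.match(r"(-?\d+)°COdczuwalna (-?\d+)°C(.*)", text)
    if match:
        temp, feel_temp, description = match.groups()
        return int(temp), int(feel_temp), description
    return None

print(parse_forecast("-2°COdczuwalna 0°CBezchmurnie"))  # (-2, 0, 'Bezchmurnie')

Alternatively, passing a separator such as getText(" ", strip=True) keeps the child spans apart in the first place, which makes the columns easier to split.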

Thanks, my friend, I've got it ;) I have to get all the items first and then loop over them ;)

#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup

page = requests.get('https://pogoda.interia.pl/archiwum-pogody-08-10-2019,cId,21295').content
weather_entries = BeautifulSoup(page, "html.parser").find_all("div", {"class": "weather-entry"})
for entry in weather_entries:
    hour = entry.find('span', {'class': 'hour'}).text
    minutes = entry.find('span', {'class': 'minutes'}).text
    temp = entry.find('span', {'class': 'forecast-temp'}).text
    temp_feel = entry.find('span', {'class': 'forecast-feeltemp'}).text
    print(hour + ":" + minutes + " \t " + temp + " \t " + temp_feel)
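
Since my first script already imports pymysql.cursors, the scraped rows can go straight into MySQL. Below is only a rough sketch of that step, assuming a hypothetical local weather database with an archive(hour, minutes, temp, feel_temp) table:

import pymysql.cursors
import requests
from bs4 import BeautifulSoup

page = requests.get('https://pogoda.interia.pl/archiwum-pogody-08-10-2019,cId,21295').content
weather_entries = BeautifulSoup(page, "html.parser").find_all("div", {"class": "weather-entry"})

# Hypothetical connection details and table layout -- adjust to your own schema.
connection = pymysql.connect(host='localhost',
                             user='user',
                             password='secret',
                             database='weather')
try:
    with connection.cursor() as cursor:
        for entry in weather_entries:
            hour = entry.find('span', {'class': 'hour'}).text
            minutes = entry.find('span', {'class': 'minutes'}).text
            temp = entry.find('span', {'class': 'forecast-temp'}).text
            temp_feel = entry.find('span', {'class': 'forecast-feeltemp'}).text
            cursor.execute(
                "INSERT INTO archive (hour, minutes, temp, feel_temp) VALUES (%s, %s, %s, %s)",
                (hour, minutes, temp, temp_feel))
    connection.commit()
finally:
    connection.close()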
    

I don't have much experience with BeautifulSoup, but the same can be achieved with Selenium web scraping itself, using XPath. The code below can be used to extract the required details.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

browser = webdriver.Chrome(
        "/usr/bin/chromedriver",
        chrome_options=options)

browser.get("https://pogoda.interia.pl/archiwum-pogody-08-10-2019,cId,21295")
WebDriverWait(browser, 30).until(EC.presence_of_element_located((By.XPATH, "//div[@class='entry-hour']")))
weather_entry = browser.find_elements_by_xpath("//div[@class='weather-entry']")
for w in weather_entry:
    hour = w.find_element_by_xpath(".//div[@class='entry-hour']/span/span[@class='hour']").text
    temp = w.find_element_by_xpath(".//div[@class='entry-forecast']/div//span[@class='temp-info']/span[@class='forecast-temp']").text
    feeltemp = w.find_element_by_xpath(".//div[@class='entry-forecast']/div//span[@class='temp-info']/span[@class='forecast-feeltemp']").text
    print('hour '+ hour + ' temp ' + temp + ' feeltemp ' + feeltemp)
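
Note that in Selenium 4 the find_element(s)_by_xpath helpers were removed. A minimal sketch of the same extraction with the newer find_element(By.XPATH, ...) API, assuming the same chromedriver path as above:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Selenium 4 takes the driver path via a Service object and uses options= instead of chrome_options=.
browser = webdriver.Chrome(service=Service("/usr/bin/chromedriver"), options=options)
browser.get("https://pogoda.interia.pl/archiwum-pogody-08-10-2019,cId,21295")
WebDriverWait(browser, 30).until(
    EC.presence_of_element_located((By.XPATH, "//div[@class='entry-hour']")))

for w in browser.find_elements(By.XPATH, "//div[@class='weather-entry']"):
    hour = w.find_element(By.XPATH, ".//span[@class='hour']").text
    temp = w.find_element(By.XPATH, ".//span[@class='forecast-temp']").text
    feeltemp = w.find_element(By.XPATH, ".//span[@class='forecast-feeltemp']").text
    print('hour ' + hour + ' temp ' + temp + ' feeltemp ' + feeltemp)

browser.quit()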
