How to download all rows of data from a website using BeautifulSoup
I want to get some information from a weather page: https://pogoda.interia.pl/archiwum-pogody-08-10-2019,cId,21295
The hour and minutes, separately:
<div class="entry-hour">
<span><span class="hour">0</span><span class="minutes">00</span></span>
</div>
The forecast temperature:
<span class="forecast-temp">9°C</span>
And the feels-like temperature:
<span class="forecast-feeltemp">Odczuwalna 4°C </span>
I'm stuck, because I don't know how to get all the rows and the rest of the data ;( Thanks in advance for your help...
Here is my pseudocode ;)
#!/usr/bin/python3
import pymysql.cursors
from time import sleep, gmtime, strftime
import datetime
import pytz
from selenium import webdriver
from bs4 import BeautifulSoup
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
browser = webdriver.Chrome(
    ("/usr/bin/chromedriver"),
    chrome_options=options)
browser.get("https://pogoda.interia.pl/archiwum-pogody-08-10-2019,cId,21295")
sleep(3)
source = browser.page_source # Get the entire page source from the browser
if browser is not None:
    browser.close()  # No need for the browser so close it
soup = BeautifulSoup(source, 'html.parser')
try:
    Tags = soup.select('.weather-forecast-hbh-list')  # get the elements using css selectors
    for tag in Tags:  # loop through them
        hour = tag.find('div').find('span').text
        #minutes = ?
        #temp =?
        #feel_temp = ?
        print(hour + "\n")
except Exception as e:
    print(e)
One approach is to loop over all the div elements with the class weather-entry, extract the text from each one, and build a table structure along the way.
For example:
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate
page = requests.get('https://pogoda.interia.pl/archiwum-pogody-08-10-2019,cId,21295').content
weather_entries = BeautifulSoup(page, "html.parser").find_all("div", {"class": "weather-entry"})
def extract_text(element, class_name):
    return element.find("div", class_=class_name).getText(strip=True)
div_classes = [
    "entry-hour",
    "entry-forecast",
    "entry-wind",
    "entry-precipitation",
    "entry-humidity",
]
table = [[extract_text(e, c) for c in div_classes] for e in weather_entries]
columns = ["Time:", "Forecast", "Wind", "Precipitation", "Humidity"]
print(tabulate(table, headers=columns, tablefmt="pretty"))
This outputs:
+-------+---------------------------------------+----------------------+---------------+----------+
| Time: | Forecast | Wind | Precipitation | Humidity |
+-------+---------------------------------------+----------------------+---------------+----------+
| 000 | -2°COdczuwalna 0°CBezchmurnie | S4km/hMax 4 km/h | | 97% |
| 100 | -2°COdczuwalna -1°CBezchmurnie | S4km/hMax 7 km/h | Zachm:10% | 98% |
| 200 | -2°COdczuwalna -1°CBezchmurnie | SSW4km/hMax 8 km/h | | 98% |
| 300 | -2°COdczuwalna -1°CBezchmurnie | S4km/hMax 7 km/h | | 98% |
| 400 | -2°COdczuwalna 1°CBezchmurnie | N0km/hMax 7 km/h | | 93% |
| 500 | -2°COdczuwalna 1°CBezchmurnie | N0km/hMax 6 km/h | | 99% |
| 600 | -2°COdczuwalna -1°CZachmurzenie duże | SSW4km/hMax 6 km/h | Zachm:76% | 92% |
| 700 | -1°COdczuwalna 3°CZachmurzenie duże | N0km/hMax 7 km/h | Zachm:76% | 84% |
| 800 | -3°COdczuwalna -1°CPochmurno | SSW4km/hMax 8 km/h | Zachm:91% | 99% |
| 900 | 3°COdczuwalna 5°CPochmurno | SSW4km/hMax 8 km/h | Zachm:91% | 79% |
| 1000 | 5°COdczuwalna 4°CPochmurno | S11km/hMax 11 km/h | Zachm:91% | 71% |
| 1100 | 6°COdczuwalna 5°CPochmurno | SSW11km/hMax 20 km/h | Zachm:100% | 65% |
| 1200 | 9°COdczuwalna 7°CPochmurno | S15km/hMax 25 km/h | Zachm:100% | 66% |
| 1300 | 10°COdczuwalna 8°CPrzelotne opady | S15km/hMax 25 km/h | Zachm:100% | 60% |
| 1400 | 11°COdczuwalna 8°CPochmurno | S18km/hMax 24 km/h | Zachm:100% | 55% |
| 1500 | 10°COdczuwalna 6°CPochmurno | S22km/hMax 27 km/h | Zachm:91% | 57% |
| 1600 | 10°COdczuwalna 6°CPochmurno | S22km/hMax 31 km/h | Zachm:91% | 60% |
| 1700 | 12°COdczuwalna 8°CPrzelotne opady | S18km/hMax 32 km/h | Zachm:100% | 53% |
| 1800 | 9°COdczuwalna 4°CCzęściowo słonecznie | S18km/hMax 33 km/h | Zachm:50% | 66% |
| 1900 | 8°COdczuwalna 4°CPochmurno | S15km/hMax 31 km/h | Zachm:100% | 82% |
| 2000 | 8°COdczuwalna 4°CPochmurno | S18km/hMax 22 km/h | Zachm:91% | 82% |
| 2100 | 9°COdczuwalna 5°CPrzelotne opady | SSW18km/hMax 22 km/h | Zachm:100% | 78% |
| 2200 | 8°COdczuwalna 4°CPochmurno | SSW15km/hMax 28 km/h | Zachm:100% | 80% |
| 2300 | 8°COdczuwalna 5°CPrzelotne opady | SSW11km/hMax 25 km/h | Zachm:91% | 81% |
+-------+---------------------------------------+----------------------+---------------+----------+
Obviously, you will need to do some parsing of the text values, but this should get you started.
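As a rough illustration of that parsing step, the fused cells in the table above (for example "1300" or "-2°COdczuwalna 0°CBezchmurnie") could be split with a regular expression. This is only a sketch: the helper names parse_time and parse_forecast are made up for this example, and the pattern assumes every forecast cell has the shape seen in the output above.

```python
import re

# Pattern for a forecast cell such as "-2°COdczuwalna 0°CBezchmurnie":
# temperature, the literal "Odczuwalna" (feels-like) label, the feels-like
# temperature, then the free-text sky description.
# Assumption: all cells follow the shape shown in the table output.
FORECAST_RE = re.compile(r"^(-?\d+)°C\s*Odczuwalna\s*(-?\d+)°C\s*(.*)$")


def parse_time(cell):
    """Split a fused time cell such as '000' or '1300' into (hour, minute)."""
    # The last two digits are the minutes; whatever precedes them is the hour.
    return int(cell[:-2]), int(cell[-2:])


def parse_forecast(cell):
    """Return (temp_c, feels_like_c, description) from a forecast cell."""
    match = FORECAST_RE.match(cell.strip())
    if match is None:
        raise ValueError(f"unexpected forecast cell: {cell!r}")
    temp, feels, description = match.groups()
    return int(temp), int(feels), description.strip()


print(parse_time("1300"))
print(parse_forecast("-2°COdczuwalna 0°CBezchmurnie"))
```

Raising on an unexpected cell, rather than returning None, makes a change in the page layout show up immediately instead of producing silently incomplete rows.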
Thanks, my friend, I've got it now ;) I had to fetch all the items first and then loop over them ;)
#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
page = requests.get('https://pogoda.interia.pl/archiwum-pogody-08-10-2019,cId,21295').content
weather_entries = BeautifulSoup(page, "html.parser").find_all("div", {"class": "weather-entry"})
for weather_entrie in weather_entries:
    hour = weather_entrie.find('span', {'class' : 'hour'}).text
    minutes = weather_entrie.find('span', {'class' : 'minutes'}).text
    temp = weather_entrie.find('span', {'class' : 'forecast-temp'}).text
    tempFeel = weather_entrie.find('span', {'class' : 'forecast-feeltemp'}).text
    print(hour + ":" + minutes + " \t " + temp + " \t " + tempFeel)
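If the values are meant for storage or further calculation rather than printing, the four strings from the loop could be converted into typed values first. The sketch below is one possible way to do that; to_celsius and to_record are hypothetical helper names, and the sample inputs are the strings shown in the HTML fragments from the question.

```python
import re
from datetime import time

# First (optionally negative) integer found in the text.
TEMP_RE = re.compile(r"-?\d+")


def to_celsius(text):
    """Extract the integer temperature from '9°C' or 'Odczuwalna 4°C '."""
    match = TEMP_RE.search(text)
    if match is None:
        raise ValueError(f"no temperature found in {text!r}")
    return int(match.group())


def to_record(hour, minutes, temp, temp_feel):
    """Build one typed row from the raw strings scraped for a single entry."""
    return {
        "time": time(int(hour), int(minutes)),
        "temp_c": to_celsius(temp),
        "feels_like_c": to_celsius(temp_feel),
    }


# Sample values taken from the HTML fragments in the question.
record = to_record("0", "00", "9°C", "Odczuwalna 4°C ")
print(record["time"].strftime("%H:%M"), record["temp_c"], record["feels_like_c"])
```

Inside the loop this would be called as to_record(hour, minutes, temp, tempFeel), and the resulting dictionaries could be appended to a list for later use.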
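I don't have much experience with BeautifulSoup, but the same thing can be done with Selenium WebDriver on its own, using XPath for the scraping. The code below can be used to extract the required details.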
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Selenium 4 API: the driver path is passed through Service and the options through options=.
# Older Selenium 3 releases used chrome_options= and find_element_by_xpath instead.
browser = webdriver.Chrome(
    service=Service("/usr/bin/chromedriver"),
    options=options)
browser.get("https://pogoda.interia.pl/archiwum-pogody-08-10-2019,cId,21295")
# Wait (up to 30 s) until at least one hourly entry is present in the DOM
WebDriverWait(browser, 30).until(EC.presence_of_element_located((By.XPATH, "//div[@class='entry-hour']")))
weather_entry = browser.find_elements(By.XPATH, "//div[@class='weather-entry']")
for w in weather_entry:
    # The leading "." keeps each XPath relative to the current weather-entry div
    hour = w.find_element(By.XPATH, ".//div[@class='entry-hour']/span/span[@class='hour']").text
    temp = w.find_element(By.XPATH, ".//div[@class='entry-forecast']/div//span[@class='temp-info']/span[@class='forecast-temp']").text
    feeltemp = w.find_element(By.XPATH, ".//div[@class='entry-forecast']/div//span[@class='temp-info']/span[@class='forecast-feeltemp']").text
    print('hour '+ hour + ' temp ' + temp + ' feeltemp ' + feeltemp)
browser.quit()  # Close the browser once all rows have been read