Scraping web data with select/option using requests_html and BeautifulSoup in Python3
I'm new to data scraping, but I'm not asking this question carelessly without having searched around for a suitable answer first.

I want to download the table on this page: https://www.portodemanaus.com.br/?pagina=nivel-do-rio-negro-hoje.

As you can see from the screenshot below, there are a couple of select/option menus at the top of the table. The corresponding HTML code (on the right) shows that the second semester (2) and the year 2021 are selected. By re-selecting and resubmitting the form, the content of the table changes, but the URL stays the same. The changes are, however, reflected in the HTML code. See the second screenshot below, where the options are changed to 1 and 2018.

Based on these inspections, I put together a Python script (using bs4 and requests_html) that gets the initial page, modifies the select/option elements, and then posts them back to the URL. See the code below. However, it fails at its task: the web page does not respond to the modifications. Can anyone explain why?

Thanks in advance,
Liang
from bs4 import BeautifulSoup
from requests_html import HTMLSession
from urllib.parse import urljoin
url = "https://www.portodemanaus.com.br/?pagina=nivel-do-rio-negro-hoje#"
# initialize an HTTP session
session = HTMLSession()
# Get request
res = session.get(url)
# for javascript driven website
# res.html.render()
soup = BeautifulSoup(res.html.html, "html.parser")
# Get all select tags
selects = soup.find_all("select")
# Modify select tags
# Select the first half of a year
selects[0].contents[1].attrs['selected']=''
del selects[0].contents[3].attrs['selected']
# Put into a dictionary
data = {}
data[selects[0]['name']] = selects[0]
data[selects[1]['name']] = selects[1]
# Post it back to the website
res = session.post(url, data=data)
# Remake the soup after the modification
soup = BeautifulSoup(res.content, "html.parser")
# the below code is only for replacing relative URLs to absolute ones
for link in soup.find_all("link"):
    try:
        link.attrs["href"] = urljoin(url, link.attrs["href"])
    except:
        pass
for script in soup.find_all("script"):
    try:
        script.attrs["src"] = urljoin(url, script.attrs["src"])
    except:
        pass
for img in soup.find_all("img"):
    try:
        img.attrs["src"] = urljoin(url, img.attrs["src"])
    except:
        pass
for a in soup.find_all("a"):
    try:
        a.attrs["href"] = urljoin(url, a.attrs["href"])
    except:
        pass
# write the page content to a file
open("page.html", "w").write(str(soup))
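As a side note on the approach above: a POST body must map each form field *name* to a plain string *value*, not to a whole BeautifulSoup tag object. A minimal, self-contained sketch of extracting those names and values, parsing a simplified copy of the page's form markup (an assumed structure for illustration, not the live page):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page's form (assumed structure).
html = """
<form method="post">
  <select name="semestre">
    <option value="1">1</option>
    <option value="2" selected>2</option>
  </select>
  <select name="ano">
    <option value="2018">2018</option>
    <option value="2021" selected>2021</option>
  </select>
  <input type="submit" name="buscar" value="Buscar">
</form>
"""
soup = BeautifulSoup(html, "html.parser")

# Build the POST payload: field name -> string value.
data = {}
for sel in soup.find_all("select"):
    chosen = sel.find("option", selected=True)
    data[sel["name"]] = chosen["value"]
for inp in soup.find_all("input", {"type": "submit"}):
    data[inp["name"]] = inp["value"]

print(data)  # {'semestre': '2', 'ano': '2021', 'buscar': 'Buscar'}
```

This is also a quick way to confirm which field names the server actually expects before posting.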
The selection can be made with a POST request, passing `semestre` and `ano` as parameters. For example:
import pandas as pd
import requests
semestre = 1
ano = 2018
url = 'https://www.portodemanaus.com.br/?pagina=nivel-do-rio-negro-hoje'
payload = {
    'semestre': '%s' %semestre,
    'ano': '%s' %ano,
    'buscar': 'Buscar'}
response = requests.post(url, params=payload)
df = pd.read_html(response.text)[7]
Output:
print(df)
0 1 ... 11 12
0 Dias Julho ... Dezembro Dezembro
1 Dias Cota (m) ... Cota (m) Encheu/ Vazou (cm)
2 1 2994 ... 000 000
3 2 2991 ... 000 000
4 3 2989 ... 000 000
5 4 2988 ... 000 000
6 5 2987 ... 000 000
7 6 2985 ... 000 000
8 7 2983 ... 000 000
9 8 2980 ... 000 000
10 9 2977 ... 000 000
11 10 2975 ... 000 000
12 11 2972 ... 000 000
13 12 2969 ... 000 000
14 13 2967 ... 000 000
15 14 2965 ... 000 000
16 15 2962 ... 000 000
17 16 2959 ... 000 000
18 17 2955 ... 000 000
19 18 2951 ... 000 000
20 19 2946 ... 000 000
21 20 2942 ... 000 000
22 21 2939 ... 000 000
23 22 2935 ... 000 000
24 23 2931 ... 000 000
25 24 2927 ... 000 000
26 25 2923 ... 000 000
27 26 2918 ... 000 000
28 27 2912 ... 000 000
29 28 2908 ... 000 000
30 29 2902 ... 000 000
31 30 2896 ... 000 000
32 31 2892 ... 000 000
33 Estatísticas Encheu ... Estável Estável
34 Estatísticas Vazou ... Estável Estável
35 Estatísticas Mínima ... Mínima 000
36 Estatísticas Média ... Média 000
37 Estatísticas Máxima ... Máxima 000
[38 rows x 13 columns]
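The first two rows of the parsed table are really header rows (month name, then measurement), so they can be promoted to a MultiIndex header; `pd.read_html(response.text, header=[0, 1])` would do the same at parse time. A self-contained sketch on a miniature frame shaped like the output above:

```python
import pandas as pd

# Miniature stand-in for the scraped table: two header rows, then data.
raw = pd.DataFrame([
    ["Dias", "Julho", "Julho"],
    ["Dias", "Cota (m)", "Encheu/ Vazou (cm)"],
    ["1", "2994", "000"],
    ["2", "2991", "000"],
])

# Promote the first two rows to a (month, measurement) column index.
header = pd.MultiIndex.from_arrays([raw.iloc[0], raw.iloc[1]])
df = raw.iloc[2:].reset_index(drop=True)
df.columns = header

print(df[("Julho", "Cota (m)")].tolist())  # ['2994', '2991']
```

With the headers in place, a month's water-level column can be selected by its `(month, measurement)` tuple as shown.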