Scraping web data with select/option using requests_html and BeautifulSoup in Python3
I'm new to data scraping, but I'm not asking this question carelessly without having searched around for a suitable answer first.

I want to download the table on this page: https://www.portodemanaus.com.br/?pagina=nivel-do-rio-negro-hoje.

As you can see from the screenshot below, there are a couple of select/option menus at the top of the table. The corresponding HTML code (on the right) shows that the second semester (2) and the year 2021 are selected. By re-selecting and resubmitting the form, the content of the table changes, but the URL stays the same. The changes are, however, reflected in the HTML code. See the second screenshot below, where the options are changed to 1 and 2018.

Based on these inspections, I put together a Python script (using bs4 and requests_html) that gets the initial page, modifies the select/option elements, and then posts them back to the URL. See the code below. However, it fails at its task: the web page does not respond to the modifications. Can anyone explain why?

Thanks in advance,
Liang
from bs4 import BeautifulSoup
from requests_html import HTMLSession
from urllib.parse import urljoin
url = "https://www.portodemanaus.com.br/?pagina=nivel-do-rio-negro-hoje#"
# initialize an HTTP session
session = HTMLSession()
# Get request
res = session.get(url)
# for javascript driven website
# res.html.render()
soup = BeautifulSoup(res.html.html, "html.parser")
# Get all select tags
selects = soup.find_all("select")
# Modify select tags
# Select the first half of a year
selects[0].contents[1].attrs['selected']=''
del selects[0].contents[3].attrs['selected']
# Put into a dictionary
data = {}
data[selects[0]['name']] = selects[0]
data[selects[1]['name']] = selects[1]
# Post it back to the website
res = session.post(url, data=data)
# Remake the soup after the modification
soup = BeautifulSoup(res.content, "html.parser")
# the below code is only for replacing relative URLs to absolute ones
for link in soup.find_all("link"):
    try:
        link.attrs["href"] = urljoin(url, link.attrs["href"])
    except:
        pass
for script in soup.find_all("script"):
    try:
        script.attrs["src"] = urljoin(url, script.attrs["src"])
    except:
        pass
for img in soup.find_all("img"):
    try:
        img.attrs["src"] = urljoin(url, img.attrs["src"])
    except:
        pass
for a in soup.find_all("a"):
    try:
        a.attrs["href"] = urljoin(url, a.attrs["href"])
    except:
        pass
# write the page content to a file
open("page.html", "w").write(str(soup))
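As a side note on the approach above: a POST body must map each form field *name* to a plain string *value*, not to a whole BeautifulSoup tag object. A minimal, self-contained sketch of extracting those names and values, parsing a simplified copy of the page's form markup (an assumed structure for illustration, not the live page):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page's form (assumed structure).
html = """
<form method="post">
  <select name="semestre">
    <option value="1">1</option>
    <option value="2" selected>2</option>
  </select>
  <select name="ano">
    <option value="2018">2018</option>
    <option value="2021" selected>2021</option>
  </select>
  <input type="submit" name="buscar" value="Buscar">
</form>
"""
soup = BeautifulSoup(html, "html.parser")

# Build the POST payload: field name -> string value.
data = {}
for sel in soup.find_all("select"):
    chosen = sel.find("option", selected=True)
    data[sel["name"]] = chosen["value"]
for inp in soup.find_all("input", {"type": "submit"}):
    data[inp["name"]] = inp["value"]

print(data)  # {'semestre': '2', 'ano': '2021', 'buscar': 'Buscar'}
```

This is also a quick way to confirm which field names the server actually expects before posting.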
The selection can be made with a POST request, passing `semestre` and `ano` as parameters. For example:
import pandas as pd
import requests
semestre = 1
ano = 2018
url = 'https://www.portodemanaus.com.br/?pagina=nivel-do-rio-negro-hoje'
payload = {
    'semestre': '%s' %semestre,
    'ano': '%s' %ano,
    'buscar': 'Buscar'}
response = requests.post(url, params=payload)
df = pd.read_html(response.text)[7]
Output:
print(df)
0 1 ... 11 12
0 Dias Julho ... Dezembro Dezembro
1 Dias Cota (m) ... Cota (m) Encheu/ Vazou (cm)
2 1 2994 ... 000 000
3 2 2991 ... 000 000
4 3 2989 ... 000 000
5 4 2988 ... 000 000
6 5 2987 ... 000 000
7 6 2985 ... 000 000
8 7 2983 ... 000 000
9 8 2980 ... 000 000
10 9 2977 ... 000 000
11 10 2975 ... 000 000
12 11 2972 ... 000 000
13 12 2969 ... 000 000
14 13 2967 ... 000 000
15 14 2965 ... 000 000
16 15 2962 ... 000 000
17 16 2959 ... 000 000
18 17 2955 ... 000 000
19 18 2951 ... 000 000
20 19 2946 ... 000 000
21 20 2942 ... 000 000
22 21 2939 ... 000 000
23 22 2935 ... 000 000
24 23 2931 ... 000 000
25 24 2927 ... 000 000
26 25 2923 ... 000 000
27 26 2918 ... 000 000
28 27 2912 ... 000 000
29 28 2908 ... 000 000
30 29 2902 ... 000 000
31 30 2896 ... 000 000
32 31 2892 ... 000 000
33 Estatísticas Encheu ... Estável Estável
34 Estatísticas Vazou ... Estável Estável
35 Estatísticas Mínima ... Mínima 000
36 Estatísticas Média ... Média 000
37 Estatísticas Máxima ... Máxima 000
[38 rows x 13 columns]
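The first two rows of the parsed table are really header rows (month name, then measurement), so they can be promoted to a MultiIndex header; `pd.read_html(response.text, header=[0, 1])` would do the same at parse time. A self-contained sketch on a miniature frame shaped like the output above:

```python
import pandas as pd

# Miniature stand-in for the scraped table: two header rows, then data.
raw = pd.DataFrame([
    ["Dias", "Julho", "Julho"],
    ["Dias", "Cota (m)", "Encheu/ Vazou (cm)"],
    ["1", "2994", "000"],
    ["2", "2991", "000"],
])

# Promote the first two rows to a (month, measurement) column index.
header = pd.MultiIndex.from_arrays([raw.iloc[0], raw.iloc[1]])
df = raw.iloc[2:].reset_index(drop=True)
df.columns = header

print(df[("Julho", "Cota (m)")].tolist())  # ['2994', '2991']
```

With the headers in place, a month's water-level column can be selected by its `(month, measurement)` tuple as shown.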