How to read an entire web page with Python - bs4 / response only reads the first part of the page

Question

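I am trying to scrape the page https://dictionary.apa.org/caffeine-intoxication (or similar pages on dictionary.apa.org) to get a term and its definition.

The following code (and bs4 as well) returns only the <head> part of the page (only the HTML part? "Save as HTML" in the browser gives the same result):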

import requests
url = 'https://dictionary.apa.org/a-posteriori'
response = requests.get(url, allow_redirects=True)
print(response.text)

But what I really need is the rest of the page (the <body>):

<terms><term><dt><a href="/caffeine-intoxication"><h4>
<hw>caffeine intoxication</hw>
</h4></a></dt><dd>in <em>DSM–IV–TR</em> and <em>DSM–5</em>, intoxication due to recent consumption of large amounts of caffeine (typically over 250 mg), in the form of coffee, tea, cola, or medications, and involving at least five of the following symptoms: restlessness, nervousness, excitement, insomnia, flushed face, diuresis (increased urination), gastrointestinal complaints, muscle twitching, rambling thought and speech, rapid or irregular heart rhythm, periods of inexhaustibility, or psychomotor agitation. Also called <span style="font-family:arial;font-weight:bold">caffeinism</span>.</dd></term></terms>

How can I get the rest of the page?

Answer 1

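This data does not come from the host domain (dictionary.apa.org); the page appears to load the definition from a separate endpoint. Try requesting that endpoint directly: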

import requests
from bs4 import BeautifulSoup

res = requests.get('https://4hb3d9itb2.execute-api.us-east-1.amazonaws.com/prod/getdopdefinitions?q=a-posteriori')
sp = BeautifulSoup(res.text, 'lxml')
print(sp.find_all('terms'))

Output:

[<terms><term><dt><a href="/a-posteriori"><h4>
 <hw>a posteriori</hw>
 </h4></a></dt><dd>denoting conclusions derived from observations or other manifest occurrences: reasoning from facts. Compare <a href="/a-priori">a priori</a>. [Latin, “from the latter”]</dd></term></terms>]
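
To turn a fragment like the one above into plain term/definition pairs, something along these lines may work. It is only a sketch: `extract_definitions` and `sample` are hypothetical names, and it assumes every entry follows the `<term>` / `<hw>` / `<dd>` layout seen in this output. It uses the built-in "html.parser" so that it runs without lxml; the sample is a trimmed copy of the output.

```python
from bs4 import BeautifulSoup


def extract_definitions(html: str) -> list:
    """Return (headword, definition) tuples found in a <terms> fragment."""
    # "html.parser" ships with Python; "lxml" can be used instead if installed.
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for term in soup.find_all("term"):
        headword = term.find("hw")
        definition = term.find("dd")
        if headword is None or definition is None:
            continue  # skip entries that do not match the assumed layout
        pairs.append((
            headword.get_text(strip=True),
            # Collapse runs of whitespace left over from the markup.
            " ".join(definition.get_text().split()),
        ))
    return pairs


# Trimmed copy of the fragment shown in the output above.
sample = (
    '<terms><term><dt><a href="/a-posteriori"><h4>\n'
    '<hw>a posteriori</hw>\n'
    '</h4></a></dt><dd>denoting conclusions derived from observations or '
    'other manifest occurrences: reasoning from facts. '
    'Compare <a href="/a-priori">a priori</a>.</dd></term></terms>'
)

for headword, definition in extract_definitions(sample):
    print(headword, "->", definition)
```

With a live request, `res.text` from the snippet above would be passed in place of `sample`.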

Note that "lxml" is my preferred parser.
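
Since lxml is a separate install, a small fallback can keep the code working where it is missing. The following is a sketch under that assumption; `make_soup` is a hypothetical helper, not part of bs4. It relies on bs4 raising `FeatureNotFound` when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup: str, preferred: str = "lxml") -> BeautifulSoup:
    """Parse markup with the preferred parser, else fall back to html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises this when the requested parser library is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<terms><term><dd>example definition</dd></term></terms>")
print(soup.find("dd").get_text())
```

The two parsers can build slightly different trees for malformed markup, so results are worth spot-checking after a fallback.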

Solution 1
1 2022-07-07 16:20:41
