
How to read the whole web page with Python - bs4/response reads only the first part of the page

I am trying to scrape the page https://dictionary.apa.org/caffeine-intoxication (or similar pages on dictionary.apa.org) to get a term and its definition.

The following code (and bs4 as well) only gets the <head> part of the page (only the static HTML? "Save as HTML" in the browser gives the same result):

import requests
url = 'https://dictionary.apa.org/a-posteriori'
response = requests.get(url, allow_redirects=True)
print(response.text)

But I really need the other elements of the page (the <body>):

<terms><term><dt><a href="/caffeine-intoxication"><h4>
<hw>caffeine intoxication</hw>
</h4></a></dt><dd>in <em>DSM–IV–TR</em> and <em>DSM–5</em>, intoxication due to recent consumption of large amounts of caffeine (typically over 250 mg), in the form of coffee, tea, cola, or medications, and involving at least five of the following symptoms: restlessness, nervousness, excitement, insomnia, flushed face, diuresis (increased urination), gastrointestinal complaints, muscle twitching, rambling thought and speech, rapid or irregular heart rhythm, periods of inexhaustibility, or psychomotor agitation. Also called <span style="font-family:arial;font-weight:bold">caffeinism</span>.</dd></term></terms>

How can I get the rest of the page?

This data does not come from the host domain; the page loads it from a separate endpoint. Try this:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://4hb3d9itb2.execute-api.us-east-1.amazonaws.com/prod/getdopdefinitions?q=a-posteriori')
sp = BeautifulSoup(res.text, 'lxml')
print(sp.find_all('terms'))

Output:

[<terms><term><dt><a href="/a-posteriori"><h4>
 <hw>a posteriori</hw>
 </h4></a></dt><dd>denoting conclusions derived from observations or other manifest occurrences: reasoning from facts. Compare <a href="/a-priori">a priori</a>. [Latin, “from the latter”]</dd></term></terms>]
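
To go from that markup to plain text, a minimal sketch along these lines might help. It assumes every entry follows the <term>/<hw>/<dd> layout shown above; the function name `extract_definitions` and the shortened sample string are made up for illustration, and it uses the built-in "html.parser" only to avoid an extra dependency ("lxml" should work the same way here).

```python
from bs4 import BeautifulSoup

# Shortened sample in the same shape as the output above (illustrative only).
SAMPLE_MARKUP = (
    '<terms><term><dt><a href="/a-posteriori"><h4>\n'
    ' <hw>a posteriori</hw>\n'
    ' </h4></a></dt><dd>reasoning from facts. Compare '
    '<a href="/a-priori">a priori</a>.</dd></term></terms>'
)


def extract_definitions(markup):
    """Return a list of (headword, definition) text pairs.

    Assumes each <term> element holds one <hw> and one <dd>,
    as in the sample markup; entries missing either are skipped.
    """
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for term in soup.find_all("term"):
        headword = term.find("hw")
        definition = term.find("dd")
        if headword is None or definition is None:
            continue
        # Collapse runs of whitespace and newlines into single spaces.
        head_text = " ".join(headword.get_text().split())
        def_text = " ".join(definition.get_text().split())
        pairs.append((head_text, def_text))
    return pairs


for head, text in extract_definitions(SAMPLE_MARKUP):
    print(head, "->", text)
```

With a live response, `extract_definitions(res.text)` would take the place of the sample string.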

Note that "lxml" is my preferred parser.
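
For the other pages on dictionary.apa.org, the request URL can probably be derived from the page URL. The sketch below assumes the `q` parameter is simply the last path segment of the page URL, which matches the a-posteriori example but has not been checked against other terms; `api_url_for` is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit

# Endpoint taken from the request above.
API_ENDPOINT = (
    "https://4hb3d9itb2.execute-api.us-east-1.amazonaws.com"
    "/prod/getdopdefinitions"
)


def api_url_for(page_url):
    """Build the API URL for a dictionary.apa.org page URL.

    Assumption: the API's q parameter equals the page's last path segment.
    """
    slug = urlsplit(page_url).path.strip("/").split("/")[-1]
    if not slug:
        raise ValueError(f"no term slug found in {page_url!r}")
    return f"{API_ENDPOINT}?{urlencode({'q': slug})}"


print(api_url_for("https://dictionary.apa.org/caffeine-intoxication"))
```

The resulting URL can be passed to `requests.get` in place of the hard-coded one.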

