How to read the whole web page with Python: bs4 / requests reads only the first part of the page

I am trying to scrape the page https://dictionary.apa.org/caffeine-intoxication (or similar pages at dictionary.apa.org) to get the term and its definition.

The following code (and bs4 as well) gets only the <head> part of the page (only the static HTML? 'Save as HTML' in a browser gives the same result):

import requests
url = 'https://dictionary.apa.org/a-posteriori'
response = requests.get(url, allow_redirects=True)
print(response.text)

But I really need the other elements of the page (the <body>):

<terms><term><dt><a href="/caffeine-intoxication"><h4>
<hw>caffeine intoxication</hw>
</h4></a></dt><dd>in <em>DSM–IV–TR</em> and <em>DSM–5</em>, intoxication due to recent consumption of large amounts of caffeine (typically over 250 mg), in the form of coffee, tea, cola, or medications, and involving at least five of the following symptoms: restlessness, nervousness, excitement, insomnia, flushed face, diuresis (increased urination), gastrointestinal complaints, muscle twitching, rambling thought and speech, rapid or irregular heart rhythm, periods of inexhaustibility, or psychomotor agitation. Also called <span style="font-family:arial;font-weight:bold">caffeinism</span>.</dd></term></terms>

How can I get the rest of the page?

This data does not come from the host domain; the page loads it from a separate API endpoint. Try this:

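import requests
from bs4 import BeautifulSoup
# The definitions are served by a separate API endpoint, not by dictionary.apa.org
res = requests.get('https://4hb3d9itb2.execute-api.us-east-1.amazonaws.com/prod/getdopdefinitions?q=a-posteriori')
sp = BeautifulSoup(res.text, 'lxml')
print(sp.find_all('terms'))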

Output:

[<terms><term><dt><a href="/a-posteriori"><h4>
 <hw>a posteriori</hw>
 </h4></a></dt><dd>denoting conclusions derived from observations or other manifest occurrences: reasoning from facts. Compare <a href="/a-priori">a priori</a>. [Latin, “from the latter”]</dd></term></terms>]

Note that 'lxml' is my preferred parser.
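
Since the question asks for the term and its definition rather than the raw markup, here is a minimal sketch of pulling both out of the <terms> fragment. It assumes every entry follows the <term>/<hw>/<dd> layout shown above; parse_terms is a hypothetical helper name, the sample string is an abbreviated copy of the output above, and the built-in 'html.parser' is used only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup


def parse_terms(markup):
    """Return a list of (headword, definition) pairs found in <term> elements.

    parse_terms is a hypothetical helper; it assumes the
    <term> / <hw> / <dd> layout shown in the answer's output.
    """
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for term in soup.find_all("term"):
        hw = term.find("hw")
        dd = term.find("dd")
        if hw is None or dd is None:
            continue  # skip entries that lack a headword or a definition
        headword = hw.get_text(strip=True)
        # Collapse newlines and repeated spaces inside the definition text.
        definition = " ".join(dd.get_text().split())
        pairs.append((headword, definition))
    return pairs


# Abbreviated sample of the fragment shown in the output above.
sample = (
    '<terms><term><dt><a href="/a-posteriori"><h4>\n'
    "<hw>a posteriori</hw>\n"
    "</h4></a></dt><dd>denoting conclusions derived from observations or "
    "other manifest occurrences: reasoning from facts. "
    'Compare <a href="/a-priori">a priori</a>.</dd></term></terms>'
)

for headword, definition in parse_terms(sample):
    print(headword, "->", definition)
```

With a live request, the same function should work on the response body, e.g. parse_terms(res.text), as long as the endpoint keeps returning this structure.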
