
How to read the whole web page with Python: bs4/requests read only the first part of the page

I am trying to scrape a term and its definition from https://dictionary.apa.org/caffeine-intoxication (or similar pages at dictionary.apa.org).

The following code (and bs4 too) gets only the <head> part of the page (only the static HTML? 'Save as HTML' in a browser gives the same result):

import requests
url = 'https://dictionary.apa.org/a-posteriori'
response = requests.get(url, allow_redirects=True)
print(response.text)

But I need the elements that the browser shows in the <body> of the page:

<terms><term><dt><a href="/caffeine-intoxication"><h4>
<hw>caffeine intoxication</hw>
</h4></a></dt><dd>in <em>DSM–IV–TR</em> and <em>DSM–5</em>, intoxication due to recent consumption of large amounts of caffeine (typically over 250 mg), in the form of coffee, tea, cola, or medications, and involving at least five of the following symptoms: restlessness, nervousness, excitement, insomnia, flushed face, diuresis (increased urination), gastrointestinal complaints, muscle twitching, rambling thought and speech, rapid or irregular heart rhythm, periods of inexhaustibility, or psychomotor agitation. Also called <span style="font-family:arial;font-weight:bold">caffeinism</span>.</dd></term></terms>

How can I get the rest of the page?

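This data does not come from the host domain: the page apparently loads the definition from a separate API endpoint, so it never appears in the HTML returned by dictionary.apa.org. Try requesting that endpoint directly: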

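import requests
from bs4 import BeautifulSoup
# Query the API endpoint that serves the definition; the slug goes in the q parameter
res = requests.get('https://4hb3d9itb2.execute-api.us-east-1.amazonaws.com/prod/getdopdefinitions?q=a-posteriori')
sp = BeautifulSoup(res.text, 'lxml')
print(sp.find_all('terms'))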

output:

[<terms><term><dt><a href="/a-posteriori"><h4>
 <hw>a posteriori</hw>
 </h4></a></dt><dd>denoting conclusions derived from observations or other manifest occurrences: reasoning from facts. Compare <a href="/a-priori">a priori</a>. [Latin, “from the latter”]</dd></term></terms>]

Note that 'lxml' is my preferred parser.
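
To get the term and the definition as plain text rather than as tags, the same request can be wrapped in a small helper. This is only a sketch: the function names parse_terms and fetch_terms are made up for illustration, it assumes the endpoint keeps returning the <terms>/<term>/<hw>/<dd> structure shown above, and it uses the built-in 'html.parser' so it runs without lxml installed.

```python
import requests
from bs4 import BeautifulSoup

# Endpoint taken from the answer above; that it stays available is an assumption.
API_URL = "https://4hb3d9itb2.execute-api.us-east-1.amazonaws.com/prod/getdopdefinitions"


def parse_terms(markup):
    """Return a list of (term, definition) text pairs found in <term> elements."""
    soup = BeautifulSoup(markup, "html.parser")
    pairs = []
    for term in soup.find_all("term"):
        hw = term.find("hw")  # headword
        dd = term.find("dd")  # definition
        if hw is None or dd is None:
            continue  # skip entries that do not match the expected structure
        # Drop the tags and collapse runs of whitespace into single spaces.
        headword = " ".join(hw.get_text().split())
        definition = " ".join(dd.get_text().split())
        pairs.append((headword, definition))
    return pairs


def fetch_terms(slug, timeout=10):
    """Fetch the entry for a slug such as 'a-posteriori' and parse it."""
    response = requests.get(API_URL, params={"q": slug}, timeout=timeout)
    response.raise_for_status()
    # Pass the raw bytes so BeautifulSoup works out the encoding itself.
    return parse_terms(response.content)
```

Calling fetch_terms('caffeine-intoxication') should then return a list of (term, definition) tuples, provided the endpoint responds as it did for the answer above. Keeping the parsing in its own function means it can be checked against a saved response without touching the network.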


 