
Can't find text in li under div using BeautifulSoup

I am trying to use BeautifulSoup to get the text of the li items in the ul under a div on this website: https://www.nccn.org/professionals/physician_gls/recently_updated.aspx

But I only get an empty div. My code was:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.nccn.org/professionals/physician_gls/recently_updated.aspx")

soup = BeautifulSoup(page.content, "html.parser")

_div = soup.find("div", {"id": "divRecentlyUpdatedList"})

# Collect the text of every link in every ul inside the div
element = [a.text for ul in _div.find_all("ul") for a in ul.find_all("a")]

The results were:

The HTML is shown in this screenshot: [image: div and ul]

There is also JavaScript immediately after the div I am trying to get the content from:

[image: div and javascript]

I also tried getting every li like this:

l = []
for tag in soup.ul.find_all("a", recursive=True): 
    l.append(tag.text)

But the text I got was not what I wanted. Is the text under that div hidden by the JavaScript?

Any help is welcome. Thank you very much in advance.

The content is loaded into the page asynchronously by JavaScript from the endpoint https://www.nccn.org/professionals/physician_gls/GetRecentlyUpdated.ashx , which returns JSON. requests only downloads the initial HTML and does not run JavaScript, so it never sees that content.

You can request that endpoint directly and parse the JSON instead, e.g.:

import requests

page = requests.get("https://www.nccn.org/professionals/physician_gls/GetRecentlyUpdated.ashx")
data = page.json()  # decode the JSON body; avoids shadowing the built-in name "list"
for item in data['recent_guidelines']:
    print(item['Name'], item['VersionNumber'], item['PublishedDate'])
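
If you want the entries as data rather than printed lines, the parsing step can be separated from the download. The following is only a minimal sketch: the function name summarize_guidelines and the sample payload are made up for illustration, and it assumes the response keeps the shape used above (a "recent_guidelines" list whose items have "Name", "VersionNumber" and "PublishedDate" keys).

```python
import json


def summarize_guidelines(raw_json):
    """Return a (Name, VersionNumber, PublishedDate) tuple per entry.

    Missing keys yield None instead of raising, in case the endpoint's
    format differs from what is assumed here.
    """
    payload = json.loads(raw_json)
    rows = []
    for item in payload.get("recent_guidelines", []):
        rows.append(
            (item.get("Name"), item.get("VersionNumber"), item.get("PublishedDate"))
        )
    return rows


# Hypothetical sample shaped like the response described above; the values are invented.
sample = (
    '{"recent_guidelines": [{"Name": "Example Guideline", '
    '"VersionNumber": "1.2024", "PublishedDate": "01/01/2024"}]}'
)
print(summarize_guidelines(sample))
```

With the real endpoint you would pass page.text from the request above instead of the sample string.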

The problem is actually the opposite of what you guessed: the text is not hidden by JavaScript; the content inside <div id="divRecentlyUpdatedList"> is filled in by JavaScript after an API call.

requests.get does not execute any JavaScript on the page, so we end up with an empty div. To get the rendered content, you need a library that drives a headless browser so that the JavaScript can run, for example requests-html :

from requests_html import HTMLSession
from bs4 import BeautifulSoup

URL = "https://www.nccn.org/professionals/physician_gls/recently_updated.aspx"

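session = HTMLSession()
site = session.get(URL)
# Run the page's JavaScript in a headless browser (Chromium is downloaded on first use)
site.html.render()

# The rendered HTML now includes the content that JavaScript inserted
html = site.html.html

soup = BeautifulSoup(html, 'html.parser')

_div = soup.find("div", {"id": "divRecentlyUpdatedList"})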

Now _div contains the rendered content from the API, and you can continue searching it for the content you want.
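
As a rough sketch of that last step, the helper below collects the link texts from the div and returns an empty list when the div is missing or empty, which is what the plain requests approach runs into. The function name link_texts and both HTML snippets are hypothetical; the ul/li/a layout is assumed from the question, and the real rendered markup may differ.

```python
from bs4 import BeautifulSoup


def link_texts(html, div_id="divRecentlyUpdatedList"):
    """Return the text of every <a> inside the div with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"id": div_id})
    if div is None:
        return []
    return [a.get_text(strip=True) for a in div.find_all("a")]


# Hypothetical markup standing in for the rendered page.
rendered = """
<div id="divRecentlyUpdatedList">
  <ul>
    <li><a href="#">Example Guideline A</a></li>
    <li><a href="#">Example Guideline B</a></li>
  </ul>
</div>
"""
# What the static HTML looks like before any JavaScript has run.
unrendered = '<div id="divRecentlyUpdatedList"></div>'

print(link_texts(rendered))    # ['Example Guideline A', 'Example Guideline B']
print(link_texts(unrendered))  # []
```

In the requests-html example above, you would pass site.html.html as the html argument.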

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please credit the site URL or the original address.

 