
Why is BeautifulSoup's findAll returning an empty list when I search by class?

I am trying to scrape a page by searching for an h2 tag with a given class, but BeautifulSoup returns an empty list. This is the tag I am looking for:

<h2 class="iCIMS_InfoMsg iCIMS_InfoField_Job">

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://careersus-endologix.icims.com/jobs/2034/associate-supplier-quality-engineer/job")
bs0bj = BeautifulSoup(html, "lxml")
nameList = bs0bj.findAll("h2", {"class": "iCIMS_InfoMsg iCIMS_InfoField_Job"})
print(nameList)  # prints []

The content is inside an iframe and is filled in by JavaScript, so it is not present in the response to the initial request. You can request the same URL the page itself uses to load the iframe content (the iframe src). In that response, take the text of the script tag that holds the job info (type application/ld+json) and load it with json. Then extract the description field, which is itself HTML, and pass it back to BeautifulSoup to select the h2 tags. The rest of the description is then available in the second soup object as well, if you need it.

import json

import requests
from bs4 import BeautifulSoup as bs

# Same URL the page uses as the iframe src
r = requests.get('https://careersus-endologix.icims.com/jobs/2034/associate-supplier-quality-engineer/job?mobile=false&width=1140&height=500&bga=true&needsRedirect=false&jan1offset=0&jun1offset=60&in_iframe=1')
soup = bs(r.content, 'lxml')

# The job info is embedded as JSON in a script tag
script = soup.select_one('[type="application/ld+json"]').text
data = json.loads(script)

# The description field is itself HTML, so parse it a second time
description_soup = bs(data['description'], 'lxml')
headers = [item.text for item in description_soup.select('h2')]
print(headers)
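
The two-stage parse can also be tried offline. The sketch below runs the same idea against a small hand-written page; the sample HTML, the heading texts and the helper name h2_headings_from_jsonld are invented for illustration and are not the real iCIMS markup. It uses the built-in "html.parser" backend instead of lxml only to avoid the extra dependency.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical stand-in for the iframe response (not the real iCIMS page).
description_html = (
    '<h2 class="iCIMS_InfoMsg iCIMS_InfoField_Job">Overview</h2>'
    "<p>Some text.</p>"
    '<h2 class="iCIMS_InfoMsg iCIMS_InfoField_Job">Responsibilities</h2>'
)
sample_page = (
    '<html><head><script type="application/ld+json">'
    + json.dumps({"title": "Example Job", "description": description_html})
    + "</script></head><body></body></html>"
)


def h2_headings_from_jsonld(page_html):
    """Return h2 texts found in the JSON-LD 'description' field of a page."""
    outer = BeautifulSoup(page_html, "html.parser")
    script = outer.find("script", attrs={"type": "application/ld+json"})
    if script is None or not script.string:
        return []  # no JSON-LD block, so nothing to extract
    data = json.loads(script.string)
    # The description is an HTML fragment, so it needs its own soup.
    inner = BeautifulSoup(data.get("description", ""), "html.parser")
    return [h2.get_text(strip=True) for h2 in inner.find_all("h2")]


# Searching the outer document directly finds nothing, because the h2 tags
# only exist as text inside the script element.
outer_h2 = BeautifulSoup(sample_page, "html.parser").find_all("h2")
headings = h2_headings_from_jsonld(sample_page)
print(outer_h2, headings)
```

Returning an empty list when the script tag is missing is a design choice for this sketch; raising an error would be just as reasonable if a missing block should stop the scrape.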


The answer lies in two details:

  1. The content is rendered by JavaScript, after document.onload.
  2. In particular, the JavaScript-managed content comes after the comment that marks the start of the block: "< ! - -BEGIN ICIMS - - >" (spaces added so the comment stays visible here).

As you can imagine, the h2 with the iCIMS classes does NOT exist at the moment you call the bs4 methods.

The solution? IMHO the best way to get what you want is to use Selenium, so that you work with the fully rendered page.
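
A minimal sketch of the Selenium route, assuming Chrome and a matching driver are available locally. The function names, the timeout, and the guess that the job content sits in the first iframe on the page are assumptions made for illustration; none of them come from the original page. The parsing helper is kept separate so it can be checked without a browser.

```python
from bs4 import BeautifulSoup


def job_headings(html):
    """Return the text of every h2 carrying the iCIMS_InfoField_Job class."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        h2.get_text(strip=True)
        for h2 in soup.find_all("h2", class_="iCIMS_InfoField_Job")
    ]


def fetch_rendered_iframe_html(url, timeout=15):
    """Load url in headless Chrome and return the rendered iframe source."""
    # Imported here so job_headings() works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)
        # Assumption: the job content lives in the first <iframe> on the page.
        wait.until(
            EC.frame_to_be_available_and_switch_to_it((By.TAG_NAME, "iframe"))
        )
        # Wait until JavaScript has actually inserted the h2 we want.
        wait.until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "h2.iCIMS_InfoField_Job")
            )
        )
        return driver.page_source
    finally:
        driver.quit()
```

With a browser available, job_headings(fetch_rendered_iframe_html(url)) would be called with the job URL from the question. The explicit waits matter here: reading page_source right after driver.get() can hit the same problem as urlopen, because the script may not have run yet.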

See also: Web-scraping JavaScript page with Python


 