
For the BeautifulSoup specialists: How do I scrape a page with multiple panes?

Here is a link to the page that I'm trying to scrape:

https://www.simplyhired.ca/search?q=data+analyst&l=Vancouver%2C+BC&job=grivOJsfWcVasT2RpqgQ_YBEs-tw6BCz9INhDIHbT92XtKCbBcXP8g%27

More specifically, I'm trying to scrape the 'Qualifications' element on the page.

When I print the soup object, I do not see the HTML code for the right pane.

Any thoughts on how I could access these elements?

Thanks in advance!

The DOM elements of the page you're trying to scrape are populated asynchronously using JavaScript. In other words, the information you're after is not baked into the HTML at the time the server serves the page document to you, so BeautifulSoup can't see it. The document you get back is just a "bare bones" template. When viewed in a browser, as intended, that template is filled in by JavaScript, which pulls the required information from various other places.

You can expect most modern, dynamic websites to be implemented this way. BeautifulSoup only works for pages whose content is already in the HTML when the server sends it. The fact that some elements of the page take a moment to load when viewed in a browser is an instant giveaway: any time you see that, your first thought should be "the DOM is populated asynchronously using JavaScript, so BeautifulSoup on its own won't work for this". If it's a Single-Page Application, you can forget BeautifulSoup.
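A quick way to confirm this for any page is to check whether the text you want appears in the raw HTML the server sends. The following is a minimal sketch using only the standard library; the two HTML strings are made-up stand-ins for a server-rendered page and a JavaScript template, and the helper names are my own rather than anything from the site.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def text_in_static_html(html, needle):
    """Return True if `needle` appears in the text of the served HTML."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return needle.lower() in " ".join(parser.chunks).lower()


# Hypothetical documents standing in for what a server might send.
static_page = "<html><body><h2>Qualifications</h2><li>SQL</li></body></html>"
template_page = (
    "<html><body><div id='app'></div>"
    "<script>fetch('/api/job').then(render)</script></body></html>"
)

print(text_in_static_html(static_page, "Qualifications"))    # content is in the HTML
print(text_in_static_html(template_page, "Qualifications"))  # content arrives later via JS
```

For a real page you would pass in the body of the response (for example `requests.get(url).text`). If the check comes back `False` for text you can plainly see in the browser, the content is being loaded by JavaScript.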

Upon visiting the page in my browser, I logged my network traffic and saw that it made multiple XHR (XMLHttpRequest) HTTP GET requests, one of which was to a REST API that serves JSON containing all the job information you're looking for. All you need to do is imitate that HTTP GET request to the same API URL, with the same query-string parameters (the API doesn't seem to care about request headers, which is nice). No BeautifulSoup or Selenium required:

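import requests


def main():
    # JSON endpoint that the page itself calls via XHR to fill the right pane.
    url = "https://www.simplyhired.ca/api/job"

    # Query-string parameters copied from the browser's request.
    # requests URL-encodes the values itself, so pass the raw text
    # ("data analyst"); a pre-encoded "data%20analyst" would be
    # double-encoded to "data%2520analyst".
    params = {
        "key": "grivOJsfWcVasT2RpqgQ_YBEs-tw6BCz9INhDIHbT92XtKCbBcXP8g",
        "isp": "0",
        "al": "1",
        "ia": "0",
        "tk": "1f4aknr5vs7aq800",
        "tkt": "serp",
        "from": "manual",
        "jatk": "",
        "q": "data analyst"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()  # fail loudly on a 4xx/5xx response

    # The 'Qualifications' shown in the right pane live under this key.
    print(response.json()["skillEntities"])

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())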

Output:

["Tableau", "SQL"]

For more information about logging your network traffic, finding the API URL, and exploring all the information available to you in the JSON response, take a look at one of my other answers where I go more in depth.
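To get a feel for what else a JSON response like this contains, it can help to list its top-level keys with a short description of each value. This is only a sketch: the sample payload below is hypothetical (only the `skillEntities` key and its values come from the answer above), and `summarize` is a helper name I made up.

```python
import json


def summarize(payload, max_len=40):
    """Map each top-level key to a short description of its value."""
    summary = {}
    for key, value in payload.items():
        if isinstance(value, (list, dict)):
            # Containers: report the type and size instead of dumping them.
            summary[key] = f"{type(value).__name__} with {len(value)} item(s)"
        else:
            # Scalars: show the value itself, truncated if it is long.
            text = repr(value)
            summary[key] = text if len(text) <= max_len else text[:max_len] + "..."
    return summary


# Hypothetical payload; only "skillEntities" is taken from the answer above.
sample = json.loads(
    '{"skillEntities": ["Tableau", "SQL"], "title": "Data Analyst", "meta": {"a": 1}}'
)

for key, description in summarize(sample).items():
    print(f"{key}: {description}")
```

With a live response you would call `summarize(response.json())` instead of using the sample, then drill into whichever keys look relevant.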


 