
Problem with data extraction from Indeed by BeautifulSoup

I'm trying to extract the job description for each posting on the Indeed website, but the result is not what I expected.

I've written some code to get the job descriptions. I'm working with Python 2.7 and the latest BeautifulSoup. When you open the page and click on a job title, the related information appears on the right side of the screen. I need to extract that job description for each job on this page. My code:

import sys
import urllib2
from BeautifulSoup import BeautifulSoup

url = "https://www.indeed.com/jobs?q=construction%20manager&l=Houston%2C%20TX&vjk=8000b2656aae5c08"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
N = soup.findAll("div", {"id" : "vjs-desc"})
print N

I expected to see the descriptions, but instead I got [] as the result. Is it because the id is non-unique? If so, how should I edit the code?

The #vjs-desc element is generated by JavaScript, and its content comes from an Ajax request. To get the descriptions, you need to make that request yourself.
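One way to confirm this diagnosis is to check whether the id appears anywhere in the HTML the server actually returns. The sketch below uses only the standard library and is written for Python 3; `IdCollector`, `has_element_id`, and the `SAMPLE_HTML` snippet are hypothetical names and data made up for illustration, not Indeed's real markup.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects every id attribute found on tags in the static HTML."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.add(value)


def has_element_id(html, element_id):
    """Return True if some tag in the raw HTML carries this id."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return element_id in collector.ids


# Hypothetical stand-in for a downloaded page: the job list is present,
# but the description panel is only created later by the script.
SAMPLE_HTML = """
<html><body>
  <div id="resultsCol">job list</div>
  <script>var jobKeysWithInfo = {}; jobKeysWithInfo['8000b2656aae5c08'] = true;</script>
</body></html>
"""

print(has_element_id(SAMPLE_HTML, "resultsCol"))
print(has_element_id(SAMPLE_HTML, "vjs-desc"))
```

If the id is missing from the downloaded HTML, no parser will find it, whether or not the id is unique. An empty list from findAll is then the expected result rather than a bug in the query.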

# -*- coding: utf-8 -*-

# requests makes it easier to create HTTP requests and sessions
import requests
import re, urllib
# the 'html.parser' argument used below requires BeautifulSoup 4 (the bs4 package)
from bs4 import BeautifulSoup

url = "https://www......"

# create session
s = requests.session()
html = s.get(url).text

# extract job IDs
job_ids = ','.join(re.findall(r"jobKeysWithInfo\['(.+?)'\]", html))
ajax_url = 'https://www.indeed.com/rpc/jobdescs?jks=' + urllib.quote(job_ids)
# make the Ajax request and parse the JSON response
ajax_content = s.get(ajax_url).json()
print(ajax_content)

for job_id, desc in ajax_content.items():
    print job_id
    soup = BeautifulSoup(desc, 'html.parser')
    # or try this
    # soup = BeautifulSoup(desc.decode('unicode-escape'), 'html.parser')
    print soup.text.encode('utf-8')
    print('==============================')
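For readers on Python 3, the non-network parts of the approach above could be sketched as follows. This is a minimal sketch, not a tested scraper: the `jobKeysWithInfo` pattern and the `/rpc/jobdescs` endpoint are taken from the answer as written and may have changed on the live site, while the helper names (`extract_job_ids`, `build_ajax_url`, `html_to_text`) and the sample data are hypothetical. Dropping duplicate keys is an addition that the original code does not make.

```python
import re
from html.parser import HTMLParser
from urllib.parse import quote

# Pattern taken from the answer above; the live page may no longer use it.
JOB_KEY_PATTERN = re.compile(r"jobKeysWithInfo\['(.+?)'\]")


def extract_job_ids(html):
    """Return the job keys found in the page source, in order, without repeats."""
    job_ids = []
    for key in JOB_KEY_PATTERN.findall(html):
        if key not in job_ids:
            job_ids.append(key)
    return job_ids


def build_ajax_url(job_ids):
    """Build the description URL; the endpoint is the one quoted in the answer."""
    return "https://www.indeed.com/rpc/jobdescs?jks=" + quote(",".join(job_ids))


class _TextExtractor(HTMLParser):
    """Keeps only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Strip tags from a description and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Hypothetical sample data standing in for the page source and the JSON reply.
sample_page = "jobKeysWithInfo['8000b2656aae5c08'] = true; jobKeysWithInfo['0123456789abcdef'] = true;"
sample_reply = {"8000b2656aae5c08": "<p>Manage sites.</p><ul><li>Plan</li><li>Build</li></ul>"}

ids = extract_job_ids(sample_page)
print(build_ajax_url(ids))
for job_id, desc in sample_reply.items():
    print(job_id, html_to_text(desc))
```

In a real run, the page source and the JSON reply would come from a requests session as in the code above. Note that `html_to_text` puts a space between every pair of text nodes, so a word split by an inline tag comes out as two words.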

