How can I scrape career path job titles from this JavaScript page using Python?
Here is my code snippet. The soup it returns does not contain any of the text data I need!
import requests
from bs4 import BeautifulSoup
import json
import re
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# get BeautifulSoup object
def get_soup(url):
    """
    This function returns the BeautifulSoup object.
    Parameters:
        url: the link to get soup object for
    Returns:
        soup: BeautifulSoup object
    """
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    return soup
# get selenium driver object
def get_selenium_driver():
    """
    This function returns the selenium driver object.
    Parameters:
        None
    Returns:
        driver: selenium driver object
    """
    options = webdriver.FirefoxOptions()
    options.add_argument('-headless')
    # `options=` replaces the deprecated `firefox_options=`; geckodriver is expected on PATH
    driver = webdriver.Firefox(options=options)
    return driver
# get soup obj using selenium
def get_soup_using_selenium(url):
    """
    Given the url of a page, this function returns the soup object.
    Parameters:
        url: the link to get soup object for
    Returns:
        soup: soup object
    """
    options = webdriver.FirefoxOptions()
    options.add_argument('-headless')
    # `options=` replaces the deprecated `firefox_options=`; geckodriver is expected on PATH
    driver = webdriver.Firefox(options=options)
    driver.get(url)
    driver.implicitly_wait(3)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    # quit() ends the whole browser session, not just the current window
    driver.quit()
    return soup
title = "PHP%2BDeveloper"
location = "San%2BDiego,%2BCalifornia,%2BUs,%2BCA"
years_of_experirence = "0"
sort_by_filter = "mostProbableTransition"
url = "https://www.dice.com/career-paths?title={}&location={}&experience={}&sortBy={}".format(title, location, years_of_experirence , sort_by_filter)
career_paths_page_soup = get_soup(url)
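As another user mentioned in the comments, requests will not work for you here. With Selenium, however, you can use WebDriverWait to make sure all of the page content has loaded, and then use element.text to get the text of an element.
The following code snippet prints the career path strings on the left side of the page: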
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# navigate to the page
driver = get_selenium_driver()
driver.get(url)
# wait for loading indicator to be hidden
WebDriverWait(driver, 10).until(EC.invisibility_of_element((By.XPATH, "//*[contains(text(), 'Loading data')]")))
# wait for content to load
career_path_elements = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='abcd']/ul/li")))
# print out career paths
for element in career_path_elements:
    # get title attribute that usually contains career path text
    title = element.get_attribute("title")
    # sometimes career path is in span below this element
    if not title:
        # find the element and print its text
        # (find_element(By.XPATH, ...) replaces find_element_by_xpath, which Selenium 4 removed)
        span_element = element.find_element(By.XPATH, "span[not(contains(@class, 'currentJobHead'))]")
        print(span_element.text)
    # print title in other cases
    else:
        print(title)
# end the browser session once scraping is done
driver.quit()
This prints the following:
PHP Developer
Drupal Developer
Web Developer
Full Stack Developer
Back-End Developer
Full Stack PHP Developer
IT Director
Software Development Manager
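There are a few interesting points here. The main one is that this page loads its content with JavaScript: when the page first opens, a "Loading data..." indicator appears. Before we try to locate any page content, we have to wait on EC.invisibility_of_element for this indicator to make sure it has disappeared.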
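To make the waiting step more concrete, here is a minimal sketch of the polling idea behind an explicit wait. It is not Selenium's implementation; `wait_until` and `FakePage` are hypothetical names invented for this illustration, and the fake page simply reports its loading indicator as gone on the third check.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakePage:
    """Hypothetical stand-in for a page whose loading indicator vanishes later."""

    def __init__(self, checks_until_loaded=3):
        self.checks_left = checks_until_loaded

    def loading_indicator_gone(self):
        # Each call simulates one poll of the page state.
        self.checks_left -= 1
        return self.checks_left <= 0


page = FakePage()
loaded = wait_until(page.loading_indicator_gone, timeout=2.0, poll_interval=0.01)
print(loaded)
```

In the real script, WebDriverWait(driver, 10).until(...) plays the role of `wait_until`, and the expected condition plays the role of the callable that is polled.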
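After that, we call WebDriverWait again, this time on the "Career path" elements on the right side of the page. This WebDriverWait call returns a list of elements, which is stored in career_path_elements. We can loop over this list to print the career path for each item.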
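Each career path element holds its career path text in the title attribute, so we call element.get_attribute("title") to get that text. There is one special case, however: for the "current job" item, the career path text sits one level lower, in a span. We handle the case where title is empty by locating that span with an XPath lookup relative to the element. This ensures we can print every career path item on the page.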
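The title-or-span fallback can be tried offline on a static snapshot. The sketch below uses only the standard library; `SAMPLE_HTML` is a hypothetical, simplified piece of markup written for this illustration (the real page structure may differ), and `career_path_text` is an invented helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified snapshot of the rendered list; the real markup may differ.
SAMPLE_HTML = """
<ul>
  <li><span class="currentJobHead">Current job</span><span>PHP Developer</span></li>
  <li title="Drupal Developer"></li>
  <li title="Web Developer"></li>
</ul>
"""


def career_path_text(item):
    """Return the title attribute, or the text of the first non-header span."""
    title = item.get("title")
    if title:
        return title
    # Fallback: skip the span marked as the "current job" header.
    for span in item.findall("span"):
        if "currentJobHead" not in span.get("class", ""):
            return (span.text or "").strip()
    return ""


items = ET.fromstring(SAMPLE_HTML.strip()).findall("li")
career_paths = [career_path_text(item) for item in items]
print(career_paths)
```

Returning the strings in a list, rather than printing inside the loop, makes the result easier to reuse or check.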