
How can I scrape career path job titles from this javascript page using Python


How can I use Python to scrape the career path job titles from this JavaScript-rendered page?

' https://www.dice.com/career-paths?title=PHP%2BDeveloper&location=San%2BDiego,%2BCalifornia,%2BUs,%2BCA&experience=0&sortBy=mostProbableTransition '

Here is my code snippet. The soup it returns does not contain any of the text data I need!

import requests
from bs4 import BeautifulSoup
import json
import re
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


# get BeautifulSoup object
def get_soup(url):
    """
    This function returns the BeautifulSoup object.

    Parameters:
        url: the link to get soup object for

    Returns:
        soup: BeautifulSoup object
    """
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    return soup

# get selenium driver object
def get_selenium_driver():
    """
    This function returns the selenium driver object.

    Parameters:
        None

    Returns:
        driver: selenium driver object
    """
    options = webdriver.FirefoxOptions()
    options.add_argument('-headless')

    # geckodriver must be on PATH; the options= keyword replaces the deprecated firefox_options=
    driver = webdriver.Firefox(options=options)

    return driver

# get soup obj using selenium
def get_soup_using_selenium(url):
    """
    Given the url of a page, this function returns the soup object.

    Parameters:
        url: the link to get soup object for

    Returns:
        soup: soup object
    """
    options = webdriver.FirefoxOptions()
    options.add_argument('-headless')

    # geckodriver must be on PATH; the options= keyword replaces the deprecated firefox_options=
    driver = webdriver.Firefox(options=options)
    driver.get(url)
    driver.implicitly_wait(3)

    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')

    # quit() ends the whole browser session, not just the current window
    driver.quit()

    return soup




title = "PHP%2BDeveloper"
location = "San%2BDiego,%2BCalifornia,%2BUs,%2BCA"
years_of_experience = "0"
sort_by_filter = "mostProbableTransition"

url = "https://www.dice.com/career-paths?title={}&location={}&experience={}&sortBy={}".format(title, location, years_of_experience, sort_by_filter)
career_paths_page_soup = get_soup(url)

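As another user mentioned in the comments, requests will not work for you here, because the content is rendered by JavaScript after the initial HTML is delivered. With Selenium, however, you can use WebDriverWait to make sure the page content has finished loading, and then read the text of each element with element.text.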

The following code snippet prints the career path strings on the left-hand side of the page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# navigate to the page
driver = get_selenium_driver()
driver.get(url)

# wait for loading indicator to be hidden
WebDriverWait(driver, 10).until(EC.invisibility_of_element((By.XPATH, "//*[contains(text(), 'Loading data')]")))

# wait for content to load
career_path_elements = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='abcd']/ul/li")))

# print out career paths
for element in career_path_elements:

    # get the title attribute, which usually contains the career path text
    title = element.get_attribute("title")

    # sometimes the career path is in a span below this element
    if not title:

        # find the span and print its text
        span_element = element.find_element(By.XPATH, "span[not(contains(@class, 'currentJobHead'))]")
        print(span_element.text)

    # print the title in all other cases
    else:
        print(title)

# close the browser once scraping is done
driver.quit()

This prints the following:

PHP Developer
Drupal Developer
Web Developer
Full Stack Developer
Back-End Developer
Full Stack PHP Developer
IT Director
Software Development Manager

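There are a few interesting things here. The main one is the JavaScript loading on this page: when the page is first opened, a "Loading data..." indicator appears. Before trying to locate any page content, we must wait on EC.invisibility_of_element for this item to make sure it has disappeared.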
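To make the waiting behaviour concrete, here is a minimal sketch of the polling idea behind a call like WebDriverWait(driver, 10).until(...): check a condition repeatedly until it holds or a timeout expires. The names wait_until and FakePage are hypothetical stand-ins introduced only for illustration; they are not part of Selenium, and the real WebDriverWait raises Selenium's own TimeoutException rather than the built-in TimeoutError used here.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Hypothetical helper that imitates the polling idea behind
    WebDriverWait(...).until(...); it is not Selenium API.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakePage:
    """Stand-in for the page: the loading indicator vanishes after a few checks."""

    def __init__(self, checks_until_loaded=3):
        self.remaining = checks_until_loaded

    def loading_indicator_hidden(self):
        self.remaining -= 1
        return self.remaining <= 0


page = FakePage()
loaded = wait_until(page.loading_indicator_hidden, timeout=2.0, poll_interval=0.01)
print(loaded)
```

In the real script the condition is supplied by expected_conditions, so there is no need to write this loop yourself; the sketch only shows why the content lookup must come after the wait.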

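After that, we call WebDriverWait again, this time on the "Career path" elements on the right-hand side of the page. This WebDriverWait call returns a list of elements, which is stored in career_path_elements. We can loop over this list to print the career path of each item.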

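Each career path element usually contains the career path text in its title attribute, so we call element.get_attribute("title") to get that text. There is a special case for the "current job" item, however, where the career path text sits one level lower, in a span. We handle the case where title is empty by locating that span from the element with an XPath lookup. This ensures we print every career path item on the page.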
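The title-or-span fallback described above can also be pulled into a small function that returns the text instead of printing it, which makes it easier to collect the results in a list. This is only a sketch: career_path_text, FakeSpan, and FakeElement are hypothetical names, and the fake classes merely imitate the two WebElement members the function relies on (get_attribute and find_element) so the example runs without a browser. The string "xpath" is the value of Selenium's By.XPATH.

```python
# Locator strategy string; Selenium's By.XPATH has this same value.
XPATH = "xpath"
SPAN_XPATH = "span[not(contains(@class, 'currentJobHead'))]"


def career_path_text(element):
    """Return the career path text for one list item (hypothetical helper)."""
    title = element.get_attribute("title")
    if title:
        return title
    # Fallback for the "current job" item: the text lives in a child span.
    return element.find_element(XPATH, SPAN_XPATH).text


class FakeSpan:
    """Minimal stand-in for a span WebElement."""

    def __init__(self, text):
        self.text = text


class FakeElement:
    """Minimal stand-in for an li WebElement, for demonstration only."""

    def __init__(self, title="", span_text=""):
        self._title = title
        self._span = FakeSpan(span_text)

    def get_attribute(self, name):
        return self._title if name == "title" else None

    def find_element(self, by, value):
        return self._span


elements = [
    FakeElement(span_text="PHP Developer"),   # current job: empty title
    FakeElement(title="Drupal Developer"),    # regular item: title set
]
titles = [career_path_text(element) for element in elements]
print(titles)
```

With a real driver, the same function could be applied to each item in career_path_elements.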


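Notice: The technical posts on this site are licensed under CC BY-SA 4.0. If you republish this content, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.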

 