
Indeed Web Scraping - Python, Selenium, BeautifulSoup

I'm trying to scrape job postings in my field so I can analyze which skills are most in demand. Everything works so far except for the job description.

from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup

driver = webdriver.Chrome("./chromedriver")
driver.maximize_window()
dataframe = pd.DataFrame(columns=["Title", "Location", "Company", "Salary", "Description"])

for i in range(0, 10, 10):

    driver.get("https://www.indeed.com/jobs?q=Senior+Software+Engineer&l=Philadelphia%2C+PA&start=" + str(i))
    driver.implicitly_wait(5)

    all_jobs = driver.find_elements_by_class_name('result')

    for job in all_jobs:

        result_html = job.get_attribute('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')

        try:
            title = soup.find("a", class_="jobtitle").text.replace('\n', '')
        except:
            title = 'None'

        try:
            location = soup.find(class_="location").text
        except:
            location = 'None'

        try:
            company = soup.find(class_="company").text.replace("\n", "").strip()
        except:
            company = 'None'

        try:
            salary = soup.find(class_="salary").text.replace("\n", "").strip()
        except:
            salary = 'None'

        sum_div = job.find_elements_by_class_name("summary")[0]
        try:
            sum_div.click()
        except:
            close_button = driver.find_elements_by_class_name("popover-x-button-close")[0]
            close_button.click()
            sum_div.click()

        try:
            jd = driver.find_element_by_id('vjs-desc').text
        except:
            jd = 'None'

        dataframe = dataframe.append({'Title': title,
                                      'Location': location,
                                      "Company": company,
                                      "Salary": salary,
                                      "Description": jd},
                                     ignore_index=True)

dataframe.to_csv("c.csv", index=False)

I've tried different selectors, such as jobDescriptionText and jobsearch-jobDescriptionText. I've also tried finding the element by XPath. If I remove the try/except around the "jd" variable, every attempt raises a NoSuchElementException. Any and all help is appreciated.
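As an aside, the four near-identical try/except blocks in the code above can be collapsed into one small helper. This is a minimal sketch, not part of the original post; the sample HTML and the `text_or_none` name are illustrative, and it keeps the same 'None' fallback string the script uses:

```python
from bs4 import BeautifulSoup

def text_or_none(soup, cls, tag=None):
    """Return the stripped text of the first element with class `cls`, or 'None'."""
    el = soup.find(tag, class_=cls) if tag else soup.find(class_=cls)
    return el.get_text(strip=True) if el else 'None'

# Illustrative markup standing in for one job card's innerHTML
html = '<div><a class="jobtitle">Senior Engineer</a><span class="company">Acme</span></div>'
soup = BeautifulSoup(html, 'html.parser')
print(text_or_none(soup, 'jobtitle', 'a'))  # Senior Engineer
print(text_or_none(soup, 'salary'))         # None (element missing)
```

Each of the title/location/company/salary lookups then becomes a single call instead of a four-line try/except block.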

Try:

from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup

driver = webdriver.Firefox(executable_path="c:/program/geckodriver.exe")
driver.maximize_window()
dataframe = pd.DataFrame(columns=["Title", "Location", "Company", "Salary", "Description"])

for i in range(0, 10, 10):

    driver.get("https://www.indeed.com/jobs?q=Senior+Software+Engineer&l=Philadelphia%2C+PA&start=" + str(i))
    driver.implicitly_wait(5)

    all_jobs = driver.find_elements_by_class_name('result')

    for job in all_jobs:

        result_html = job.get_attribute('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')

        try:
            title = soup.find("a", class_="jobtitle").text.replace('\n', '')
        except:
            title = 'None'

        try:
            location = soup.find(class_="location").text
        except:
            location = 'None'

        try:
            company = soup.find(class_="company").text.replace("\n", "").strip()
        except:
            company = 'None'

        try:
            salary = soup.find(class_="salary").text.replace("\n", "").strip()
        except:
            salary = 'None'

        sum_div = job.find_elements_by_class_name("summary")[0]
        try:
            sum_div.click()
        except:
            close_button = driver.find_elements_by_class_name("popover-x-button-close")[0]
            close_button.click()
            sum_div.click()
        try:
            # the changed line: locate the description pane via a CSS selector
            jd = driver.find_element_by_css_selector('div#vjs-desc').text
        except:
            jd = 'None'

        dataframe = dataframe.append({'Title': title,
                                      'Location': location,
                                      "Company": company,
                                      "Salary": salary,
                                      "Description": jd},
                                     ignore_index=True)

dataframe.to_csv("c.csv", index=False)
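A note for readers on newer library versions: the `find_element_by_*` methods used above were removed in Selenium 4 in favour of `driver.find_element(By.CSS_SELECTOR, 'div#vjs-desc')`, and `DataFrame.append` was removed in pandas 2.0. A version-proof way to build the table is to collect plain dicts in a list and construct the DataFrame once at the end. A minimal sketch, with placeholder values standing in for the scraped fields:

```python
import pandas as pd

rows = []
# Inside the scraping loop, append one dict per job instead of calling DataFrame.append
for title, company in [("Senior Software Engineer", "Acme"),
                       ("Staff Engineer", "Globex")]:  # placeholder scraped values
    rows.append({"Title": title, "Company": company})

# Build the DataFrame once, after the loop
dataframe = pd.DataFrame(rows, columns=["Title", "Company"])
dataframe.to_csv("c.csv", index=False)
```

Besides surviving the pandas API change, this avoids the quadratic cost of re-copying the DataFrame on every `append` call.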


Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; if you reprint them, please credit this site or the original source. For any questions contact: yoyou2525@163.com.

 