
Indeed Web Scraping - Python, Selenium, BeautifulSoup

I'm trying to scrape job postings in my field so I can analyze which skills are most in demand. Everything works so far except for the job description.

from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup

driver = webdriver.Chrome("./chromedriver")
driver.maximize_window()
dataframe = pd.DataFrame(columns=["Title", "Location", "Company", "Salary", "Description"])

for i in range(0, 10, 10):

    driver.get("https://www.indeed.com/jobs?q=Senior+Software+Engineer&l=Philadelphia%2C+PA&start=" + str(i))
    driver.implicitly_wait(5)

    all_jobs = driver.find_elements_by_class_name('result')

    for job in all_jobs:

        result_html = job.get_attribute('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')

        try:
            title = soup.find("a", class_="jobtitle").text.replace('\n', '')
        except:
            title = 'None'

        try:
            location = soup.find(class_="location").text
        except:
            location = 'None'

        try:
            company = soup.find(class_="company").text.replace("\n", "").strip()
        except:
            company = 'None'

        try:
            salary = soup.find(class_="salary").text.replace("\n", "").strip()
        except:
            salary = 'None'

        sum_div = job.find_elements_by_class_name("summary")[0]
        try:
            sum_div.click()
        except:
            close_button = driver.find_elements_by_class_name("popover-x-button-close")[0]
            close_button.click()
            sum_div.click()

        try:
            jd = driver.find_element_by_id('vjs-desc').text
        except:
            jd = 'None'

        dataframe = dataframe.append({'Title': title,
                                      'Location': location,
                                      "Company": company,
                                      "Salary": salary,
                                      "Description": jd},
                                     ignore_index=True)

dataframe.to_csv("c.csv", index=False)

I've tried different selectors, such as jobDescriptionText and jobsearch-jobDescriptionText. I've also tried finding the element by XPath. If I remove the try/except around the "jd" variable, every attempt raises a NoSuchElementException. Any and all help is appreciated.
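As an aside, the four near-identical try/except blocks in the code above can be collapsed into one small helper. This is a minimal sketch, not part of the original post; the sample HTML and the `text_or_none` name are illustrative, and it keeps the same 'None' fallback string the script uses:

```python
from bs4 import BeautifulSoup

def text_or_none(soup, cls, tag=None):
    """Return the stripped text of the first element with class `cls`, or 'None'."""
    el = soup.find(tag, class_=cls) if tag else soup.find(class_=cls)
    return el.get_text(strip=True) if el else 'None'

# Illustrative markup standing in for one job card's innerHTML
html = '<div><a class="jobtitle">Senior Engineer</a><span class="company">Acme</span></div>'
soup = BeautifulSoup(html, 'html.parser')
print(text_or_none(soup, 'jobtitle', 'a'))  # Senior Engineer
print(text_or_none(soup, 'salary'))         # None (element missing)
```

Each of the title/location/company/salary lookups then becomes a single call instead of a four-line try/except block.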

Try:

from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup

driver = webdriver.Firefox(executable_path="c:/program/geckodriver.exe")
driver.maximize_window()
dataframe = pd.DataFrame(columns=["Title", "Location", "Company", "Salary", "Description"])

for i in range(0, 10, 10):

    driver.get("https://www.indeed.com/jobs?q=Senior+Software+Engineer&l=Philadelphia%2C+PA&start=" + str(i))
    driver.implicitly_wait(5)

    all_jobs = driver.find_elements_by_class_name('result')

    for job in all_jobs:

        result_html = job.get_attribute('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')

        try:
            title = soup.find("a", class_="jobtitle").text.replace('\n', '')
        except:
            title = 'None'

        try:
            location = soup.find(class_="location").text
        except:
            location = 'None'

        try:
            company = soup.find(class_="company").text.replace("\n", "").strip()
        except:
            company = 'None'

        try:
            salary = soup.find(class_="salary").text.replace("\n", "").strip()
        except:
            salary = 'None'

        sum_div = job.find_elements_by_class_name("summary")[0]
        try:
            sum_div.click()
        except:
            close_button = driver.find_elements_by_class_name("popover-x-button-close")[0]
            close_button.click()
            sum_div.click()
        try:
            # the changed line: locate the description pane via a CSS selector
            jd = driver.find_element_by_css_selector('div#vjs-desc').text
        except:
            jd = 'None'

        dataframe = dataframe.append({'Title': title,
                                      'Location': location,
                                      "Company": company,
                                      "Salary": salary,
                                      "Description": jd},
                                     ignore_index=True)

dataframe.to_csv("c.csv", index=False)
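A note for readers on newer library versions: the `find_element_by_*` methods used above were removed in Selenium 4 in favour of `driver.find_element(By.CSS_SELECTOR, 'div#vjs-desc')`, and `DataFrame.append` was removed in pandas 2.0. A version-proof way to build the table is to collect plain dicts in a list and construct the DataFrame once at the end. A minimal sketch, with placeholder values standing in for the scraped fields:

```python
import pandas as pd

rows = []
# Inside the scraping loop, append one dict per job instead of calling DataFrame.append
for title, company in [("Senior Software Engineer", "Acme"),
                       ("Staff Engineer", "Globex")]:  # placeholder scraped values
    rows.append({"Title": title, "Company": company})

# Build the DataFrame once, after the loop
dataframe = pd.DataFrame(rows, columns=["Title", "Company"])
dataframe.to_csv("c.csv", index=False)
```

Besides surviving the pandas API change, this avoids the quadratic cost of re-copying the DataFrame on every `append` call.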


Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; if you reprint them, please credit this site or the original source. For any questions contact: yoyou2525@163.com.

 