
Scraping Job Descriptions off of LinkedIn

I created a Python script that uses the Selenium library to scrape:

  1. Job title
  2. Company name
  3. Job location
  4. Job description (which I need help with) off of the LinkedIn job search section

I created a for loop to iterate over the (25) jobs to pull out each job's description, using the same class name that every description uses. I've successfully been able to pull out (1) description, but haven't been able to pull out the descriptions of the remaining (24) jobs. I assume the loop isn't able to parse over each section, but if it can successfully pull out (1) description, why aren't the other descriptions appearing?

import pandas as pd
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# This will open a new Chrome page to test specified url on (for scraping)
browser=webdriver.Chrome("My Chrome Path")
browser.get("https://www.linkedin.com")

# Requires user to enter username and password
username=browser.find_element_by_id("session_key")
username.send_keys("ENTER USERNAME")
password=browser.find_element_by_id("session_password")
password.send_keys("ENTER PASSWORD")

# Once username and password are entered, this will automatically click the submit button to login into LinkedIn
login_button=browser.find_element_by_class_name("sign-in-form__submit-button")
login_button.click()

# This is the URL to test the jobs I want to scrape from
browser.get("https://www.linkedin.com/jobs/search/?keywords=software%20developer")

# This will scrape and display (25) job titles from page (1)
job_title=browser.find_elements_by_class_name("job-card-list__title")
company_title=[]
for i in job_title:
    company_title.append(i.text)
print(company_title)
print()
print(len(company_title))

# This will scrape and display (25) company names from page (1) - correspondent to company_title above
job_company=browser.find_elements_by_class_name("job-card-container__company-name")
company_name=[]
for i in job_company:
    company_name.append(i.text)
print(company_name)
print()
print(len(company_name))

# This will scrape and display (25) location names from page (1) - correspondent to company_title and company_name above
job_location=browser.find_elements_by_class_name("job-card-container__metadata-item")
location_name=[]
for i in job_location:
    location_name.append(i.text)
print(location_name)
print()
print(len(location_name))

# At this point, I am trying to iterate over each of the (25) jobs to pull out the description.
# I've successfully been able to pull out (1) description, but haven't been able to pull out
# the other descriptions of the remaining (24) jobs.
job_description=browser.find_elements_by_class_name('jobs-search__right-rail')
description_name = []
for i in job_description:
    description_name.append(i.text)
print(description_name)
print()
print(len(description_name))

The problem is related to how the page loads. Each time you click a new job container, it sends a different GET request to the server.

This link, by default, has the first job selected.    
https://www.linkedin.com/jobs/search/?keywords=software%20developer

When you click another page, it changes the job id. 
Example: 
https://www.linkedin.com/jobs/search/?currentJobId=2512009247&keywords=software%20developer
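As a quick illustration of that URL structure (a sketch using only the Python standard library, nothing LinkedIn-specific beyond the link shown above), the job id can be read back out of such a link:

from urllib.parse import urlparse, parse_qs

url = "https://www.linkedin.com/jobs/search/?currentJobId=2512009247&keywords=software%20developer"
params = parse_qs(urlparse(url).query)
print(params["currentJobId"][0])  # "2512009247"
print(params["keywords"][0])      # "software developer"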

So you can either simulate a click on each container (see the sketch after the function below), or change the currentJobId by scraping the id from the page and reloading the page with the new link.

# Example of scraping the currentJobId for each item. find_elements_by_class_name
# accepts only a single class name, so match on the card's primary class and read
# its data-job-id attribute.
job_containers = browser.find_elements_by_class_name('job-card-container')
job_ids = []
for job_container in job_containers:
    job_ids.append(job_container.get_attribute("data-job-id"))

Function to get the descriptions

def get_descriptions(browser, job_ids):
    # Reload the search page once per job id; the right rail then holds that job's description.
    job_descriptions = []
    for job_id in job_ids:
        browser.get(f'https://www.linkedin.com/jobs/search/?currentJobId={job_id}&keywords=software%20developer')
        job_description = browser.find_elements_by_class_name('jobs-search__right-rail')[0].text
        job_descriptions.append(job_description)

    return job_descriptions
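A hypothetical way to call this, assuming the job_ids list scraped above (the number of descriptions you get depends on how many ids were actually found):

descriptions = get_descriptions(browser, job_ids)
print(len(descriptions))       # ideally 25, one per scraped job id
print(descriptions[0][:300])   # preview of the first description

The other option mentioned above, simulating a click on each container, could look roughly like the sketch below. It assumes the same class names used earlier in this answer ('job-card-container', 'jobs-search__right-rail'), which may change whenever LinkedIn updates its markup, and it uses a fixed sleep instead of an explicit wait, so adjust as needed.

import time

def get_descriptions_by_click(browser):
    descriptions = []
    # Each visible card on the current results page.
    cards = browser.find_elements_by_class_name('job-card-container')
    for card in cards:
        card.click()   # selecting a card loads its description into the right rail
        time.sleep(2)  # crude wait for the right rail to refresh
        right_rail = browser.find_elements_by_class_name('jobs-search__right-rail')
        if right_rail:
            descriptions.append(right_rail[0].text)
    return descriptions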
