
Web scraping LinkedIn job posts using Selenium gives repeated or empty results

I am trying to get the job post data from LinkedIn using Selenium for a practice project.

I get the list of job card elements and their job IDs, click on each card to load the job post, and then read the job details.

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException


login_page_link = 'https://www.linkedin.com/login'
search_page_link = 'https://www.linkedin.com/jobs/search/?geoId=101452733&keywords=data%20analyst&location=Australia&refresh=true'
# ids/xpath of the login form fields and submit button
username_id = 'username'
password_id = 'password'
login_btn_xpath = '//button[@type="submit"]'
# class names of the job card and job detail elements
job_list_item_class = 'jobs-search-results__list-item'
job_title_class = 'jobs-unified-top-card__job-title'
company_name_class = 'jobs-unified-top-card__company-name'


def get_browser_driver():
    browser = webdriver.Chrome()
    # maximise browser window
    browser.maximize_window()
    return browser

def login_to_linkedin(browser):
    browser.get(login_page_link)
    # enter login credentials
    browser.find_element(by=By.ID, value=username_id).send_keys("username@mail.com")
    browser.find_element(by=By.ID, value=password_id).send_keys("pwd")
    login_btn = browser.find_element(by=By.XPATH, value=login_btn_xpath)
    # attempt login
    login_btn.click()
    # wait till new page is loaded
    time.sleep(2)

def get_job_post_data(browser):
    # list to store job posts
    job_posts = []
    
    # get the search results list
    job_cards = browser.find_elements(by=By.CLASS_NAME, value=job_list_item_class)

    for job_card in job_cards:
        job_id = job_card.get_attribute('data-occludable-job-id')

        # dict to store each job post
        job_dict = {}
        # scroll job post into view
        browser.execute_script("arguments[0].scrollIntoView();", job_card)
        # click to load each job post
        job_card.click()
        time.sleep(5)

        # get elements from job post by css selector
        job_dict['Job ID'] = job_id
        job_dict['Job title'] = get_element_text_by_classname(browser, job_title_class)
        job_dict['Company name'] = get_element_text_by_classname(browser, company_name_class)

        job_posts.append(job_dict)
    return job_posts

def get_element_text_by_classname(browser, class_name):
    return browser.find_element(by=By.CLASS_NAME, value=class_name).text


browser = get_browser_driver()
login_to_linkedin(browser)
load_search_results_page(browser)
jobs_list = get_job_post_data(browser)
jobs_df = pd.DataFrame(jobs_list)

When I try to scrape all the job posts on the page, I get repeated (duplicated) and empty results, as shown in the screenshots below. The job ID keeps changing as expected, but the job details are randomly duplicated.

[screenshots: Result-1, Result-2, Result-3]

I would be very thankful for any ideas as to why this is happening and how to fix it.

It is worth noting that scraping LinkedIn data is against the company's terms of service, which may result in your account being blocked.

That being said, check these common things:

  1. The for loop in the get_job_post_data function is iterating through the same set of job cards multiple times, resulting in duplicate results.

  2. The job_cards variable is not being initialized correctly, resulting in an empty list.

  3. The load_search_results_page function is not defined in the code.

  4. The get_element_text_by_classname function may not be able to find the element by class name, leading to empty results (a guarded version is sketched just after this list).
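
For point 4, here is a minimal sketch of a guarded lookup. It keeps the same signature as the question's helper but returns None instead of raising, so missing elements show up as empty cells in the final DataFrame rather than crashing the run:

from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

def get_element_text_by_classname(browser, class_name):
    # return None when the element is missing instead of raising,
    # so gaps are visible in the final results
    try:
        return browser.find_element(By.CLASS_NAME, class_name).text
    except NoSuchElementException:
        return None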

And finally: due to LinkedIn's dynamic page rendering and anti-scraping measures, the elements you are trying to scrape may simply not be loaded yet when you read them (as stated initially).
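
One way to rule that out for the list itself is to wait until the job cards are actually present before reading them. A minimal sketch, reusing the class name from the question's job_list_item_class (assuming that locator is still current):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_loaded_job_cards(browser, timeout=10):
    # block until at least one job card is attached to the DOM;
    # raises TimeoutException if none appear within `timeout` seconds
    return WebDriverWait(browser, timeout).until(
        EC.presence_of_all_elements_located(
            (By.CLASS_NAME, 'jobs-search-results__list-item')))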

When you click on a card on the left side, the info on the right side takes some time to load, so to be sure to get it before moving on to the next card, you have to wait for it to become visible. We can do this, again, with WebDriverWait together with expected_conditions.

Notice that the IDs, titles and company names are already visible in the list on the left side, so we can get them with list comprehensions.

import time
import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# `driver` is a logged-in webdriver that has already opened the search results page

# load all cards: the list is lazy-loaded, so keep scrolling the last
# card into view until all 25 results on the page are rendered
cards = []
while len(cards) < 25:
    cards = driver.find_elements(By.CSS_SELECTOR, '.job-card-container')
    driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', cards[-1])
    time.sleep(0.5)

ids = [card.get_attribute('data-job-id') for card in cards]
titles = [title.text for title in driver.find_elements(By.CSS_SELECTOR, '.job-card-list__title')]
companies = [company.text for company in driver.find_elements(By.CSS_SELECTOR, '.job-card-container__company-name')]

job_time_experience, job_employees_sector, job_description = [], [], []
for card in cards:
    driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', card)
    time.sleep(0.5)
    card.click()
    # WebDriverWait(driver, 10) waits at most 10 seconds for each element to become visible
    job_time_experience.append(WebDriverWait(driver, 10).until(EC.visibility_of_element_located(
        (By.CSS_SELECTOR, 'li.jobs-unified-top-card__job-insight:nth-child(1)'))).text)
    job_employees_sector.append(WebDriverWait(driver, 10).until(EC.visibility_of_element_located(
        (By.CSS_SELECTOR, 'li.jobs-unified-top-card__job-insight:nth-child(2)'))).text)
    job_description.append(WebDriverWait(driver, 10).until(EC.visibility_of_element_located(
        (By.CSS_SELECTOR, 'div.jobs-description'))).text)

# build the results DataFrame
pd.DataFrame({'Job ID': ids, 'Job title': titles, 'Company name': companies,
              'Time & Exp': job_time_experience, 'Employees & Sector': job_employees_sector,
              'Description': job_description})

Output:

[screenshot of the resulting DataFrame]
