
Web Scraping LinkedIn with BeautifulSoup - can't scrape from specific job page, given in the href on job listings page

I only recently started coding, and I'm trying to build a LinkedIn web scraper. Scraping the job listings page works fine, but after that I can't scrape the individual job pages on LinkedIn. Specifically, I want to get the number of applicants shown on each job page. When I use the href I obtained for the job URL, I can't seem to scrape anything from it. Any help would be much appreciated.

import csv
import requests
from bs4 import BeautifulSoup
  
file = open('linkedin-jobs53.csv', 'a', newline='')
writer = csv.writer(file)
writer.writerow(['Title', 'Company', 'Location', 'Salary', 'List Date', 'Early Applicant?', 'Job Link', 'No of Applicants'])

# Skills and Place of Work etc.
skill = input('Enter Job or Skill: ').strip()
place = input('Enter the Location: ').strip()
onsiteinput = input('On-site, Hybrid, Remote or All?: ').lower().strip()
# Map the workplace-type input to LinkedIn's f_WT filter parameter
if onsiteinput == 'all':
    onsite = 'f_WT=1%2C2%2C3'
elif onsiteinput in ('on-site', 'onsite', 'on site'):
    onsite = 'f_WT=1'
elif onsiteinput == 'hybrid':
    onsite = 'f_WT=2'
elif onsiteinput == 'remote':
    onsite = 'f_WT=3'
else:
    onsite = 'f_WT=1%2C2%2C3'  # default to all so onsite is never undefined

def jobdetails(job_link):
    # Likely cause of the empty result: without a browser-like User-Agent
    # header LinkedIn tends to reject unauthenticated requests, and the class
    # names below belong to the logged-in UI rather than the guest page this
    # request actually returns, so find_all() matches nothing.
    no_of_applicants = ''
    response2 = requests.get(job_link)
    soup2 = BeautifulSoup(response2.content, 'html.parser')
    job_details = soup2.find_all('div', class_='job-view-layout jobs-details')
    for detail in job_details:
        no_of_applicants = detail.find('span', class_='jobs-unified-top-card__applicant-count').text.strip()
        break  # first match is enough
    return no_of_applicants
    #writer.writerow([None,None,None,None,no_of_applicants])
        
def linkedin_scraper(webpage, page_number):
    no_of_applicants = ''
    next_page = webpage + str(page_number)
    print(next_page)
    response = requests.get(next_page)
    soup = BeautifulSoup(response.content,'html.parser')
  
    jobs = soup.find_all('div', class_='base-card relative w-full hover:no-underline focus:no-underline base-card--link base-search-card base-search-card--link job-search-card')
    for job in jobs:
        job_title = job.find('h3', class_='base-search-card__title').text.strip()
        job_company = job.find('h4', class_='base-search-card__subtitle').text.strip()
        job_location = job.find('span', class_='job-search-card__location').text.strip()
        job_link = job.find('a', class_='base-card__full-link')['href']
        
        try:
            salary = job.find('span', class_='job-search-card__salary-info').text.strip()
        except AttributeError:
            salary = 'no info'

        try:
            list_date = job.find('time', class_='job-search-card__listdate').text.strip()
        except AttributeError:
            list_date = 'no info'

        try:
            early_applicant = job.find('span', class_='result-benefits__text').text.strip()
        except AttributeError:
            early_applicant = 'no info'
            
        no_of_applicants = jobdetails(job_link)  # capture the applicant count for the CSV row
        
        writer.writerow([job_title,job_company,job_location,salary,list_date,early_applicant,job_link, no_of_applicants])
   
    print('Data updated')
    if page_number <= 0:    # raise this bound to scrape more pages; the start offset grows by 25 per page
        #file.close() # add when testing/using API
        page_number = page_number + 25 # remove when testing/using API
        linkedin_scraper(webpage, page_number) # remove when testing/using API
    else:
        file.close()
        print('File closed')
  
#without geoid
linkedin_scraper('https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?' + onsite + '&keywords=' + skill + '&location=' + place + '&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=', 0)
#linkedin_scraper('http://api.scraperapi.com?api_key=dc8204963a3815a4bcbc7d58c3fc8c67&url=https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=' + skill + '&location=' + place + '&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=', 0)

#with geoid, no input though
#linkedin_scraper('https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Product%20Management&location=San%20Francisco%20Bay%20Area&geoId=90000084&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=', 0)
#linkedin_scraper('http://api.scraperapi.com?api_key=dc8204963a3815a4bcbc7d58c3fc8c67&url=https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Product%20Management&location=San%20Francisco%20Bay%20Area&geoId=90000084&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=', 0)
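
As a likely fix, here is a minimal sketch of reading the applicant count from the public (guest) job page instead. Both the User-Agent string and the num-applicants__caption class are assumptions about LinkedIn's current guest markup, not guaranteed to match what the site serves:

import requests
from bs4 import BeautifulSoup

# Assumption: LinkedIn tends to reject requests without a browser-like User-Agent
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def guest_applicant_count(job_link):
    response = requests.get(job_link, headers=HEADERS)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Assumption: the guest page renders the count in a 'num-applicants__caption'
    # element, e.g. "Be among the first 25 applicants"
    caption = soup.find(class_='num-applicants__caption')
    return caption.text.strip() if caption else 'no info'

If this replaces jobdetails(), the call in linkedin_scraper() should still capture the return value so the 'No of Applicants' column actually gets filled.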

I can't quite understand your question or what the issue is here. Could you make it a bit clearer?

But I found that if you want to change pages, you can add &start=25 to the end of the URL to get the second page of the job listings, and &start=50 for the third page, so the offset is (page number - 1) * 25; for page 1 you just put nothing at the end of the URL.
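
A minimal sketch of that offset arithmetic (the helper name page_url is made up for illustration; base_url would be the jobs-guest search URL from the question):

def page_url(base_url, page_number):
    # Each page of results is offset by 25: (page_number - 1) * 25
    start = (page_number - 1) * 25
    if start == 0:
        return base_url  # page 1 needs no &start parameter
    return base_url + '&start=' + str(start)

# page_url(base, 2) -> base + '&start=25'
# page_url(base, 3) -> base + '&start=50'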
