
Scraping multiple posts from a web page

I am trying to scrape every job on the page, but without success. I have tried different approaches, but none of them worked. After opening and scraping the first job, the script crashes. I don't know what I should do to move on to the next job. Could anyone help me get it working? Thanks in advance. I had to shorten the code because the site wouldn't let me post all of it (too much code).

# Part 1
from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

df = pd.DataFrame(columns=["Title", "Description", 'Job-type', 'Skills'])

for i in range(25):
    driver.get('https://www.reed.co.uk/jobs/care-jobs?pageno=' + str(i))
    jobs = []
    driver.implicitly_wait(20)

    for job in driver.find_elements_by_xpath('//*[@id="content"]/div[1]/div[3]'):
        soup = BeautifulSoup(job.get_attribute('innerHTML'), 'html.parser')
        element = WebDriverWait(driver, 50).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler")))
        element.click()

        try:
            title = soup.find("h3", class_="title").text.replace("\n", "").strip()
            print(title)
        except:
            title = 'None'

        sum_div = job.find_element_by_css_selector('#jobSection42826858 > div.row > div > header > h3 > a')
        sum_div.click()
        driver.implicitly_wait(2)

        try:
            job_desc = driver.find_element_by_css_selector('#content > div > div.col-xs-12.col-sm-12.col-md-12 > article > div > div.branded-job-details--container > div.branded-job--content > div.branded-job--description-container > div').text
            #print(job_desc)
        except:
            job_desc = 'None'

        try:
            job_type = driver.find_element_by_xpath('//*[@id="content"]/div/div[2]/article/div/div[2]/div[3]/div[2]/div/div/div[3]/div[3]/span').text
            #print(job_type)
        except:
            job_type = 'None'

        try:
            job_skills = driver.find_element_by_xpath('//*[@id="content"]/div/div[2]/article/div/div[2]/div[3]/div[6]/div[2]/ul').text
            #print(job_skills)
        except:
            job_skills = 'None'

        driver.back()
        driver.implicitly_wait(2)
        df = df.append({'Title': title, "Description": job_desc, 'Job-type': job_type, 'Skills': job_skills}, ignore_index=True)

df.to_csv(r"C:\Users\Desktop\Python\newreed.csv", index=False)

In my experience, managing Chrome with selenium is trickier than Firefox or Edge. If you don't specifically need Chrome, I would try the Firefox or Edge driver instead. I've had good luck with Edge when Chrome was giving me problems.
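For example, a minimal sketch of swapping in Firefox or Edge, keeping the Selenium 3 style used in the question (the webdriver_manager class names are assumptions based on common versions of that package):

from selenium import webdriver
from webdriver_manager.firefox import GeckoDriverManager
# from webdriver_manager.microsoft import EdgeChromiumDriverManager

# Firefox via geckodriver, downloaded automatically by webdriver_manager
driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())

# Edge alternative (uncomment the import above):
# driver = webdriver.Edge(EdgeChromiumDriverManager().install())

driver.get("https://www.reed.co.uk/jobs/care-jobs?pageno=1")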

You should avoid Selenium (it was not originally designed for web scraping). Instead, look at the F12 -> Network -> html or xhr tabs: here the job listings are already present in the initial HTML, so plain requests plus BeautifulSoup is enough.

Here is my code:

import requests as rq
from bs4 import BeautifulSoup as bs

def processPageData(soup):
    articles = soup.find_all("article")
    resultats = {}
    for article in articles:
        # article ids look like "jobSection42826858"; strip the
        # 10-character "jobSection" prefix to keep the bare job id
        job_id = article["id"][10:]
        resultats[job_id] = {}

        metadata = article.find("div", class_="metadata")
        location = metadata.find("li", class_="location").text.strip().split('\n')
        resultats[job_id]['location'] = list(map(str.strip, location))
        resultats[job_id]['salary'] = metadata.find("li", class_="salary").text

        resultats[job_id]['description'] = article.find("div", class_="description").find("p").text
        resultats[job_id]['posted_by'] = article.find("div", class_="posted-by").text.strip()

    return resultats

Then iterate over the listing pages, calling the function above on each:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0",
           "Host": "www.reed.co.uk"}

resultats = {}
s = rq.session()

for i in range(1, 10):
    url = "https://www.reed.co.uk/jobs/care-jobs?pageno=%d" % i
    resp = s.get(url, headers=headers)
    soup = bs(resp.text, "lxml")
    resultats.update(processPageData(soup))

Which gives:

{'42826858': {'location': ['Horsham', 'West Sussex'],
  'salary': '£11.50 - £14.20 per hour',
  'description': 'Come and join the team as a Care Assistant and make the Alina Homecare difference. We are looking for kind and caring people who want to make a difference to the lives of others. If you have a caring attitude and willingness to make a difference, come...',
  'posted_by': 'Posted Today by Alina Homecare'},

 '42827040': {'location': ['Redhill', 'Surrey'],
  'salary': '£11.00 - £13.00 per hour',
  'description': 'Come and join the team as a Care Assistant and make the Alina Homecare difference. We are looking for kind and caring people who want to make a difference to the lives of others. If you have a caring attitude and willingness to make a difference, come...',
  'posted_by': 'Posted Today by Alina Homecare'},

....
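Since the original goal was a CSV, the nested dictionary converts directly (a small sketch assuming pandas is installed; the output filename is arbitrary):

import pandas as pd

# keys of resultats become the index, inner dicts become columns
df = pd.DataFrame.from_dict(resultats, orient="index")
df.index.name = "job_id"
df.to_csv("reed_care_jobs.csv")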

Note 1: the keys of resultats are the job identifiers; you can use them to navigate to a job's own page if you need more details.
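For instance, a sketch of following an article through to its detail page; it assumes each article's <h3 class="title"> wraps a relative link, which matches the selectors visible in the question's code:

def jobUrl(article):
    # the title heading of each listing wraps the link to the job page
    link = article.find("h3", class_="title").find("a")
    return "https://www.reed.co.uk" + link["href"]

# usage: fetch the detail page of the first article on a listing page
article = soup.find("article")
detail_soup = bs(s.get(jobUrl(article), headers=headers).text, "lxml")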

Note 2: I iterate over pages 1 to 9 here (range(1, 10)); you could adapt the code to discover the maximum number of pages instead.
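One way to avoid hard-coding the page count is to keep requesting pages until one comes back with no job articles; this sketch assumes reed.co.uk serves an article-free page past the last result, which I have not verified:

resultats = {}
i = 1
while True:
    resp = s.get("https://www.reed.co.uk/jobs/care-jobs?pageno=%d" % i, headers=headers)
    page = processPageData(bs(resp.text, "lxml"))
    if not page:  # no <article> elements left: we ran past the last page
        break
    resultats.update(page)
    i += 1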

Note 3: as general advice, invest in understanding the site's data model rather than proceeding by trial and error, let alone reaching for selenium the wrong way.

Note 4: CSS and XPath selectors copied from the browser are ugly; prefer cleaner selection by tag. (Personal opinion.)
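To illustrate with selectors from this very page, compare the XPath from the question with the tag-based lookup used above; the two lines target different fields and are only meant to contrast the styles:

# brittle: breaks as soon as the layout shifts by one <div>
driver.find_element_by_xpath('//*[@id="content"]/div/div[2]/article/div/div[2]/div[3]/div[2]/div/div/div[3]/div[3]/span')

# anchored to meaning rather than layout
article.find("li", class_="salary").text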

